Bayesian Marlins (in the wild)
I was watching a baseball game last night and an announcer, after noting that the Marlins started the season 11-1, jokingly commented that they were on pace to win 149 games that year (11/12*162=148.5).
Anyone who knows anything about baseball knows that there is no chance that the Marlins will win 149 games. The most wins ever in a 162-game season is 121 by the Seattle Mariners (they didn’t win the World Series).
This got me thinking about a good way to estimate how many games they will actually win this year, which leads nicely into a conversation about frequentist vs Bayesian estimation.
A frequentist would estimate p (the Marlins' winning percentage) as 11/12=.9167.
Using the Clopper-Pearson interval, a 95% confidence interval for the winning percentage is (.6152, .9979). This would predict that the Marlins will win between 100 and 162 games this year. I would bet heavily against that. (Well, I suppose it's possible that they win 100 games, but if you want to bet on them landing comfortably inside that range, my question for you is: “How much can I bet?”)
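For anyone who wants to check the interval, here is a minimal sketch of how the Clopper-Pearson bounds can be computed with nothing but the standard library. It inverts the binomial tail probabilities by bisection; the function names are my own and are only for illustration.

```python
from math import comb

def binom_tail_ge(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p); increasing in p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def solve_increasing(g, target, tol=1e-12):
    """Bisection on [0, 1] for an increasing function g with g(p) = target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided interval for a binomial proportion."""
    # Lower bound: the p at which seeing x or more wins has probability alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: binom_tail_ge(x, n, p), alpha / 2)
    # Upper bound: the p at which seeing x or fewer wins has probability alpha/2,
    # i.e. P(X >= x + 1) = 1 - alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: binom_tail_ge(x + 1, n, p), 1 - alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(11, 12)
print(f"95% interval for p: ({lower:.4f}, {upper:.4f})")
print(f"implied wins over 162 games: {lower * 162:.1f} to {upper * 162:.1f}")
```

Multiplying the bounds by 162 simply applies the interval for p to a full season, which is the same back-of-the-envelope projection the announcer made.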
Now let's do the same thing from a Bayesian approach. Since I do have prior information about teams' winning percentages over an entire season, I should probably use it. I know that the most wins a team has ever had in a season is 121 and the fewest wins in a 162-game season is 40 (by the 1962 Mets). That means the worst ever winning percentage was .247 and the best ever was .747. We also know that the average winning percentage across teams is .5, since every game has one winner and one loser. So we want a distribution centered at .5 such that three standard deviations above the mean is approximately .747 and three standard deviations below the mean is approximately .247. This leads nicely to a Beta(17.4,17.4). This distribution is symmetric about .5; its 1st percentile is .309 and its 99th percentile is .691.
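One way to arrive at a prior like this is to match moments. A symmetric Beta(a, a) has mean .5 and variance 1/(4(2a+1)), so fixing the standard deviation pins down a. The sketch below assumes a target of three standard deviations being about .25; the helper names are hypothetical. It lands on a value of about 17.5, which is close to the 17.4 used here.

```python
def symmetric_beta_param(sd):
    """Return a such that Beta(a, a) has the given standard deviation.

    Uses Var[Beta(a, a)] = 1 / (4 * (2a + 1)), solved for a.
    """
    return (1.0 / (4.0 * sd**2) - 1.0) / 2.0

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return ((a * b) / ((a + b) ** 2 * (a + b + 1))) ** 0.5

# Assumption: three standard deviations should span about .25 on each side of .5.
target_sd = 0.25 / 3
a = symmetric_beta_param(target_sd)
print(f"moment-matched prior: Beta({a:.1f}, {a:.1f})")

# Sanity check on the prior actually used in the post.
print(f"3 sd for Beta(17.4, 17.4): {3 * beta_sd(17.4, 17.4):.3f}")
```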
So using a Beta(17.4,17.4) prior and a binomial likelihood with n=12 and x=11, we get a posterior distribution of Beta(17.4+11, 17.4+1) = Beta(28.4,18.4).
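For readers who have not seen why the posterior is again a Beta, here is the standard conjugacy argument written out. The symbols \alpha, \beta (prior parameters) and \pi (prior and posterior density) are my notation, not something defined elsewhere in this post.

```latex
\begin{align*}
\pi(p) &\propto p^{\alpha-1}(1-p)^{\beta-1}, \qquad \alpha=\beta=17.4 \\
L(p \mid x) &\propto p^{x}(1-p)^{n-x}, \qquad n=12,\ x=11 \\
\pi(p \mid x) &\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1} \\
&\Rightarrow\ p \mid x \sim \mathrm{Beta}(\alpha+x,\ \beta+n-x) = \mathrm{Beta}(28.4,\ 18.4)
\end{align*}
```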
So our Bayesian estimate of the Marlins' winning percentage is .607, the mean of the posterior distribution. (This used to say .615, but Stan Devia pointed out that it should be .607.) We also have a 95% credible interval of (.465,.740), although this is not an HPD interval.
Above is a picture of the prior (in green) and the posterior (in black) distributions.
This predicts that the Marlins will win between 75 and 120 games, a much more reasonable estimate.
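The numbers above can be reproduced with a short script. This is only a sketch: it uses the standard library, approximates the Beta CDF with Simpson's rule, and finds the equal-tailed 95% limits by bisection. All function and variable names are mine, and the integration shortcut assumes both Beta parameters are greater than 1 (true here).

```python
from math import exp, lgamma, log

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p; 0 at the endpoints, which holds for a, b > 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(p) + (b - 1) * log(1 - p))

def beta_cdf(q, a, b, steps=2000):
    """Approximate P(p <= q) by Simpson's rule on [0, q]; steps must be even."""
    if q <= 0.0:
        return 0.0
    h = q / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(q, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_quantile(prob, a, b, tol=1e-7):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Prior Beta(17.4, 17.4) updated with 11 wins and 1 loss.
prior_a = prior_b = 17.4
wins, games = 11, 12
post_a = prior_a + wins
post_b = prior_b + (games - wins)

post_mean = post_a / (post_a + post_b)
ci_low = beta_quantile(0.025, post_a, post_b)
ci_high = beta_quantile(0.975, post_a, post_b)

print(f"posterior: Beta({post_a:.1f}, {post_b:.1f}), mean {post_mean:.3f}")
print(f"95% equal-tailed interval: ({ci_low:.3f}, {ci_high:.3f})")
print(f"implied wins over 162 games: {ci_low * 162:.1f} to {ci_high * 162:.1f}")
```

As in the text, the win range comes from applying the interval for the winning percentage to all 162 games.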
Some philosophy: If this were the first baseball season ever, the frequentist approach may make sense since we would have no prior information. However, we have over 100 years of baseball seasons to draw from, so why not use that information? We know that the Marlins cannot possibly (well, it’s nearly impossible) win 149 games, so why should that extra (prior) information just be thrown out? I don’t think it should. Also, as the season moves on, the frequentist estimate and the Bayesian estimate will get closer and closer, as the data begins to dominate the prior.
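To see the data overwhelming the prior, here is a small illustration. It uses a purely hypothetical scenario in which a team keeps winning 11 of every 12 games (nobody expects that), and compares the frequentist estimate with the posterior mean under the Beta(17.4,17.4) prior as more games are played. The function names are made up for this sketch.

```python
def posterior_mean(wins, games, a=17.4, b=17.4):
    """Posterior mean of p under a Beta(a, b) prior and binomial data."""
    return (a + wins) / (a + b + games)

def estimate_gap(wins, games):
    """Absolute difference between the frequentist and Bayesian estimates."""
    return abs(wins / games - posterior_mean(wins, games))

# Hypothetical records that all keep the same 11/12 winning rate.
for games in (12, 48, 96, 156):
    wins = games * 11 // 12
    print(f"{wins:3d}-{games - wins:<2d}  "
          f"frequentist {wins / games:.3f}  "
          f"Bayesian {posterior_mean(wins, games):.3f}  "
          f"gap {estimate_gap(wins, games):.3f}")
```

With the winning rate held fixed, the gap shrinks as the number of games grows, because the prior contributes the equivalent of a fixed 34.8 games while the data keeps accumulating.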
A frequentist would question my choice of prior, saying that you could choose any prior. That’s true, but my choice is reasonable based on the prior information that I have. So any “reasonable” prior seems acceptable.
You can argue this both ways (frequentist or Bayesian). I prefer the Bayesian approach here, but frequentist methods still have merit. I’m a little bit bothered when people take a stance one way or the other about being Bayesian or frequentist, since they are both valid. And, like the answer to most questions in life, the answer to which technique to use in a given situation is: It depends.