Bayesian Marlins (in the wild)
I was watching a baseball game last night and an announcer, after noting that the Marlins started the season 11-1, jokingly commented that they were on pace to win 149 games that year (11/12*162=148.5).
Anyone who knows anything about baseball knows that there is no chance that the Marlins will win 149 games. The most wins ever in a 162-game season is 121 by the Seattle Mariners (they didn’t win the World Series).
So then this got me thinking about a good way to estimate how many games they will actually win this year, which leads nicely into a conversation about frequentist vs Bayesian estimation.
A frequentist would estimate p (the Marlins’ winning percentage) as 11/12 ≈ .9167.
Using the Clopper-Pearson interval, a 95% interval for the winning percentage is (.6152, .99789). This would predict the Marlins will win between 100 and 162 games this year. I guarantee this will not happen. (Well, I suppose it’s possible that they win 100 games, but if you want to bet against me my question for you is: “How much can I bet?”)
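For the curious, the Clopper-Pearson bounds can be computed exactly by inverting the binomial tail probabilities. Here is a minimal pure-Python sketch (the helper names are my own):

```python
from math import comb

def tail_ge(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def _bisect_root(is_below, lo=0.0, hi=1.0):
    # simple bisection; tail_ge is monotone increasing in p
    for _ in range(60):
        mid = (lo + hi) / 2
        if is_below(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    # lower bound solves P(X >= x | p) = alpha/2; upper solves P(X <= x | p) = alpha/2
    lower = 0.0 if x == 0 else _bisect_root(lambda p: tail_ge(n, x, p) < alpha / 2)
    upper = 1.0 if x == n else _bisect_root(lambda p: tail_ge(n, x + 1, p) < 1 - alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(11, 12)
print(round(lo, 4), round(hi, 5))        # → 0.6152 0.99789
print(round(162 * lo), round(162 * hi))  # → 100 162
```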
Now let’s do the same thing but from a Bayesian approach. Since I do have prior information about teams’ winning percentages over an entire year, I should probably use that information. I know that the most wins a team has ever had in a season is 121 and the fewest wins in a 162-game season is 40 (by the 1962 Mets). That means the worst-ever winning percentage was .247 and the best-ever was .747. So we know that teams’ winning percentages generally center around .5. We also want a distribution such that three standard deviations above .5 is approximately .747 and three standard deviations below it is approximately .247. This leads nicely to a Beta(17.4, 17.4). This distribution is symmetric about .5, and the 1st percentile of this distribution is .309 and the 99th percentile is .691.
So using a Beta(17.4,17.4) prior and having the likelihood be binomial with n=12 and x=11 we get a posterior distribution of Beta(28.4,18.4).
So our Bayesian estimate of the Marlins’ winning percentage is .607 (the mean of the posterior distribution). (This used to say .615, but Stan Devia pointed out that it should be .607.) We also have a 95% credible interval of (.465, .740) (although this is not an HPD interval).
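The whole conjugate update can be checked numerically. Here is a pure-Python sketch (no stats libraries; the quantile routine is a simple grid approximation of my own):

```python
import math
from bisect import bisect_left

def beta_quantiles(a, b, probs, n_grid=100_000):
    """Approximate Beta(a, b) quantiles via a grid CDF (midpoint rule)."""
    xs = [(i + 0.5) / n_grid for i in range(n_grid)]
    w = [math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)) for x in xs]
    total = sum(w)
    cdf, run = [], 0.0
    for wi in w:
        run += wi
        cdf.append(run / total)
    return [xs[bisect_left(cdf, p)] for p in probs]

# prior: Beta(17.4, 17.4) -- check the percentiles quoted above
p01, p99 = beta_quantiles(17.4, 17.4, [0.01, 0.99])      # ≈ .309 and .691

# conjugate update: 11 wins, 1 loss
a_post, b_post = 17.4 + 11, 17.4 + 1                     # Beta(28.4, 18.4)
mean = a_post / (a_post + b_post)                        # ≈ .607
lo, hi = beta_quantiles(a_post, b_post, [0.025, 0.975])  # ≈ (.465, .740)
print(round(mean, 3), round(162 * lo), round(162 * hi))  # ≈ 0.607, then 75 and 120 wins
```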

Above is a picture of the prior (in green) and the posterior (in black) distributions.
This predicts that the Marlins will win between 75 and 120 games, a much more reasonable estimate.
Some philosophy: If this were the first baseball season ever, the frequentist approach might make sense since we would have no prior information. However, we have over 100 years of baseball seasons to draw from, so why not use that information? We know that the Marlins cannot possibly (well, it’s nearly impossible) win 149 games, so why should that extra (prior) information just be thrown out? I don’t think it should. Also, as the season moves on, the frequentist and Bayesian estimates will get closer and closer, as the data begin to dominate the prior.
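To see the data swamping the prior, here is a toy calculation (the continued 11-of-12 win rate is purely hypothetical; it just holds the sample proportion fixed while n grows):

```python
# With a conjugate Beta(a0, b0) prior, the posterior mean is
# (a0 + wins) / (a0 + b0 + games): a weighted average of the
# prior mean and the sample proportion.
a0 = b0 = 17.4  # the prior above

def posterior_mean(wins, games):
    return (a0 + wins) / (a0 + b0 + games)

# posterior means: 0.607, 0.764, 0.843, 0.908 -> creeping toward 11/12 ≈ 0.917
for games in (12, 60, 162, 1620):
    wins = games * 11 / 12
    print(games, round(posterior_mean(wins, games), 3))
```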
A frequentist would question my choice of prior saying that you could choose any prior. That’s true, but my choice is reasonable based on the prior information that I have. So any “reasonable” prior seems acceptable.
You can argue this both ways (frequentist or Bayesian). I prefer the Bayesian approach here, but frequentist methods still have merit. I’m a little bit bothered when people take a stance one way or the other about being Bayesian or frequentist, since they are both valid. And, like the answer to most questions in life, the answer to which technique to use in a given situation is: It depends.
Cheers.

Eugenics in the wild
As my fourth semester of grad school draws to a close, I am in the lab writing a final project for my Linear Models 2 class about a paper published by Patterson and Thompson called “Recovery of Inter-block Information when Block Sizes are Unequal.” Pretty dry stuff. But one of the citations in this article is to an article written by Nelder in 1968 (The combination of information in generally balanced designs). Then there are several citations in that article to articles written by another big name, Yates. He published three articles about incomplete blocking designs between 1936 and 1940. The journal in the citations is Ann. Eugen. I was searching for this journal on the internet, and I couldn’t find any current issues. It seemed to have stopped publishing around 1950.
The reason? The name of the journal was Annals of Eugenics. I thought nothing of this at first, then I thought, “Wait. Is Eugenics what I think it is?”. It was. (If you don’t know click here.)
Crazy. So then I looked up the history of this journal. Apparently, it was started by Karl Pearson. Yeah, Karl Pearson. The guy who is famous for Pearson’s correlation coefficient and many, many other widely used statistical methods still in use today. This is the same guy who also founded the journal Biometrika, which is one of the most prestigious statistics journals today. As one of my professors would say, “He is a god in statistics.”
That led me to this great blog post (from a great blog, The Lay Scientist) about the journal, Annals of Eugenics. Please read this. This is crazy. Eugenics was a well respected area of science less than 100 years ago.
I am rapidly becoming more and more interested in the history of statistics than I am in actual statistics.
Anyway,
Cheers.
Graphical Display of Movies (in the wild)
Here is a graphical representation of box office receipts for the movies from 1986 through 2008.
This is a great display of data. We can see how much money individual movies made, as well as the length of time they were in theaters. We can also easily see the overall trend of the movie industry over time and seasonally.
Cheers.
Almost last NCAA picks (in the wild)
I only got one of the Final Four teams this year, but that team is UNC and they look like they have a good shot to win it. It’s never a total loss when your champion pick is still alive.
My bracket might stink, but my “bets” have been good. For the tournament I am 8-5 with a 19.82% return per bet. Here are 4 more.
MSU +160
Villanova +280
MSU +650 to win championship
Villanova +700 to win championship
**********
After the championship game, I finished 9-8 with a 9.05% return per bet. As pointed out in the comments this is a very small sample size. I agree. So I want to make it clear that I am not using this as evidence that I can “beat” the sports book. I am merely reporting my results.
I think I can beat the sports books, but I’d need to be profitable over several hundred bets to begin to approach significant evidence that I am winning consistently.
Thanks for the comment.
Cheers.
Made Up Statistics in the Wild
Here is an interesting article from the UK:
“Numbers up: The truth about statistics”
And in the spirit of correlation of the week, here is a good excerpt from that article:
“Toothless post-menopausal women are three times as prone to hypertension as those with teeth”
This news, reported in the respected journal Hypertension, might have led to queues of denture-wearing women of a certain age at GPs’ surgeries. A study by Japanese researchers from Hiroshima University, published in 2004, suggested that tooth loss in post-menopausal women was directly linked to high blood pressure, which can increase the risk of heart disease or strokes.
But a look past the headlines revealed a problem: the scientists based the conclusion on a study of just 98 post-menopausal women – 67 with missing teeth, and 31 with their gnashers intact. In statistical terms, that is an almost insignificant sample size.
The problem is that the apparent cause of a link can sometimes be pure chance. The smaller the sample, the more likely this becomes. One statistician famously managed to find a statistically significant correlation (in a small enough sample) between birth rates in various European countries and the stork population, suggesting the birds therefore really do deliver babies.
McConway’s verdict: “There’s no standard minimum group size for statistical studies – it depends what you’re measuring. If it’s something that doesn’t vary much – say, blood pressure in elite athletes – you could get away with a smaller group. But for something like this, you need a much larger sample.”
Cheers.
Predator X in the Wild
I was watching the show “Predator X” on the History Channel tonight. Apparently, they discovered this fossil of an enormous aquatic predator. It’s pretty awesome.

Here is a description of the bite force of this predator: (from here)
“At St Augustine Alligator Farm and Zoological Park in Florida, Dr. Hurum assisted evolutionary biologist Dr. Greg Erickson from Florida State University in calculating the bite force of this colossal creature. The jaws held in place a set of trihedral teeth, each measuring 12 inches, which clamped down on prey with an estimated 33,000 lbs of bite force. The calculation is one of the largest bite forces ever calculated for any creature. Predator X would have had a bite force more than ten times the bite force of any animal alive today and four times the bite force of a T-Rex.”
(Here is a link of average bite forces for humans and a few selected animals.)
At this point you might be saying, “That IS awesome. But what does it have to do with this blog?” An astute observation. Well…
They estimated that the bite force of the predator was 33,000 pounds. The way they estimated this was by taking measurements of the bite force of different sized crocodiles (or alligators, I can never tell the difference). Then they plotted the data in a scatter plot: weight of crocodile versus bite force. There was a clear positive relationship between bite force and size of the animal. Then they fit a simple regression line through the data and extrapolated to estimate the bite force of a 50-foot-long animal weighing an estimated 45 tons. That’s how I believe they came up with their estimated bite force. I’ll give them the benefit of the doubt and assume they did more than that to come up with the estimate, but they didn’t want to show the details on the History Channel. (If you have details on how they estimated the bite force, please send them my way.)
Let’s assume that all they did was extrapolate this simple regression line. What would be the problem with that? The problem is that they are extrapolating the linear trend outside of their domain. There is no guarantee that the bite force trend remains linear as the weight approaches the estimated 45 tons. They collected their data on crocodiles, which weigh at most about 1.5 tons. It seems naive to assume that the linear trend will continue as you increase the weight of the animal far beyond that range.
Here is a simple example of why extrapolating outside of your domain is a bad idea.
Say you collected data on children’s ages and heights and you fit a regression line through the data. You’ll surely observe a positive relationship between age and height: as children get older, their heights generally increase. This increase can be approximated by a roughly linear trend, say, between the ages of 10 and 18. Also, say that we find that children grow an average of 1 inch per year between ages 10 and 18. If I were then to predict the height of a person by extrapolating this trend, I would conclude that a 48-year-old would be, on average, 30 inches taller than an 18-year-old. Clearly, this is not true.
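The height example can be sketched in a few lines of Python (the data are made up and perfectly linear, just to make the point):

```python
# Hypothetical average heights (inches) for ages 10-18, growing exactly 1 inch/year.
ages = list(range(10, 19))
heights = [55.0 + (a - 10) for a in ages]

# Ordinary least squares by the closed-form slope/intercept formulas.
n = len(ages)
mx, my = sum(ages) / n, sum(heights) / n
slope = sum((x - mx) * (y - my) for x, y in zip(ages, heights)) / \
        sum((x - mx) ** 2 for x in ages)
intercept = my - slope * mx

def predict(age):
    return intercept + slope * age

print(predict(18))  # 63 inches -- reasonable
print(predict(48))  # 93 inches -- a 7'9" average adult, which is absurd
```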
So just because a trend is linear over a certain domain does not mean that that linear trend continues outside of the tested domain.
Cheers.
Choosing your bracket in the wild.
Check out page 35 of the old issue of Chance magazine which has an article talking about the best way to pick NCAA basketball teams to win your office pool. Of course this probably would have been more of a help if I had posted it five days ago, but it’s still interesting. And you can use the advice next year.
Cheers.
ENAR in review in the wild
So I’m back from ENAR and back from spring break. I’ve been greeted back to grad school by a midterm on Friday night from 6-8pm. What a fun time for a midterm!
Let me first start by saying that San Antonio is awesome. The River Walk is great. I ate at two great Mexican restaurants for lunch two days in a row, and they were both incredible. And I drank Shiner Bock the whole time, which I highly recommend.
Anyway, on Sunday night I presented a poster at ENAR (they served Shiner Bock during the poster presentations) about a paper that we wrote (and recently published) about synthetic data with binary variables. I met some very interesting people who stopped by my poster. One guy who stopped by informed me that he coined the term predictive mean matching (which I referenced on my poster). So, I asked him who he was, and he told me he was Rod Little (he’s kind of a big deal). He wrote the book on multiple imputation: Little, R.J.A. & Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: John Wiley. So that was kind of neat. (I just visited Rod Little’s website, and apparently this is the “most useful of all links.”) (Here is another interesting article called “Calibrated Bayes: A Bayes/Frequentist Roadmap.”)
The next day I spoke with some people from SAS and STATA, as well as some recruiters from Smith-Hanley (who I got my first job through) and Cambridge Group. The SAS people told me about a product called JMP, which I was very impressed by. The STATA people told me that I could buy a student STATA license for about $55 and then use it commercially after I graduate (as opposed to several thousand dollars for a SAS license that only lasts a year). And I could use it for as long as I wanted to. The only thing I would have to pay for would be upgrades. So STATA has that going for them. I am definitely going to try it out.
Cheers.
March Madness in the Wild
I was sitting in my office last semester and this really tall guy walked past my door. I kind of did a double take he was so tall. A few seconds later he came back and knocked on my door. He asked if my office mate was there so he could hand in some homework. I told him that he wasn’t around, but I could take the homework and give it to him when I saw him. He thanked me and left. It was Hasheem Thabeet. So in honor of Mr. Thabeet handing in his STAT 1100 homework to me, I have started a Yahoo! tourney pick’em group.
Join the Stats in the Wild Yahoo! Tournament Challenge. Join group #166142. Good luck.
Every year in March I start to care about college basketball. I even build some half-assed models to do my bracket. I finished in the 99th percentile of all Yahoo! brackets last year by managing to pick the champion, the two finalists, and all four Final Four teams. (My goal for a long time was to pick all four of the Final Four. I’d been stuck on three out of four since high school.) Anyway, in order to make my predictions, I build several models and then I combine the results.
Here’s some thoughts on the tournament based on those models:
Most overrated teams in the tournament:
1. LSU: The SEC is a great football conference and a mediocre basketball conference (at least this year), and LSU is leading the charge of mediocre teams in this conference. LSU only wins their first-round game because they play Butler. Speaking of Butler…..
2. Butler: They lost to Wisconsin-Green Bay, Wisconsin-Milwaukee, and Loyola Chicago. Then they lost to Cleveland State in their conference tournament. Although, in their defense, they did beat IU-South Bend 87-33. They played two top-25 teams all year and went 1-1: Ohio State (who is no longer in the top 25) and Xavier (another overrated team). Speaking of Xavier…..
3. Xavier: Look. I’m a UMass fan (from the days of Roe, Camby, Dana Dingle, Donta Bright, Derek Kellogg), but the A-10 stinks. Xavier deserves to be in the tourney, but as a 4 seed? I doubt it. This is the team that closed its regular season 4-4 with losses to Duquesne, Dayton, Charlotte, and Richmond, then lost to Temple in their conference tournament. I’m not exactly inspired by the Musketeers. Florida State beats them in the second round by 10.
4. Dayton: An at large bid? So Dayton is better than St. Mary’s, Penn State, Florida, Auburn, Creighton, and Miami? That’s what the selection committee is saying.
5. Boston College: USC should dispose of them in the first round. USC by 12.
Most Underrated teams in the tournament
West Virginia: I’ve got West Virginia as a Sweet sixteen team. They are going to dismantle Dayton in the first round, then beat Kansas by 2. Then they have a good chance against Michigan State for a shot at the Elite 8.
Wisconsin: I think they got screwed with a 12 seed, and then by drawing a very good FSU 5 seed. If they can get past this game, they should cruise into the sweet sixteen past Xavier.
First round upsets: USC over BC.
Most likely 13 seed to win first round: Cleveland State
Most likely 12 seed to win first round: Arizona
Most likely 11 seed to win first round: Utah State
Most likely 10 seed to win first round: USC
Most likely 9 seed to win first round: Butler (Only because LSU stinks too.)
Biggest 3 seed blowout: Missouri
4 seed: Washington
5 seed: Illinois
6 seed: West Virginia
7 seed: Cal
8 seed: Ohio State
First number 1 seed gone: UConn
Good first round bets:
Washington -220
FSU -145
Utah -110
Illinois -200
Arizona St -200
Michigan +190 (This price is fantastic)
Texas A and M +115
Oklahoma St +115 (Seriously? They’re an underdog here?)
Ohio St -160
And finally, the official Final 4 picks of Stats in the Wild (drumroll please):
UNC, Pitt, Memphis, and Louisville, with UNC beating Memphis in the finals 76-75.
Also, I am picking San Diego State to win the NIT with a 72-67 win over Miami in the finals.
And for my boys in the NIT…..what? They’re not even in the NIT. Bring back Calipari!
Go.
Go U.
Go U – Mass.
Go Umass!
Cheers.
ENAR in the wild
I’m sitting at Bradley International Airport waiting for a flight to Raleigh/Durham. Then I’ll hop on a connector to Memphis, followed by another connector to San Antonio, Texas. What’s in San Antonio, you might ask? The Alamo? Tim Duncan? Well yes, but why would I post that on a stats blog?
The reason I am headed to the Lone Star state is for the Eastern North American Region (ENAR) (pronounced EE-NAR) of the International Biometric Society (IBS). (IBS is an unfortunate acronym in itself, but I should note that the Western North American Region (WNAR) is pronounced WEE-NAR. That is unfortunate. I am also extremely childish.) I’ll be there from tonight until Monday afternoon, so I can attend a few talks in the morning before I leave.
On Sunday night from 8-11pm, I’m giving a poster presentation about synthetic data related to the paper I just published. It’s like a student poster bonanza. And they serve alcohol. So it should be interesting.
I’d love to post the abstracts of some of the talks I am interested in attending, but Bradley’s internet connection is not letting me view the abstract page.
If Raleigh/Durham or Memphis has a better connection, I’ll post them when I get there.
Cheers.