NESS (in the wild)

I’m at NESS right now. Two people I was expecting to see here both decided they weren’t coming because the weather was too nice. They are much smarter than I am.

Cheers.

Correlation of the week (in the wild)

This week’s correlation versus causation debacle comes in the form of a letter to the editor from Lee Bradshaw and is brought to you by The Bernoulli Trial Blog.

Here is the original article.

Cheers.

Bayesian Marlins (in the wild)

I was watching a baseball game last night and an announcer, after noting that the Marlins started the season 11-1, jokingly commented that they were on pace to win 149 games that year (11/12 × 162 = 148.5).

Anyone who knows anything about baseball knows that there is no chance that the Marlins will win 149 games. The most wins ever in a 162-game season is 116, by the 2001 Seattle Mariners (they didn’t win the World Series).

This got me thinking about a good way to estimate how many games they will actually win this year, which leads nicely into a conversation about frequentist vs. Bayesian estimation.

A frequentist would estimate p (the Marlins’ winning percentage) as 11/12 ≈ .9167.

Using the Clopper-Pearson interval, a 95% confidence interval for the winning percentage is (.6152, .9979). This would predict the Marlins will win between 100 and 162 games this year. I guarantee this will not happen. (Well, I suppose it’s possible that they win 100 games, but if you want to bet against me my question for you is: “How much can I bet?”)
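As a sanity check, here is a quick sketch of the Clopper-Pearson calculation, assuming scipy is available. The exact interval comes from inverting binomial tail probabilities, which reduces to quantiles of beta distributions:

```python
from scipy.stats import beta

n, x = 12, 11      # games played, games won
alpha = 0.05

# Clopper-Pearson (exact) limits via beta quantiles:
# lower from Beta(x, n - x + 1), upper from Beta(x + 1, n - x)
lower = beta.ppf(alpha / 2, x, n - x + 1)        # Beta(11, 2)
upper = beta.ppf(1 - alpha / 2, x + 1, n - x)    # Beta(12, 1)

print(lower, upper)                 # interval for the winning percentage
print(162 * lower, 162 * upper)     # implied season win totals
```

Note how wide the interval is with only 12 games of data; that width is the whole point of the comparison that follows.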

Now let’s do the same thing from a Bayesian approach. Since I do have prior information about teams’ winning percentages over entire seasons, I should probably use that information. I know that the most wins a team has ever had in a season is 116 (by both the 1906 Cubs and the 2001 Mariners) and the fewest wins in a 162-game season is 40 (by the 1962 Mets). That means the worst winning percentage ever was .247 and the best ever was .763 (the Cubs went 116-36). So we know that teams’ winning percentages are centered around .500, and we want a distribution such that three standard deviations above .5 is approximately .75 and three standard deviations below the mean is approximately .25. This leads nicely to a Beta(17.4, 17.4). This distribution is symmetric about .5, and its 1st percentile is .309 and its 99th percentile is .691.
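Here is a sketch of where a prior like this comes from, assuming scipy is available. A symmetric Beta(a, a) has variance 1/(4(2a + 1)), so we can solve for the a whose ±3 standard deviation range is roughly (.25, .75) and then check the percentiles:

```python
from scipy.stats import beta

# Want 3 sd ≈ .25 on either side of .5, i.e. sd ≈ .0833.
# From Var[Beta(a, a)] = 1 / (4(2a + 1)):  a = (1 / (4 sd^2) - 1) / 2
target_sd = 0.25 / 3
a = (1 / (4 * target_sd**2) - 1) / 2
print(a)                          # ≈ 17.5, close to the Beta(17.4, 17.4) used here

prior = beta(17.4, 17.4)
print(prior.std())                # ≈ .084
print(prior.ppf([0.01, 0.99]))   # ≈ (.309, .691)
```

So Beta(17.4, 17.4) isn’t magic; it’s just the symmetric beta whose spread matches the historical range of season-long winning percentages.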

So using a Beta(17.4, 17.4) prior with a binomial likelihood (n = 12, x = 11), we get a posterior distribution of Beta(28.4, 18.4).

So our Bayesian estimate of the Marlins’ winning percentage is .607, the mean of the posterior distribution. (This used to say .615, but Stan Devia pointed out that it should be .607.) We also have a 95% credible interval of (.465, .740) (though this is an equal-tailed interval, not an HPD).
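The conjugate update is short enough to show in full; this sketch (again assuming scipy) reproduces the posterior mean and the equal-tailed credible interval:

```python
from scipy.stats import beta

a0 = b0 = 17.4                  # prior: Beta(17.4, 17.4)
n, x = 12, 11                   # data: 11 wins in 12 games

# Beta-binomial conjugacy: posterior is Beta(a0 + x, b0 + n - x)
post = beta(a0 + x, b0 + n - x)     # Beta(28.4, 18.4)

post_mean = post.mean()                  # ≈ .607
lo, hi = post.ppf([0.025, 0.975])        # equal-tailed 95% credible interval
print(post_mean, (lo, hi))
print(162 * lo, 162 * hi)                # implied win totals, roughly 75 to 120
```

The posterior mean .607 sits between the prior mean (.500) and the sample proportion (.917), weighted by their relative information, which is exactly the shrinkage the announcer’s joke was missing.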

[Figure: the prior (in green) and the posterior (in black) distributions.]

This predicts that the Marlins will win between 75 and 120 games, a much more reasonable estimate.

Some philosophy: If this were the first baseball season ever, the frequentist approach might make sense, since we would have no prior information. However, we have over 100 years of baseball seasons to draw from, so why not use that information? We know that the Marlins cannot possibly (well, it’s nearly impossible) win 149 games, so why should that extra (prior) information just be thrown out? I don’t think it should. Also, as the season moves on, the frequentist and Bayesian estimates will get closer and closer, as the data begins to dominate the prior.

A frequentist would question my choice of prior, saying that you could choose any prior. That’s true, but my choice is reasonable given the prior information that I have. So any “reasonable” prior seems acceptable.

You can argue this both ways (frequentist or Bayesian). I prefer the Bayesian approach here, but frequentist methods still have merit. I’m a little bit bothered when people take a stance one way or the other about being Bayesian or frequentist, since they are both valid. And, like the answer to most questions in life, the answer to which technique to use in a given situation is: It depends.

Cheers.


More Census (in the wild)

Here is an NPR radio segment about the upcoming 2010 census.

I especially like this comment about the piece:
“It is always disheartening to witness otherwise intelligent, capable leaders engage in ideology induced idiocy as Mr. Gingrich did while talking about sampling and statistical methods.

Statistical procedures, which include simple frequency counts, can be manipulated or misinterpreted by the people who use them. The key is to have people use the tools at our disposal fairly in order to get an accurate count in 2010. But Mr. Gingrich voiced a distrust that the mathematical procedures are valid – a point that is unjustified scientifically but ideologically convenient.

Congress (and Mr. Gingrich himself) have benefited from the same “theoretical” statistical modeling and research sampling he opposes for this situation. Billions of dollars are appropriated by Congress on the trust that pilot studies, exploratory research, and statistical probabilities can be trusted – including defense projects, engineering projects, and everything NASA has ever done. More immediately, I’m sure that Mr. Gingrich takes medicines tested during research studies, not tested on everyone in the country. He trusts that the sampling is adequate in these cases… yet not appropriate for determining the population of the country.

Sad logic.”
Mike Sullivan

Well said.

Cheers.

Eugenics (in the wild)

As my fourth semester of grad school draws to a close, I am in the lab writing a final project for my Linear Models 2 class about a paper published by Patterson and Thompson called “Recovery of Inter-block Information when Block Sizes are Unequal.” Pretty dry stuff. But one of the citations in this article is for an article written by Nelder in 1968 (The combination of information in generally balanced designs). Then there are several citations in that article to articles written by another big name, Yates. He published three articles about incomplete blocking designs between 1936 and 1940. The journal on the citation is Ann. Eugen. I was searching for this journal on the internet, and I couldn’t find any current issues. It seemed to have stopped publishing around 1950.

The reason? The name of the journal was Annals of Eugenics. I thought nothing of this at first, then I thought, “Wait. Is Eugenics what I think it is?”. It was. (If you don’t know click here.)

Crazy. So then I looked up the history of this journal. Apparently, it was started by Karl Pearson. Yeah, Karl Pearson. The guy who is famous for Pearson’s correlation coefficient and many, many other statistical methods still in use today. This is the same guy who founded the journal Biometrika, which is one of the most prestigious statistics journals today. As one of my professors would say, “He is a god in statistics.”

That led me to this great blog post (from a great blog, The Lay Scientist) about the journal, Annals of Eugenics. Please read this. This is crazy. Eugenics was a well respected area of science less than 100 years ago.

I am rapidly becoming more interested in the history of statistics than in actual statistics.

Anyway,
Cheers.

Graphical Display of Movies (in the wild)

Here is a graphical representation of box office receipts for the movies from 1986 through 2008.

This is a great display of data. We can see how much money individual movies made, as well as the length of time they were in theaters. We can also easily see the overall trend of the movie industry over time and seasonally.

Cheers.

Sampling, Politics, and the Census (in the wild (Part II))

I’ve recently become fascinated by the politics of the Census, and, with a Census coming up in 2010, I think it’s a perfect topic for StatsInTheWild.

Politics and the Census have been an issue for decades. Here is a piece from the New York Times from August 1909. In it, they cite a letter from President Taft to the Secretary of Commerce and Labor, requesting that politics be removed from the process of the Census. However, it looks as if Taft’s pleas for a non-partisan Census are not being heeded by today’s politicians.

Recently, I posted an article about Obama nominating a new head of the Census who is an expert in sampling, which has ruffled Republican feathers. Why would Republicans be against sampling, a process that makes the Census more accurate and, ultimately, less expensive? Well, here is a very good article from 1998 explaining the Republican opposition to using sampling techniques for the Census.

The basic gist of the article is that government officials from the Clinton administration wanted to use sampling methods to account for the traditional undercount of minorities. Republicans likely blocked this from happening because the people the Census tends to miss are people who tend to vote Democrat. Republicans (led by former speaker Newt Gingrich and Bob Barr) argued that sampling was unconstitutional and was a violation of the Census Act. Their argument held up in court, and no sampling was allowed to be used for the allocation of federal funds or redistricting.

Here is a good quote from the article if you don’t want to read the whole thing:
“The Bureau wants to do better. By using sophisticated data-gathering and statistical-sampling techniques to augment the direct count, it believes it can reduce the total and differential undercounts and save money to boot. The National Academy of Sciences agrees, as does about every statistician worth his or her salt. In 1990, the Census Bureau thought such sampling was the way to go, but Republican officials in the Bush Administration overruled the Bureau’s experts. The courts refused to intervene.”

Cheers.

Sampling, Politics, and the Census (in the wild)

Obama to nominate sampling expert to head census

Almost last NCAA picks (in the wild)

I only got one of the Final Four teams this year, but that team is UNC and they look like they have a good shot to win it. It’s never a total loss when your champion pick is still alive.

My bracket might stink, but my “bets” have been good. For the tournament I am 8-5 with a 19.82% return per bet. Here are 4 more.

MSU +160
Villanova +280

MSU +650 to win championship
Villanova +700 to win championship

**********
After the championship game, I finished 9-8 with a 9.05% return per bet. As pointed out in the comments, this is a very small sample size. I agree. So I want to make it clear that I am not using this as evidence that I can “beat” the sports book. I am merely reporting my results.

I think I can beat the sports books, but I’d need to be profitable over several hundred bets to begin to approach significant evidence that I am winning consistently.

Thanks for the comment.

Cheers.

Made Up Statistics (in the wild)

Here is an interesting article from the UK:
“Numbers up: The truth about statistics”

And in the spirit of correlation of the week, here is a good excerpt from that article:

“Toothless post-menopausal women are three times as prone to hypertension as those with teeth”

This news, reported in the respected journal Hypertension, might have led to queues of denture-wearing women of a certain age at GPs’ surgeries. A study by Japanese researchers from Hiroshima University, published in 2004, suggested that tooth loss in post-menopausal women was directly linked to high blood pressure, which can increase the risk of heart disease or strokes.

But a look past the headlines revealed a problem: the scientists based the conclusion on a study of just 98 post-menopausal women – 67 with missing teeth, and 31 with their gnashers intact. In statistical terms, that is an almost insignificant sample size.

The problem is that the apparent cause of a link can sometimes be pure chance. The smaller the sample, the more likely this becomes. One statistician famously managed to find a statistically significant correlation (in a small enough sample) between birth rates in various European countries and the stork population, suggesting the birds therefore really do deliver babies.

McConway’s verdict: “There’s no standard minimum group size for statistical studies – it depends what you’re measuring. If it’s something that doesn’t vary much – say, blood pressure in elite athletes – you could get away with a smaller group. But for something like this, you need a much larger sample.”

Cheers.