Stats in football (in the wild)
Cassel to Moss. TOUCHDOWN. With only seconds left, the Patriots had completed a drive that started deep in their own territory to pull within one point of the Jets. They kicked the extra point, went to overtime, and lost. (This after having had the Jets at 3rd and 17.) If the Patriots had gone for two after the touchdown, they could have won the game right there. So should Belichick have gone for two?
Endgame Technologies has developed a simulator for football games called ZEUS. According to its simulations, Belichick should have gone for two at the end of regulation instead of kicking the extra point and sending the game to overtime.
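You can get a feel for the trade-off with a back-of-the-envelope calculation. This is only a toy model with assumed, roughly league-average probabilities, not ZEUS's situation-specific inputs:

```python
# Toy expected-win-probability comparison. All three probabilities are
# illustrative assumptions, not ZEUS's actual inputs.
p_two_point = 0.45     # assumed chance of converting a two-point try
p_extra_point = 0.99   # assumed chance of making the extra point
p_win_overtime = 0.50  # assumed chance of winning a coin-flip overtime

# Simplification: go for two and you win iff the conversion succeeds;
# kick and you win iff the kick is good AND you then win in overtime.
print("Go for two:          ", p_two_point)
print("Kick and play for OT:", p_extra_point * p_win_overtime)
```

Under this toy model, going for two is the better call whenever the conversion probability exceeds p_extra_point * p_win_overtime, which sits just under 0.5 here, so the whole argument turns on how good your two-point play really is.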
I’ve been interested in decision making in football for a long time, especially the decision to go for two points after a touchdown instead of kicking the extra point. “Refining the Point(s)-after-touchdown Decision” by Harold Sackrowitz is an excellent article on the subject; in it he develops a table for when to go for two or kick the extra point in order to maximize a team’s chances of winning.
More recently, an article by David Romer, “Do firms maximize? Evidence from professional football,” investigates NFL teams’ decisions about going for it on fourth down. He argues that NFL teams kick (punt and attempt field goals) too often and that they would increase their probability of winning by going for it on fourth down more often.
And here is a guest post by Ian Ayres on the Freakonomics blog asking the question “Why don’t sports teams use randomization?”
Cheers.
Stats in Baseball: Part 1 (in the wild)
I like stats. I also like baseball. So what could I love more than baseball statistics?
Bill James (http://en.wikipedia.org/wiki/Bill_James) is widely considered to be the father of baseball statistics, or sabermetrics.
Bill James came up with a whole bunch of very clever ways to analyze baseball using statistics, including runs created, range factor, and win shares.
Another one of his stats is Pythagorean expectation:

Expected winning % = (Runs scored)^2 / [(Runs scored)^2 + (Runs allowed)^2]
This statistic works very well for predicting wins, but it isn’t obvious why it should. Why does it work? I was wondering how it would compare with a multiple regression based on the same data.
Using 2008 MLB data on wins, runs scored, and opponent runs scored by team, I compared the predictions for expected wins made by Pythagorean expectation to those from a simple multiple regression model of the form Wins = Runs scored + Opponent runs scored + error. The root mean squared error for Pythagorean expectation was 7.37; the root mean squared error for the regression model was 4.18. So while Pythagorean expectation does a very good job predicting wins, a multiple linear regression does a better job. The regression model also has coefficients that can be interpreted practically, while Pythagorean expectation predicts well but offers very little reasoning as to why it works.
The model was: predicted wins = 79.7416 + .1025*Runs scored - .1088*Opponent runs scored, with R-squared = .8528. So, on average, approximately every extra ten runs a team scores is worth a win, and every extra ten runs given up costs a win.
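For anyone who wants to reproduce the comparison, here is a rough sketch in Python. The file mlb_2008.csv and the column names wins, runs_scored, and runs_allowed are my placeholders for however you store the data:

```python
# Compare Pythagorean expectation to a two-predictor regression, as above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mlb_2008.csv")  # assumed: one row per team

# Pythagorean expectation, scaled to a 162-game season.
rs2, ra2 = df["runs_scored"] ** 2, df["runs_allowed"] ** 2
pythag_wins = 162 * rs2 / (rs2 + ra2)

# Multiple regression: wins ~ runs scored + opponent runs scored.
X = sm.add_constant(df[["runs_scored", "runs_allowed"]])
fit = sm.OLS(df["wins"], X).fit()

def rmse(pred):
    return np.sqrt(np.mean((df["wins"] - pred) ** 2))

print("Pythagorean RMSE:", round(rmse(pythag_wins), 2))
print("Regression RMSE: ", round(rmse(fit.fittedvalues), 2))
print(fit.params)    # intercept and the two slope coefficients
print(fit.rsquared)  # R-squared
```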
What happens if you build a regression model with more than just runs and opponent runs as predictor variables? Using the same 2008 data, a model for predicting wins is:
Predicted wins=60.75-71.15*WHIP+.11768*SB+271.133*OBP+.10984*HR
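The fit itself is a one-liner on top of the earlier sketch; again, whip, sb, obp, and hr are assumed column names in the same hypothetical file:

```python
# Fit the four-predictor model; column names are my assumptions.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mlb_2008.csv")
X = sm.add_constant(df[["whip", "sb", "obp", "hr"]])
print(sm.OLS(df["wins"], X).fit().params)  # should roughly match the above
```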
It seems that you can break down winning baseball games into four factors:
1.) Pitching
2.) Speed
3.) Contact hitting
4.) Power hitting
I realize that’s not a shocking revelation, but it’s neat to see it even with this small data set.
I was a little bit surprised to see SB (stolen bases) show up, because a common theory is that stealing bases is not worth the risk, but it comes through very strongly in this model.
So I looked it up:
Top 5 teams in SB for 2008
1.) Tampa Bay Rays
2.) Colorado Rockies
3.) New York Mets
4.) Philadelphia Phillies
5.) Los Angeles Angels
Bottom 5 in SB in 2008
30.) San Diego Padres
29.) Pittsburgh Pirates
28.) Arizona Diamondbacks
27.) Atlanta Braves
26.) Detroit Tigers
3 of the top 5 and 5 of the top 7 teams made the playoffs. Interesting.
I’ll end with a quote from the greatest base stealer of all time: “It took a long time, huh? [Pause for cheers] First of all, I would like to thank God for giving me the opportunity. I want to thank the Haas family, the Oakland organization, the city of Oakland, and all you beautiful fans for supporting me. [Pause for cheers] Most of all, I’d like to thank my mom, my friends, and loved ones for their support. I want to give my appreciation to Tom Trebelhorn and the late Billy Martin. Billy Martin was a great manager. He was a great friend to me. I love you, Billy. I wish you were here. [Pause for cheers] Lou Brock was the symbol of great base stealing. But today, I’m the greatest of all time. Thank you.”
—Rickey Henderson’s full speech after breaking Lou Brock’s record
Cheers.
Presidential Voter Turnout Trends (in the wild)
Voter turnout for this last election was as high as it has been in the last 60 years. Click here for a good graphical display of voter turnout since 1948 on Andrew Gelman’s blog.
Also check out the United States Election Project, where they have data about past elections going back about 50 years.
Stats in World War II (in the wild)
A friend of mine told me about this problem, so I went and looked it up. This is stats in the wildest of the wild.
So, during World War II, the Allies were trying to estimate the number of German tanks of a certain kind. They needed this information to better plan their attacks and invasions. Two sets of estimates were made: one by intelligence and another by a group using statistical methods.
The estimates made using statistical methods in June 1940, June 1941, and August 1942 of the number of a certain type of German tank were, respectively, 169, 244, and 327. The intelligence estimates for those same three periods were, respectively, 1000, 1550, and 1550. (Source: Number of German tanks.)
These estimates are drastically different, and depending on which estimate was believed, it is possible that battle plans may have been significantly affected. So who made the better estimates?
In most situations, when we estimate something, we can never actually know the true value. However, as it turns out, after the war was over, German records became available, and with them the actual number of tanks the Germans had at each of those three points in time (June 1940, June 1941, and August 1942): respectively, 122, 271, and 342. (Recall that the statistical estimates were 169, 244, and 327 for those three time periods.) The statistical estimates are astonishingly close, while the intelligence estimates are alarmingly inaccurate. So how did they do it?
The statistical group looked at the serial numbers of tanks that had been captured or destroyed by Allied troops, and they assumed that the serial numbers ran from 1 to T, where T is the total number of tanks the Germans had. So, under that assumption, if the Allies found a tank with serial number 200, the Germans had at least (and almost surely more than) 200 tanks.
So if we assume that each serial number has an equal probability of being observed, our maximum likelihood estimate (our best guess) of T is simply the maximum serial number that we encounter on a destroyed tank. However, the maximum encountered serial number turns out to be a biased estimator of T. (If we always used the largest serial number as our estimate of T, we would be systematically underestimating T, because our largest observation is usually not the actual largest value.) So what we need is an unbiased estimator for T.
As it turns out, the expected value of this estimator (the maximum of n observed serial numbers) is approximately n/(n+1)*T, hence biased: on average, the largest observed value will be smaller than the actual T. To correct for this we simply multiply the largest observed value by (n+1)/n. This gives us an (approximately) unbiased estimate of the number of tanks the Germans had, and this is how they reached their statistical estimates.
Example:
Say we observe 50 tank serial numbers and the largest observed serial number is 245. With all of the above assumptions, our unbiased estimate of the number of tanks is 51/50*245 = 249.9.
If we observe 25 serial numbers and the largest is 110, our best guess is 26/25*110 = 114.4.
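A quick simulation (a sketch in Python; T = 342 and n = 25 are just picked to echo the numbers above) shows the raw maximum running low and the correction removing almost all of the bias:

```python
# Draw n serial numbers without replacement from 1..T, many times over;
# compare the average raw maximum to the average corrected estimate.
import random

T, n, reps = 342, 25, 100_000
raw_total, corrected_total = 0, 0.0
for _ in range(reps):
    m = max(random.sample(range(1, T + 1), n))
    raw_total += m
    corrected_total += (n + 1) / n * m

print("true T:                    ", T)
print("average raw maximum:       ", raw_total / reps)        # well below T
print("average corrected estimate:", corrected_total / reps)  # about T + 1
# The leftover ~1 tank of bias comes from the serial numbers being discrete;
# the exactly unbiased estimator for this discrete setting is m + m/n - 1.
```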
Here is a link to another blog post about the German tank problem.
Modern note: I saw online that someone was using this approach to try to estimate the number of servers that Google has. (More to come on that)
References:
Ruggles, R., and Brodie, H. (1947), “An Empirical Approach to Economic Intelligence in World War II,” Journal of the American Statistical Association, 42, 72–91.
Goodman, L. A. (1954), “Some Practical Techniques in Serial Number Analysis,” Journal of the American Statistical Association, 49, 97–112.
Speeling erroes kil blogers kredibilty (in the wild)
According to this article from www.readwriteweb.com, “Errors By Bloggers Kill Credibility & Traffic, Study Finds”. Interesting. So how did they reach this conclusion?
From the article:
“The company [goosegrade.com] asked a demographically diverse group of respondents on Amazon’s Mechanical Turk website to fill out the survey and published the results today on the goosegrade.com company blog. The bulk of respondents spent some time reading blogs but were people who remained dependent on ‘mainstream sources’ for most of their news.”
(For an explanation of Mechanical Turk, the Wikipedia article is here.)
Comment: How does gooseGrade know these people were demographically diverse? The only people they asked were Mechanical Turk workers, which seems like a very specific group of people, so they should only be able to make inferences about that group. Mechanical Turk workers hardly speak for internet users in general, but gooseGrade.com uses them to make inferences about “internet users” when it should just be making inferences about “Mechanical Turk workers who are being paid by gooseGrade.com.” Those two groups are drastically different.
gooseGrade.com says on their site (http://www.goosegrade.com/reader-perception-survey-results)
“Readers want gooseGrade. Here’s proof.
175 People polled.
ABSTRACT: It appears that grammar, spelling, factual, and other errors do affect reader opinion as well as how likely they are to share or link to an article. These errors also seem to dictate the readers opinion of the author’s skills as a writer. 65.86% of internet users say that a tool like gooseGrade would increase their confidence in the content they are reading. Filtering further shows that 9 out of 10 newspaper readers say that a tool like gooseGrade would increase their confidence in author’s content. This merrits further investigation of newspaper readers and could show a path for new media to take more market share.”
As I said before, I’m not sure the opinions of 175 (more on this below) Mechanical Turk workers are sufficient to make inferences about all internet users. Furthermore, remember that all of these respondents were paid by gooseGrade.com (although it was probably only a few cents).
A note on their sample size: They claim a sample size of 175 internet users, but an examination of the raw data shows that there are only 161 unique IP addresses. 9 IP addresses appear twice and 1 IP address appears 5 times. The repeats should be thrown out of the sample because they are likely the same person responding more than once.
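Checking for repeated IPs is a one-liner if the raw data is in a table; the file name and the "ip" column below are my guesses at how the published raw data might be laid out:

```python
# Count unique IPs and drop repeat responses from the same address.
import pandas as pd

responses = pd.read_csv("goosegrade_raw_data.csv")  # hypothetical file
print("responses: ", len(responses))                # they report 175
print("unique IPs:", responses["ip"].nunique())     # only 161

# Keep just the first response from each IP address.
deduped = responses.drop_duplicates(subset="ip", keep="first")
print("after removing repeat IPs:", len(deduped))
```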
The readwriteweb.com article concludes with:
“Below are a few of the charts, you can see the rest on the GooseGrade blog. The lesson here? It seems pretty clear. We bloggers are harming our own credibility and traffic with our inattention to details, not just in the facts, but in the basics of our writing. Let’s do better!”
Here is a promise I am willing to make. I’ll write better and make less grammatical errors if you apply statistics more fairer. (LOL)
Cheers.
Bradley Effect (in the wild)
The Bradley effect is a supposed effect in which a Black candidate’s poll numbers run higher than the percentage of the vote the candidate wins in the actual election. One explanation offered is that survey respondents do not want to appear racist, or to give an unpopular answer, and thus skew the results.
Articles about this appear at salon.com and fivethirtyeight.com.
I agree with the sentiment of Nate Silver, but I think his analysis is lacking. A better analysis of the problem is found here in a paper written by Postdoctoral Fellow Daniel Hopkins.
Hopkins concludes in his paper:
“The Wilder effect occupies an unusual position in our thinking about American elections, as it is often invoked (e.g. Elder, 2007; Lanning, 2005) but rarely scrutinized. By analyzing Senate and Gubernatorial elections between 1989 and 2006, this paper has provided the first large-sample test of the Wilder effect. In the early 1990s, there was a pronounced gap between polling and performance for black candidates of about 2.3 percentage points. But in the mid-1990s, that upward bias in telephone surveys disappeared. At a time when scholars are increasingly concerned about the validity of phone surveys, these results provide some reassurance. We have also seen that the polling-performance gap is closely related to a candidate’s level of preelection support, meaning that we should not naively attribute the entire observed gap in a given election to racial bias. Douglas Wilder, David Dinkins, and Tom Bradley were all frontrunners, and so all could have expected a small decline in their election day performance. Even female front-runners should expect declines into election day, although they are not subject to any Wilder-style bias.” – Hopkins (2008)
Sort of unrelated note:
Hopkins in his article says, “Even Tennessee’s 2006 Democratic nominee for Senate, Harold Ford Jr., experienced no Wilder effect after a negative television advertisement targeting him cued anxieties about inter-racial sex.9” The footnote number 9 is, “9. Specifically, a white actress in the late-October advertisement exclaims ‘I met Harold at the Playboy party’ and then closes the advertisement by winking and saying, ‘Harold, call me.'” Amazing.
Which hospital (in the wild)?
Say there are two hospitals in your area; we’ll call them General Hospital and Specific Hospital. (What?)
Ok. We’ll actually call them A and B. Hospital A starts a marketing campaign targeting specific groups of people. They run two ads. The first says, “If you’re under 65, we have a lower death rate than hospital B.” On another channel (probably CBS) they run another ad saying, “If you’re over 65, we have a lower death rate than hospital B.”
Hospital B goes out and crunches the numbers, and they find that, yes, in fact, Hospital A is telling the truth. Hospital B then starts its marketing campaign with an ad targeted at everyone, saying, “Our hospital has a lower death rate than hospital A.” Hospital A crunches the numbers and finds that hospital B is also telling the truth.
How can this be? Here is one scenario:
Hospital A’s stats are:
Under 65: 49 deaths in 100 patients
Over 65: 100 deaths in 300 patients
Hospital B’s stats are:
Under 65: 50 deaths in 100 patients
Over 65: 170 deaths in 500 patients
So this means for people under 65, hospital A’s death rate is .49 compared to .50 for hospital B. For over 65, hospital A’s death rate is .3333 versus hospital B’s death rate of .34. So for each of the two age groups, hospital A has the lower death rate. But the overall death rate for hospital A is .3725 (149/400), while the overall death rate for hospital B is .3667 (220/600). So hospital B has the lower overall death rate.
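This is the classic setup of Simpson’s paradox: the direction of a comparison can reverse when the groups are aggregated. A few lines of code confirm the rates from the numbers above:

```python
# Death rates from the numbers above; each entry is (deaths, patients).
data = {
    "A": {"under 65": (49, 100), "over 65": (100, 300)},
    "B": {"under 65": (50, 100), "over 65": (170, 500)},
}

for hospital, groups in data.items():
    for group, (deaths, patients) in groups.items():
        print(hospital, group, round(deaths / patients, 4))
    deaths = sum(d for d, _ in groups.values())
    patients = sum(p for _, p in groups.values())
    print(hospital, "overall", round(deaths / patients, 4))
```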
Which hospital would you rather visit?
Cheers.
I can make up correlations too (in the wild).
Hemline index updated
This is an article from the New York Times updating the hemline index. The hemline index, first proposed by economist George Taylor in the 1920s, says that during hard economic times women’s hemlines get longer. Shorter skirts = better economy.
The author is looking to update the hemline index to more modern times.
She finds some interesting correlations in her research. (The author explicitly mentions that “the causal link is elusive”)
Some indices that are suggested include:
During hard economic times:
Laxative sales go up
Sales of tobacco, carbonated drinks, and eggs go down
Sales of rice, beans, grains, and pasta go up (dry goods)
Property crime rises
People tend to head back to school
During good economic times:
Deodorant sales go up
Sales of lettuce, steak, and fruit go up (perishable items containing water)
She also notes that candy, beer, and pasta are recession proof.
I find all that to be very interesting.
I also find this to be interesting, but for an entirely different reason. At the beginning of her article she talks about Terry Pettijohn II, a professor of psychology at Coastal Carolina University who studies how “economic and social factors shape preferences in popular music, movie stars, and Playboy models.” This is his job. He concludes that during hard economic times people like songs that are “longer, slower, and with more meaningful themes.” In another article he concludes that during hard economic times Playboy’s Playmate of the Year tends to have a “more mature appearance” (i.e., older, heavier, taller, and less curvy). He also finds that in hard economic times American movie stars tend to have “small eyes, large chins, and thin faces” and thus “a more mature appearance.”
These correlations may exist, and may in fact be strong, but they seem a little shaky to me.
But if that’s the game you want to play, then let’s have at it. Since 1918, the Red Sox have won the World Series in 3 years. Since 1918, I have been in grad school for 4 years. The correlation between me being in grad school and the Red Sox winning a World Series is .65403. This is statistically significant at the .0001 level. (That means it is VERY statistically significant.) I like that game.
So, I’m not sure I buy any of Professor Pettijohn’s correlations (of course, I am making that judgment without reading any of his papers, so I am being a little unfair), but I do know that this man is smarter than I am, because he has convinced someone to pay him to look at Playboy centerfolds. Truly he is living out his boyhood dreams. Congratulations, Professor.
Cheers.
Fivethirtyeight.com (in the wild)
Presidential predictions and stats galore: www.fivethirtyeight.com
You can’t spell causal without ACLU……(in the wild)
Stats in the wild: AP article about racial profiling in LA, ACLU press release, and full report by Ian Ayres.
Ayres finds that minorities, including African-Americans and Hispanics, are stopped and searched at disproportionately high rates. He is quoted in the AP article: “The results of this study raise grave concerns that African-Americans and Hispanics are over-stopped, over-frisked, over-searched, and over-arrested,” said report author Ian Ayres, a Yale Law School economist and professor.
Some observations:
1.) It appears that African-Americans and Hispanics are definitely “over-stopped, over-frisked, over-searched, and over-arrested”. But is it because of their race? This is hard to prove: we have correlation, but not causation. To establish causation we would need to randomly assign people to live across the city and randomly assign a socio-economic status to each person. Clearly, this cannot be done. In reality, people of the same race often live together in neighborhoods and often have similar socio-economic statuses, which can confound the analysis.
2.) One statistic he offers and uses in his analysis is “stops per resident of a certain race”. He says, “African Americans were much more likely to be stopped than non-minorities. In the single-year of data, there were more than 4,500 stops for every 10,000 African American residents but only 1,750 stops for every 10,000 non-minority residents. In two divisions (Central and Hollywood), there were more stops of African Americans in one year than there were African American residents, meaning that the average number of stops per resident was greater than one. See Table 1.”
Is stops per resident a good metric for testing racial disparity? If there were no racial disparity, should we expect “stops per resident” to be about the same for each race? If there is an area where residents are mostly White or mostly African-American, any stop in that area will affect this measure greatly. What we really want is something like “stops per driver”, because the demographics of the drivers may be different from those of the residents. This isn’t hard to believe.
For example, say that in a certain area there are 100,000 residents: 90,000 white and 10,000 black. Now say there is a mall in this area and plenty of people drive in from the surrounding areas. Further assume that there are equal numbers of white and black drivers on the road. Now say that, in a given year, cops stop 1,000 people: 500 white and 500 black. The stop rate for whites is then about 56 per 10,000 residents, while for blacks it is 500 per 10,000 residents. This may not be the case in LA, but this relatively simple example shows how the metric could easily lead to skewed results.
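In code, the toy example looks like this (numbers straight from the paragraph above):

```python
# Identical stop counts, very different per-resident rates, because the
# resident populations differ while the drivers do not.
residents = {"white": 90_000, "black": 10_000}
stops = {"white": 500, "black": 500}

for group in residents:
    per_10k = stops[group] / residents[group] * 10_000
    print(f"{group}: {per_10k:.0f} stops per 10,000 residents")
```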
3.) The regression that is done uses a rate as the response variable. This would lend itself nicely to logistic regression, which may be more appropriate.
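For concreteness, here is a hedged sketch of what a binomial (logistic) regression on stop counts could look like. The file, columns, and covariates are invented placeholders for illustration, not the report’s actual data or model:

```python
# Binomial GLM on (stops, non-stops) counts per area/group.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("stops_by_area.csv")  # hypothetical: one row per area/group
df["not_stopped"] = df["residents"] - df["stops"]

# statsmodels lets a binomial GLM take (successes, failures) on the left side.
model = smf.glm(
    "stops + not_stopped ~ race + income",  # hypothetical covariates
    data=df,
    family=sm.families.Binomial(),
).fit()
print(model.summary())
```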
Note: I have no affiliation with the LAPD or ACLU.
Cheers.