Bradley Effect (in the wild)
The Bradley effect is a supposed effect in which Black candidates’ poll numbers are often higher than the share of the vote they actually win in the election. One explanation offered is that survey respondents do not want to appear racist or give an unpopular answer, and thus skew the results.
Articles about this appear at salon.com and fivethirtyeight.com.
I agree with the sentiment of Nate Silver, but I think his analysis is lacking. A better analysis of the problem is found here in a paper written by Postdoctoral Fellow Daniel Hopkins.
Hopkins concludes in his paper:
“The Wilder effect occupies an unusual position in our thinking about American elections, as it is often invoked (e.g. Elder, 2007; Lanning, 2005) but rarely scrutinized. By analyzing Senate and Gubernatorial elections between 1989 and 2006, this paper has provided the first large-sample test of the Wilder effect. In the early 1990s, there was a pronounced gap between polling and performance for black candidates of about 2.3 percentage points. But in the mid-1990s, that upward bias in telephone surveys disappeared. At a time when scholars are increasingly concerned about the validity of phone surveys, these results provide some reassurance. We have also seen that the polling-performance gap is closely related to a candidate’s level of preelection support, meaning that we should not naively attribute the entire observed gap in a given election to racial bias. Douglas Wilder, David Dinkins, and Tom Bradley were all frontrunners, and so all could have expected a small decline in their election day performance. Even female front-runners should expect declines into election day, although they are not subject to any Wilder-style bias.” – Hopkins (2008)
Sort of unrelated note:
Hopkins in his article says, “Even Tennessee’s 2006 Democratic nominee for Senate, Harold Ford Jr., experienced no Wilder effect after a negative television advertisement targeting him cued anxieties about inter-racial sex.9” The footnote number 9 is, “9. Specifically, a white actress in the late-October advertisement exclaims ‘I met Harold at the Playboy party’ and then closes the advertisement by winking and saying, ‘Harold, call me.'” Amazing.
Which hospital (in the wild)?
Say there are two hospitals in your area. We’ll call them General Hospital and Specific Hospital. (What?)
Ok. We’ll actually call them A and B. Hospital A starts a marketing campaign and they want to target specific groups of people. They run two ads. The first says, “If you’re under 65 we have a lower death rate than hospital B.” On another channel (probably CBS) they run another ad saying, “If you’re over 65, we have a lower death rate than hospital B.”
Hospital B goes out and crunches the numbers and they find that, yes, in fact Hospital A is telling the truth. Hospital B then starts its marketing campaign with an ad targeted to everyone saying, “Our hospital has a lower death rate than hospital A.” Hospital A crunches the numbers and finds that hospital B is in fact telling the truth.
How can this be? Here is one scenario:
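Hospital A’s stats are:
Under 65: 49 deaths in 100 patients
Over 65: 100 deaths in 300 patients
Hospital B’s stats are:
Under 65: 50 deaths in 100 patients
Over 65: 170 deaths in 500 patients
So for people under 65, Hospital A’s death rate is .49 compared to .50 for Hospital B. For people over 65, Hospital A’s death rate is .3333 versus Hospital B’s death rate of .34. So in each of the two age groups Hospital A has the lower death rate. But the overall death rate for Hospital A is 149/400 = .3725 and the overall death rate for Hospital B is 220/600 = .3667. So Hospital B has the lower overall death rate. The reversal happens because the two hospitals see different mixes of patients: a larger share of Hospital B’s patients (500 of 600, versus 300 of 400 at Hospital A) fall in the over-65 group, which has the lower death rate in this example. (This kind of reversal is known as Simpson’s paradox.)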
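If you want to check the arithmetic yourself, here is a minimal Python sketch using the made-up numbers from the scenario above. The dictionary layout and the function names are just my own choices for illustration.

```python
from fractions import Fraction

# (deaths, patients) for each age group, copied from the scenario above.
hospitals = {
    "A": {"under_65": (49, 100), "over_65": (100, 300)},
    "B": {"under_65": (50, 100), "over_65": (170, 500)},
}

def group_rate(hospital, group):
    """Death rate within one age group at one hospital."""
    deaths, patients = hospitals[hospital][group]
    return Fraction(deaths, patients)

def overall_rate(hospital):
    """Death rate at one hospital with both age groups pooled."""
    groups = hospitals[hospital].values()
    total_deaths = sum(deaths for deaths, _ in groups)
    total_patients = sum(patients for _, patients in groups)
    return Fraction(total_deaths, total_patients)

for group in ("under_65", "over_65"):
    print(group, float(group_rate("A", group)), float(group_rate("B", group)))
print("overall", float(overall_rate("A")), float(overall_rate("B")))
```

Hospital A wins in each age group, and Hospital B wins once the groups are pooled, because the pooled rate weights each group by how many patients fall into it.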
Which hospital would you rather visit?
Cheers.
I can make up correlations too (in the wild).
Hem Line index updated
This is an article from the New York Times updating the hemline index. The hemline index, first proposed by economist George Taylor in the 1920s, says that during hard economic times women’s hemlines get longer. Shorter skirts = Better economy.
The author is looking to update the hemline index to more modern times.
She finds some interesting correlations in her research. (The author explicitly mentions that “the causal link is elusive.”)
Some indices that are suggested include:
During hard economic times:
Laxative sales go up
Sales of tobacco, carbonated drinks, and eggs go down
Sales of rice, beans, grains, and pasta go up (dry goods)
Property crime rises
People tend to head back to school
During good economic times:
Deodorant sales go up
Sales of lettuce, steak, and fruit go up (perishable items containing water)
She also notes that candy, beer, and pasta are recession proof.
I find all that to be very interesting.
I also find this to be interesting, but for an entirely different reason. At the beginning of her article she talks about Terry Pettijohn II, a professor of psychology at Coastal Carolina University. Professor Pettijohn studies how “economic and social factors shape preferences in popular music, movie stars, and Playboy models.” This is his job. He concludes that during hard economic times people like songs that are “longer, slower, and with more meaningful themes.” In another article he concludes that during hard economic times Playboy’s playmate of the year appears to have a “more mature appearance” (i.e., older, heavier, taller, and less curvy). He also finds that in hard economic times American movie stars tend to have “small eyes, large chins, and thin faces” and thus “a more mature appearance.”
These correlations, which may exist, and may in fact be strong correlations, seem a little shaky to me.
But if that’s the game you want to play then let’s have at it. Since 1918 the Red Sox have won the World Series in 3 years. Since 1918 I have been in grad school for 4 years. The correlation between me being in grad school and the Red Sox winning a World Series is .65403. This is statistically significant at the .0001 level. (That means it is VERY statistically significant.) I like that game.
So, I’m not sure I buy any of Professor Pettijohn’s correlations (Of course I am making that judgement without reading any of his papers, so I am being a little unfair.), but I do know that this man is smarter than I am because he has convinced someone to pay him to look at Playboy centerfolds. Truly he is living out his boyhood dreams. Congratulations Professor.
Cheers.
Fivethirtyeight.com (in the wild)
Presidential predictions and stats galore: www.fivethirtyeight.com
You can’t spell causal without ACLU……(in the wild)
Stats in the wild: AP article about racial profiling in LA, ACLU press release, and full report by Ian Ayres.
Ayres finds that minorities, including African-Americans and Hispanics, are stopped and searched at disproportionately high rates. He is quoted in the AP article: “The results of this study raise grave concerns that African-Americans and Hispanics are over-stopped, over-frisked, over-searched, and over-arrested,” said report author Ian Ayres, a Yale Law School economist and professor.
Some observations:
1.) It appears that African-Americans and Hispanics are definitely “over-stopped, over-frisked, over-searched, and over-arrested”. But is it because of their race? This is hard to prove. We have correlation, but not causation. To establish causation we would need to randomly assign people to live across the city and randomly assign a socio-economic status to each person. Clearly, this cannot be done. In reality, people of the same race often live together in neighborhoods and often have similar socio-economic statuses, which can confound the analysis.
2.) One statistic he offers and uses in his analysis is “stops per resident of a certain race”. He says, “African Americans were much more likely to be stopped than non-minorities. In the single-year of data, there were more than 4,500 stops for every 10,000 African Americans residents but only 1,750 stops for every 10,000 non-minority residents. In two divisions (Central and Hollywood), there were more stops of African Americans in one year than there were African American residents, meaning that the average number of stops per resident was greater than one.12 See Table 1.”
Is stops per resident a good metric for testing racial disparity? If there is no racial disparity, should we assume that the “stops per resident” would be about the same for each race? If there is an area where residents are mostly White or mostly African-American, any stop in that area will affect this measure greatly. What we really want is something like “stops per driver”, because the demographics of the drivers may differ from those of the residents. This isn’t hard to believe.
For example, say that a certain area has 100,000 residents: 90,000 are white and 10,000 are black. Now say there is a mall in this area and plenty of people drive in from the surrounding areas. Further assume that there are an equal number of white and black drivers on the road. Now say that in a given year cops stop 1,000 people, 500 white and 500 black. The stop rate for whites is about 56 per 10,000 residents, and for blacks it is 500 per 10,000 residents, even though white and black drivers were stopped in equal numbers. This may not be the case in LA, but this relatively simple example shows how this metric could easily lead to skewed results.
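Here is the mall example as a small Python sketch. The resident and stop counts come from the example above. The driver count (50,000 of each race) is a number I am inventing purely for illustration, since the example only says the two groups of drivers are equal in size.

```python
def rate_per_10k(stops, population):
    """Stops per 10,000 members of the given population."""
    return 10_000 * stops / population

# Counts from the hypothetical example above.
white_residents, black_residents = 90_000, 10_000
white_stops, black_stops = 500, 500

# ASSUMPTION: the example only says the driver counts are equal;
# 50,000 each is a made-up figure for illustration.
white_drivers = black_drivers = 50_000

white_per_resident = rate_per_10k(white_stops, white_residents)
black_per_resident = rate_per_10k(black_stops, black_residents)
white_per_driver = rate_per_10k(white_stops, white_drivers)
black_per_driver = rate_per_10k(black_stops, black_drivers)

print("per 10,000 residents:", round(white_per_resident), round(black_per_resident))
print("per 10,000 drivers:  ", round(white_per_driver), round(black_per_driver))
```

Measured per resident, the black stop rate looks about nine times the white rate. Measured per driver, under the assumed driver counts, the two rates are identical.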
3.) The regression that is done is using a rate as a response variable. This would lend itself nicely to logistic regression, which may be more appropriate.
Note: I have no affiliation with the LAPD or ACLU.
Cheers.
Starbucks and losing bucks (in the wild)
So Starbucks caused the global crisis, right?
Slate article about Starbucks and the financial crisis.
From Slate:
“At first blush, there’s a pretty close correlation between a country having a significant Starbucks presence, especially in its financial capital, and major financial cock-ups, from Australia (big blowups in finance, hedge funds, and asset management companies; 23 stores) to the United Kingdom (nationalization of its largest banks). In many ways, London in recent years has been a more concentrated version of New York—the wellspring of many toxic innovations, a hedge-fund haven. It sports 256 Starbucks. In Spain, which is now grappling with the bursting of a speculative coastal real-estate bubble (sound familiar?), the financial capital, Madrid, has 48 outlets. In crazy Dubai, 48 Starbucks outlets serve a population of 1.4 million. And so on: South Korea, which is bailing out its banks big time, has 253; Paris, the locus of several embarrassing debacles, has 35.”
Stats in the Wild: Law School
This is good:
Law School and Stats
You went to law school to get away from stats. But it followed you. You know why? Cause you’re in the wild.
A note about the article: The author mentions in the first paragraph that “Usually, the investigator seeks to ascertain the causal effect of one variable upon another—the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate.”
One needs to be careful about the difference between correlation and causation. The investigator is often interested in establishing a causal relationship between two variables, but that can only be done through a well-designed randomized experiment. If we do not have a randomized experiment, the best the investigator can do is establish a correlation between two variables. (More to come on the difference between causation and correlation.)
As wikipedia says:
“The concept of correlation is particularly noteworthy. Statistical analysis of a data set may reveal that two variables (that is, two properties of the population under consideration) tend to vary together, as if they are connected. For example, a study of annual income and age of death among people might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated (which is a positive correlation in this case). However, one cannot immediately infer the existence of a causal relationship between the two variables. (See Correlation does not imply causation.) The correlated phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable.”
http://en.wikipedia.org/wiki/Statistics
Cheers.
How confident are you in interpreting confidence intervals?
Say I want to estimate the mean of a population. In order to quantify the amount of uncertainty about my guess, I construct a confidence interval. So you might see a statement like “A 95% confidence interval for the mean is (17.6,24.6).”
This means that 95% of the time the true mean is in the interval.
I just lied to you.
In the classical statistical framework, the true value of the mean is unknown and fixed. Therefore, the probability that the true mean is in that interval is either 0 or 1. It’s either in the interval all the time or never in the interval because the true mean is fixed. There is nothing random about it.
What is random is the sample. So, the true interpretation of a confidence interval is this: “If the confidence interval were constructed in the same way over and over again, each time from a new random sample, 95% of these similarly constructed intervals would contain the true value of the mean.”
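The “over and over again” interpretation can be sketched with a small simulation in Python. Everything in it is an assumption made for illustration: a normal population with a true mean of 21 and a known standard deviation of 5, samples of size 30, and a z-based interval (which is exact when the standard deviation is known).

```python
import random
from statistics import NormalDist, fmean

def coverage(true_mean=21.0, sigma=5.0, n=30, trials=2000, level=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain true_mean.

    All defaults are illustrative assumptions, not values from any real data set.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sigma / n ** 0.5          # sigma is treated as known
    hits = 0
    for _ in range(trials):
        # Draw a fresh random sample and build the interval around its mean.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = fmean(sample)
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(coverage())
```

The true mean never moves. Only the intervals do, and the printed fraction should land close to 0.95.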
So go forth with confidence in your statements about confidence intervals.
Cheers.
Plus or minus what….(in the wild)?
I don’t know if many of you know, but there is an election in November. For president. Of the United States.
As a result, you’re probably being inundated with polls. Obama 50 McCain 46 plus or minus 3 points. McCain 48 Obama 46 plus or minus 4 points. You know, stuff like that.
So how do they get the plus or minus number?
Let’s take the Zogby poll for 10-17-2008 through 10-19-2008. They “randomly” surveyed 1211 people, and they reported Obama 50% McCain 46% plus or minus 3 points.
So let’s think about what is going on. What we are trying to do is estimate the proportion of the population of likely voters that will vote for Obama or McCain. The only way to truly find the proportion who will vote for Obama or McCain is to ask everyone. For a million reasons a polling company can’t just go ask everyone in the country who they are going to vote for. (The only group with enough resources to do that is the government, and even they have trouble.) So we sample (randomly!) from this population to attempt to estimate the true population parameter of interest.
So a polling company goes out and asks N likely voters who they are going to vote for. We are trying to estimate the probability of voting for Obama (or McCain). A fixed number N of independent trials, each with the same probability of a “yes”? That seems like a binomial random variable to me.
A binomial random variable is characterized by two parameters, N and P. N is our sample size, and we wish to estimate P. We estimate P simply by calculating X/N where X is the number of people voting for Obama (or McCain). We call the estimate P_hat to distinguish it from P, the true parameter. The variance of P_hat is estimated by P_hat*(1-P_hat)/N.
Now it just so happens that as the sample size gets large, the distribution of the estimator of P tends towards a normal distribution. Thus we can use a normal approximation to build a confidence interval for the parameter.
With this normal approximation, the 95% confidence interval for the true value of P is P_hat plus or minus 1.96*(Standard Deviation of P_hat), where the standard deviation is estimated by sqrt(P_hat*(1-P_hat)/N).
So back to the Zogby poll. They asked 1211 people who they were going to vote for. 606 responded Obama. So the best guess we can make as to the true value of the parameter is 606/1211=.5004. That’s the 50% estimate for Obama. Now we compute 1.96*sqrt(.5004*(1-.5004)/1211)=0.02816138. So our estimate is accurate to within 2.8%. Round up to 3 and that’s how Zogby gets its plus or minus number.
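The same calculation as a short Python sketch, using the counts quoted above (606 out of 1211). The function name is my own, and 1.96 is hard-coded as the default multiplier to match the arithmetic in the post.

```python
from math import sqrt

def margin_of_error(successes, n, z=1.96):
    """Return (p_hat, margin) using the normal approximation to the binomial.

    z=1.96 corresponds to a 95% confidence interval.
    """
    p_hat = successes / n
    std_dev = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard deviation of p_hat
    return p_hat, z * std_dev

p_hat, margin = margin_of_error(606, 1211)
print(f"estimate: {p_hat:.4f}, plus or minus {margin:.4f}")
```

Because the margin shrinks with the square root of N, halving it takes roughly four times as many respondents.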
Cheers.