Category Archives: Sampling
I just came across this poll question on ESPN about people’s opinions regarding Yunel Escobar’s suspension, and I couldn’t help but notice that this map (below) looks very similar to the 2008 presidential election map. I suppose you could view the map below as essentially a state-by-state snapshot of how residents (well, residents who read ESPN.com and vote in their polls) feel about homosexual slurs. Take a look at this map:
Now take a look at this map that shows which states voted for McCain and Obama in the 2008 presidential election:
The similarities are striking. The states that judge the punishment to be too harsh, the blue states in ESPN’s map, align very strongly with the red states in the presidential map. Similarly, the green (Too lenient) and red (Just right) states align very closely with the blue states in the presidential election map. Of the six green states, all but South Dakota voted for Obama in the last election. Of the red states in the ESPN map, only Montana, North Dakota, and West Virginia went for McCain in 2008. Likewise, of the blue states in the ESPN map, only Indiana, North Carolina, and Florida went for Obama. Finally, here is a two-by-two table with a breakdown of the relationship between Yunel Escobar opinions and presidential states. (The p-value for Fisher’s exact test of independence between rows and columns is 8.213e-07.) So, it seems there is a significant association between how a state’s ESPN.com poll respondents feel about Yunel Escobar’s suspension for using a gay slur and how that state voted in the 2008 presidential election.
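For anyone curious what sits behind a p-value like that, here is a minimal Python sketch of a two-sided Fisher's exact test on a 2x2 table. The counts are not the table from this post. They are a rough reconstruction from the tallies in the text, assuming exactly 50 states (no D.C.), each falling in exactly one ESPN category, so the result need not match the p-value quoted above. The function name `fisher_exact_two_sided` is just illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Add up every possible table that is at least as unlikely as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# ASSUMED counts (reconstruction, not the original table):
# rows = ESPN result (too harsh, not too harsh); columns = 2008 winner (McCain, Obama)
too_harsh = (18, 3)       # 3 Obama states per the text; 18 McCain is inferred
not_too_harsh = (4, 25)   # SD, MT, ND, WV for McCain; 25 Obama is inferred

p_value = fisher_exact_two_sided(*too_harsh, *not_too_harsh)
print(f"two-sided p-value: {p_value:.3e}")
```

The test conditions on the row and column totals, which is why only the top-left cell needs to vary in the sum.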
I was reading an article on Huffington Post about the Massachusetts Senate election, and it linked to an article that cites a poll conducted by Western New England University’s (née College) Polling Institute. I was interested in this because I grew up near the college, and I had never heard of this polling institute before. So, I decided to take a look. I read through their survey and got to the end, where it had a description of the methodology. Reading it casually, a few things jumped out at me. I have one question and one comment.
First the question:
They state in their methodology:
The Polling Institute dialed household telephone numbers, known as “landline numbers,” and cell phone numbers for the survey. In order to draw a representative sample from the landline numbers, interviewers first asked for the youngest male age 18 or older who was home at the time of the call, and if no adult male was present, the youngest female age 18 or older who was at home at the time of the call.
This seems to me like it will bias the sample, as they are much more likely to end up sampling men than women. They do note that “The landline and cell phone data were combined and weighted to reflect the adult population of Massachusetts by gender, race, age, and county of residence using U.S. Census estimates for Massachusetts”, but then why ask for the youngest male over 18? Is this a valid method? In the final results they have a nearly even split of men vs. women, but it seems to me that using this method you’re going to get a sample that is biased toward younger, male voters. Can someone explain to me why this is or is not valid? I really don’t know, but it seems odd to me.
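To make the weighting step concrete, here is a small Python sketch of post-stratification on gender alone. Every number in it is hypothetical: the sample split and the population shares are made up for illustration, not taken from the poll or the Census, and `poststratify` is just a name I picked.

```python
def poststratify(sample_counts, population_shares):
    """Return one weight per group so weighted sample shares match the population."""
    n = sum(sample_counts.values())
    return {g: population_shares[g] / (sample_counts[g] / n) for g in sample_counts}

# HYPOTHETICAL sample that over-represents men
sample_counts = {"male": 300, "female": 245}
# HYPOTHETICAL population shares (not actual Census figures)
population_shares = {"male": 0.48, "female": 0.52}

weights = poststratify(sample_counts, population_shares)

# Weighted share of each group after applying the weights
n = sum(sample_counts.values())
weighted_shares = {g: sample_counts[g] * weights[g] / n for g in sample_counts}

print(weights)
print(weighted_shares)
```

The weights pull the gender split back to the target shares, which is consistent with a near-even split in the final results. They do nothing about who within each gender ends up in the sample, so this sketch does not settle whether asking for the youngest male is valid.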
And now the comment:
In the next paragraph, they state:
All surveys are subject to sampling error, which is the expected probable difference between interviewing everyone in a population versus a scientific sampling drawn from that population. The sampling error for a sample of 444 likely voters is +/- 4.6 percent at a 95 percent confidence interval. Thus if 55 percent of likely voters said they approved of the job that Scott Brown is doing as U.S. Senator, one would be 95 percent sure that the true figure would be between 50.4 percent and 59.6 percent (55 percent +/- 4.6 percent) had all Massachusetts voters been interviewed, rather than just a sample. The margin of sampling error for the sample of 545 registered voters is +/- 4.2 percent at a 95 percent confidence interval. Sampling error increases as the sample size decreases, so statements based on various population subgroups are subject to more error than are statements based on the total sample. Sampling error does not take into account other sources of variation inherent in public opinion studies, such as non-response, question wording, or context effects.
This is simply an incorrect explanation of a confidence interval (which I’ve actually written about before, a long time ago when I first started this blog). In frequentist statistics, the true value that you are trying to estimate is assumed to be fixed and unknown (hence you are trying to estimate it). A sample of data is then collected to estimate the unknown quantity, and a confidence interval can be constructed. However, the probability that the true figure is in this particular confidence interval is either 0 or 1, since there is nothing stochastic about the true value that is being estimated. Saying there is a 95% chance that the true value lies in one particular interval will lose you points on a statistics test. So, I don’t know what they mean by being “95% sure” here. The correct interpretation is that 95% of similarly constructed intervals will contain the true value. This is a different statement than being 95% sure the true value is between the upper and lower limits of the one confidence interval you have constructed from your one sample.
Imagine that you conducted this survey with exactly the same N many times. Each time you would come up with a different estimate of the true figure and a different confidence interval. If you examined all of these confidence intervals together, about 95% of them would contain the true value of the parameter that is being estimated. This is a pretty common misinterpretation of the meaning of a confidence interval, and it took me quite a long time to understand the difference, but what concerns me here is that this isn’t an intro stats course, it’s a polling institute.
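The repeated-sampling idea is easy to check by simulation. Here is a minimal Python sketch, assuming the usual normal-approximation (Wald) interval for a proportion and a made-up true approval rate of 55%. The function names are mine, and I am not claiming this is how the Polling Institute computed its numbers.

```python
import math
import random

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def margin_of_error(n, p=0.5, z=Z95):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def coverage(true_p, n, reps, seed=0):
    """Fraction of simulated surveys whose 95% interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # One simulated survey: n independent yes/no answers
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        half_width = margin_of_error(n, p_hat)
        if p_hat - half_width <= true_p <= p_hat + half_width:
            hits += 1
    return hits / reps

# Worst-case (p = 0.5) margins for the two sample sizes quoted above
moe_444 = margin_of_error(444)
moe_545 = margin_of_error(545)
print(f"n=444: +/- {100 * moe_444:.2f} points")
print(f"n=545: +/- {100 * moe_545:.2f} points")

# ASSUMED true value of 0.55, chosen only for illustration
cov = coverage(true_p=0.55, n=444, reps=2000)
print(f"share of intervals containing the true value: {cov:.3f}")
```

With p = 0.5 the formula gives roughly 4.65 points for n = 444 and 4.20 points for n = 545, close to the quoted 4.6 and 4.2. The coverage fraction should come out near 0.95, and that is a statement about the procedure across many samples, not about any single interval.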
I was recently reading Gelman’s blog, and he wrote a post about this: Patents Aren’t Only for Engineers. Apparently, some actuary received a patent for statistical sampling. The author (is that what they are called?) of the patent, Jay Vadiveloo, is a mathematics professor in residence at the University of Connecticut, and he is quoted in the article as saying:
To me, the results were astounding: statistical sampling worked.
At this point, I thought I got the joke, and I went to check the date that the article was published, expecting to see April 1. Nope. May 12, actually. So, this is serious? If this is serious, this is (how can I say this nicely?) astonishingly… I have nothing nice to say here. So I will say nothing. Congratulations, Professor Vadiveloo.
Gelman goes on to write:
P.S. Mendelssohn writes: “Yes, I felt it was a heartwarming story also. Perhaps we can get a patent for regression.”
I say, forget a patent for regression. I want a patent for the sample mean. That’s where the real money is. You can’t charge a lot for each use, but consider the volume!
This reminded me of a conversation I had with a fellow graduate student in statistics who had done a summer internship at an insurance company. They explained to me that someone in the insurance industry had either tried to patent or already held a patent on multiple regression for use in insurance. Now, I wasn’t sure how true the story was when I heard it, and I’m still not sure how true it is, but if you can get a patent on sampling, I suppose anything is possible.
Question for a lawyer: If someone can get a patent on sampling after it has been around in the statistics literature for a very, very long time, can I just go through more recent statistics literature and just start filing patents on other people’s ideas?
I’m not actually going to do that, but that is a serious question. What is preventing me from doing that?