# Brown/Warren, Polls, Sampling, and Confidence Intervals

I was reading an article on Huffington Post about the Massachusetts Senate election, which linked to an article citing a poll conducted by Western New England University's (née College) Polling Institute. I was interested because I grew up near the college and had never heard of this polling institute before, so I decided to take a look. I read through their survey and got to the end, where they describe their methodology. A couple of things jumped out at me, so I have one question and one comment.

First the question:

They state in their methodology:

> The Polling Institute dialed household telephone numbers, known as “landline numbers,” and cell phone numbers for the survey. In order to draw a representative sample from the landline numbers, interviewers first asked for the youngest male age 18 or older who was home at the time of the call, and if no adult male was present, the youngest female age 18 or older who was at home at the time of the call.

This seems to me like it will bias the sample, since they are much more likely to end up interviewing men than women. They do note that “The landline and cell phone data were combined and weighted to reflect the adult population of Massachusetts by gender, race, age, and county of residence using U.S. Census estimates for Massachusetts”, but then why ask for the youngest male over 18 in the first place? Is this a valid method? The final results do show a nearly even split of men and women, but it seems to me that with this method you're going to get a raw sample that is biased toward younger, male voters. Can someone explain to me why this is or is not valid? I really don't know, but it seems odd to me.
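To make the weighting step concrete, here is a toy sketch of weighting by gender alone. Every number in it is made up for illustration (the sample counts, the approval rates, and the population shares are hypothetical and do not come from the poll), and I am not claiming this is the procedure the Polling Institute actually used. It only shows how a raw sample that over-represents men can be re-weighted to match known population shares.

```python
def unweighted_estimate(sample_counts, sample_means):
    """Pooled estimate that treats every respondent equally."""
    total = sum(sample_counts.values())
    return sum(sample_counts[g] * sample_means[g] for g in sample_counts) / total


def weighted_estimate(sample_means, population_shares):
    """Weight each group's sample mean by its share of the population."""
    return sum(population_shares[g] * sample_means[g] for g in population_shares)


# Hypothetical raw sample that skews male (all numbers invented).
sample_counts = {"men": 300, "women": 200}
# Hypothetical share of each group that approves.
sample_means = {"men": 0.50, "women": 0.60}
# Hypothetical population shares, standing in for Census estimates.
population_shares = {"men": 0.48, "women": 0.52}

raw = unweighted_estimate(sample_counts, sample_means)
adjusted = weighted_estimate(sample_means, population_shares)

if __name__ == "__main__":
    print(f"unweighted: {raw:.3f}")
    print(f"weighted:   {adjusted:.3f}")
```

Weighting fixes the gender mix in this toy case, but it cannot fix differences within a group. If the men who are reached differ from men in general, for example by being younger, weighting by gender alone will not correct for that.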

And now the comment:

In the next paragraph, they state:

> All surveys are subject to sampling error, which is the expected probable difference between interviewing everyone in a population versus a scientific sampling drawn from that population. The sampling error for a sample of 444 likely voters is +/- 4.6 percent at a 95 percent confidence interval. Thus if 55 percent of likely voters said they approved of the job that Scott Brown is doing as U.S. Senator, one would be 95 percent sure that the true figure would be between 50.4 percent and 59.6 percent (55 percent +/- 4.6 percent) had all Massachusetts voters been interviewed, rather than just a sample. The margin of sampling error for the sample of 545 registered voters is +/- 4.2 percent at a 95 percent confidence interval. Sampling error increases as the sample size decreases, so statements based on various population subgroups are subject to more error than are statements based on the total sample. Sampling error does not take into account other sources of variation inherent in public opinion studies, such as non-response, question wording, or context effects.

This is simply an incorrect explanation of a confidence interval (something I wrote about a long time ago, when I first started this blog). In frequentist statistics, the true value you are trying to estimate is assumed to be fixed and unknown (hence the need to estimate it). A sample of data is collected to estimate that unknown quantity, and a confidence interval can be constructed from the sample. However, once the interval has been computed, the probability that the true figure lies inside it is either 0 or 1, since there is nothing stochastic about the true value being estimated. Their interpretation will lose you points on a statistics test. So I don't know what they mean by being “95% sure” here.

The correct interpretation is that 95% of similarly constructed intervals will contain the true value. This is a different statement from being 95% sure that the true value lies between the upper and lower limits of the one confidence interval you have constructed from your one sample. Imagine that you conducted this survey many times with exactly the same N. Each time you would come up with a different estimate of the true figure and a different confidence interval. If you examined all of these confidence intervals together, about 95% of them would contain the true value of the parameter being estimated.

This is a pretty common misinterpretation of the meaning of a confidence interval, and it took me quite a long time to understand the difference. What concerns me here is that this isn't an intro stats course; it's a polling institute.
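Here is a small simulation sketch of that repeated-sampling idea. It makes two assumptions that are not in the poll's methodology. First, that the margin of error is the usual normal-approximation formula z * sqrt(p(1 - p) / n) with p = 0.5 and z = 1.96, which gives roughly 4.65 points for n = 444 and 4.2 points for n = 545, close to the quoted figures. Second, that the "true" approval rate is 55%, a value chosen purely for illustration. The function names are my own.

```python
import math
import random

Z_95 = 1.96  # normal critical value for a 95% confidence level


def margin_of_error(n, p=0.5, z=Z_95):
    """Normal-approximation margin of error for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def coverage(true_p, n, trials, seed=0):
    """Fraction of simulated polls whose 95% interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one simulated poll of n independent yes/no respondents.
        yes = sum(1 for _ in range(n) if rng.random() < true_p)
        p_hat = yes / n
        # Build the interval from this sample's own estimate.
        moe = margin_of_error(n, p_hat)
        if p_hat - moe <= true_p <= p_hat + moe:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"MOE, n=444: {margin_of_error(444):.4f}")
    print(f"MOE, n=545: {margin_of_error(545):.4f}")
    # true_p = 0.55 is a made-up value used only for this illustration.
    print(f"coverage:   {coverage(0.55, 444, 2000):.3f}")
```

The coverage figure should come out close to 0.95. Any single simulated interval either contains 0.55 or it does not, and the 95% describes the procedure across many repetitions, which is exactly the distinction above.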

Cheers.