Stats in the Wild

Starbucks and losing bucks (in the wild)

So Starbucks caused the global crisis, right?

Slate article about Starbucks and the financial crisis.
From Slate:
“At first blush, there’s a pretty close correlation between a country having a significant Starbucks presence, especially in its financial capital, and major financial cock-ups, from Australia (big blowups in finance, hedge funds, and asset management companies; 23 stores) to the United Kingdom (nationalization of its largest banks). In many ways, London in recent years has been a more concentrated version of New York—the wellspring of many toxic innovations, a hedge-fund haven. It sports 256 Starbucks. In Spain, which is now grappling with the bursting of a speculative coastal real-estate bubble (sound familiar?), the financial capital, Madrid, has 48 outlets. In crazy Dubai, 48 Starbucks outlets serve a population of 1.4 million. And so on: South Korea, which is bailing outs its banks big time, has 253; Paris, the locus of several embarrassing debacles, has 35.”


Stats in the Wild: Law School

Oct 21

Posted by statsinthewild

This is good:
Law School and Stats

You went to law school to get away from stats. But it followed you. You know why? Cause you’re in the wild.

A note about the article: The author mentions in the first paragraph that “Usually, the investigator seeks to ascertain the causal effect of one variable upon another—the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate.”

One needs to be careful about the difference between correlation and causation. The investigator is often interested in establishing a causal relationship between two variables, but that can only be done through a well-designed randomized experiment. If we do not have a randomized experiment, the best the investigator can do is establish a correlation between two variables. (More to come on the difference between causation and correlation.)

As Wikipedia says:

“The concept of correlation is particularly noteworthy. Statistical analysis of a data set may reveal that two variables (that is, two properties of the population under consideration) tend to vary together, as if they are connected. For example, a study of annual income and age of death among people might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated (which is a positive correlation in this case). However, one cannot immediately infer the existence of a causal relationship between the two variables. (See Correlation does not imply causation.) The correlated phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable.”

http://en.wikipedia.org/wiki/Statistics
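To see a lurking variable at work, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the names `pearson` and `simulate_confounding`, the sample size, and the data, which are just random noise. The variables x and y never influence each other; both are driven by a hidden z.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_confounding(n=20000, seed=1):
    """Return (raw, adjusted) correlations between x and y.

    Hypothetical setup: z is the lurking variable, and x and y are
    each z plus their own independent noise.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # lurking variable
    x = [zi + rng.gauss(0, 1) for zi in z]    # driven by z, not by y
    y = [zi + rng.gauss(0, 1) for zi in z]    # driven by z, not by x
    raw = pearson(x, y)
    # In this toy setup we know exactly how z enters x and y, so we can
    # subtract it out; what remains is independent noise.
    adjusted = pearson([xi - zi for xi, zi in zip(x, z)],
                       [yi - zi for yi, zi in zip(y, z)])
    return raw, adjusted

raw, adjusted = simulate_confounding()
print(f"raw correlation: {raw:.2f}, after removing z: {adjusted:.2f}")
```

The raw correlation comes out around 0.5 even though neither variable causes the other, and it drops to roughly zero once z is accounted for. With real data you rarely know what z is, which is exactly the problem.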

Cheers.


How confident are you in interpreting confidence intervals?

Oct 21

Posted by statsinthewild

Say I want to estimate the mean of a population. In order to quantify the amount of uncertainty about my guess, I construct a confidence interval. So you might see a statement like “A 95% confidence interval for the mean is (17.6,24.6).”

This means that 95% of the time the true mean is in the interval.

I just lied to you.

In the classical statistical framework, the true value of the mean is unknown and fixed. Therefore, the probability that the true mean is in that particular interval is either 0 or 1. It’s either in the interval or it isn’t, because the true mean is fixed. There is nothing random about it.

What is random is the sample, and therefore the interval computed from it. So, the true interpretation of a confidence interval is this: “If the confidence interval were constructed in the same way over and over again from new random samples, 95% of these similarly constructed intervals would contain the true value of the mean.”
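The repeated-sampling interpretation is easy to check by simulation. Below is a rough Python sketch, not a definitive recipe: the function name `coverage`, the population (normal with mean 21.1 and standard deviation 5), and the sample size are all assumptions chosen for illustration. It draws many samples, builds a 95% interval from each, and counts how often the fixed true mean lands inside.

```python
import random
import statistics

def coverage(true_mean=21.1, sd=5.0, n=100, trials=2000, seed=0):
    """Fraction of 95% intervals that contain the fixed true mean.

    All parameter values are hypothetical. Each interval is
    sample mean +/- 1.96 * (sample sd / sqrt(n)).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        lo, hi = m - 1.96 * se, m + 1.96 * se
        # The true mean never changes; only the interval does.
        hits += lo <= true_mean <= hi
    return hits / trials

print(f"share of intervals containing the true mean: {coverage():.3f}")
```

The share should land close to 0.95. It will tend to sit a hair below, because using 1.96 with an estimated standard deviation is itself a large-sample approximation. Any single interval either contains 21.1 or it doesn't; the 95% describes the procedure.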

So go forth with confidence in your statements about confidence intervals.

Cheers.


Plus or minus what….(in the wild)?

Oct 21

Posted by statsinthewild

I don’t know if many of you know, but there is an election in November. For president. Of the United States.

As a result, you’re probably being inundated with polls. Obama 50 McCain 46 plus or minus 3 points. McCain 48 Obama 46 plus or minus 4 points. You know, stuff like that.

So how do they get the plus or minus number?

Let’s take the Zogby poll for 10-17-2008 through 10-19-2008. They “randomly” surveyed 1211 people, and they reported Obama 50% McCain 46% plus or minus 3 points.

So let’s think about what is going on. What we are trying to do is estimate the proportion of the population of likely voters that will vote for Obama or McCain. The only way to truly find the proportion who will vote for Obama or McCain is to ask everyone. For a million reasons a polling company can’t just go ask everyone in the country who they are going to vote for. (The only group with enough resources to do that is the government, and even they have trouble.) So we sample (randomly!) from this population to attempt to estimate the true population parameter of interest.

So a polling company goes out and asks N likely voters who they are going to vote for. We are trying to estimate the probability of voting for Obama (or McCain). A fixed number N of independent trials, each with the same probability of success? That seems like a binomial random variable to me.

A binomial random variable is characterized by two parameters, N and P. N is our sample size, and we wish to estimate P. We estimate P simply by calculating X/N where X is the number of people voting for Obama (or McCain). We call the estimate P_hat to distinguish it from P, the true parameter. The variance of P_hat is estimated by P_hat*(1-P_hat)/N.

Now it just so happens that as the sample size gets large, the distribution of the estimator of P tends towards a normal distribution. Thus we can use a normal approximation to build a confidence interval for the parameter.

With this normal approximation, a 95% confidence interval for the true value of P is P_hat plus or minus 1.96*(standard error of P_hat), where the standard error is the square root of the estimated variance of P_hat.

So back to the Zogby poll. They asked 1211 people who they were going to vote for. 606 responded Obama. So the best guess we can make for the true value of the parameter is 606/1211=.5004. That’s the 50% estimate for Obama. Now we compute 1.96*sqrt(.5004*(1-.5004)/1211)=0.02816138. So our estimate is accurate to within about 2.8 percentage points. Round up to 3 and that’s how Zogby gets its plus or minus number.
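If you want to redo that arithmetic for another poll, here is a small Python sketch of the same normal-approximation calculation. The function name `margin_of_error` is just a label I picked, and the count of 606 is the one used above. Real pollsters may also adjust for weighting and survey design, which this ignores.

```python
import math

def margin_of_error(successes, n, z=1.96):
    """Return (p_hat, margin) using the normal approximation.

    p_hat is successes / n, and margin is
    z * sqrt(p_hat * (1 - p_hat) / n).
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, z * se

# Numbers from the Zogby example above: 606 of 1211 for Obama.
p_hat, moe = margin_of_error(606, 1211)
print(f"{p_hat:.4f} +/- {moe:.4f}")  # prints "0.5004 +/- 0.0282"
```

Plug in a different sample size and you can see why the plus or minus shrinks slowly: quadrupling N only cuts the margin in half.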

Cheers.

