Probability in the Wild
This isn’t really “statistics” or “in the wild” since it’s a problem from an advanced probability class I’m taking, but it’s a neat problem, so you’ll just have to live with it.
Here is the game:
I get a die and the opponent gets a die. (These dice are not necessarily numbered 1, 2, 3, 4, 5, 6. The sides could be numbered 1, 1, 1, 3, 3, 3, for example, but each die has six sides.) We both roll. If your number is higher than my number, you win a dollar. If my number is higher than your number, I win a dollar.
Can you design three six-sided dice in such a way that, if the opponent chooses his die first, I can always choose a die from the remaining two that makes my expected winnings positive?
STOP HERE IF YOU WANT TO TRY THIS YOURSELF!

You probably didn’t even try. So here’s an answer.
Here are three possible dice:
A: 0 0 6 6 6 6
B: 4 4 4 4 4 4
C: 2 2 2 2 8 8
Die A will beat Die B with probability 2/3.
Die B will beat Die C with probability 2/3.
Die C will beat Die A with probability 5/9.
(Pr(C>A) = P(C>A|A=0)*P(A=0) + P(C>A|A=6)*P(A=6) = 1*(1/3) + (1/3)*(2/3) = 5/9.)
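If you want to double-check those numbers (or try designing your own dice), here is a quick brute-force sketch in Python. This isn’t from the class, just a sanity check of the probabilities above.

```python
# Brute-force check of the pairwise win probabilities for the three dice above.
from itertools import product

A = [0, 0, 6, 6, 6, 6]
B = [4, 4, 4, 4, 4, 4]
C = [2, 2, 2, 2, 8, 8]

def win_prob(mine, yours):
    """Probability that my die shows a strictly higher number than yours."""
    wins = sum(m > y for m, y in product(mine, yours))
    return wins / (len(mine) * len(yours))

print("P(A > B) =", win_prob(A, B))  # 2/3
print("P(B > C) =", win_prob(B, C))  # 2/3
print("P(C > A) =", win_prob(C, A))  # 5/9
```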
So if I let my opponent choose first, I can always make a choice that will make my expectation positive.
The moral:
Make these dice, head to a bar, and fleece some drunks (just don’t tell them I sent you). I guess that’s the “in the wild” part.
Cheers.
Shark, the Global Financial Crisis, and Correlation (in the wild)
I was reading the fantastic blog The Bernoulli Trial today (the Stock Market Seismometer is very interesting), and I stumbled onto one of his links, which brought me to The Mr Science Show. All of the posts are very interesting, but one post seemed particularly relevant to this page.
One of the topics I have talked about in the past is correlations between two variables when the correlation is just a coincidence (like me being in grad school and the Red Sox winning the World Series). Here is a post on Mr. Science Show which relates the global financial crisis to the number of fatal shark attacks in the world. In fact, this correlation has won the prestigious Mr. Science Show Correlation of the Week. Congratulations.
Cheers.
Shirts in the Wild
Having trouble dressing yourself and love statistics? Solve your problems with T-shirts bearing witty statistics slogans from SassyStatistics.com. (The link was sent in to statsinthewild by avid reader Kevin S. Bagge.)
My favorite would be: “One way or ANOVA, I’m gonna getcha.”
(Note: I am in no way financially involved with SassyStatistics.com)
(Note: If Sassy statistics wants to pay me, then I will gladly be involved with SassyStatistics.com)
Cheers.
Data Privacy in the Wild
I’m currently working on a systematic literature review of statistical issues of privacy. One of the papers I was reading today was from the Census. This paper discusses how the Census maintains the privacy of individuals when it releases microdata to the public.
(I’ve been reading some papers which propose methods that are much more sophisticated and (hopefully) provide more privacy than the rather simple methods the Census mentions in this paper. I have no opinion on whether or not these methods are “sophisticated enough” to maintain privacy; I just think it’s interesting to see what the Census is actually doing.)
The paper mentions three techniques that the Census currently uses to maintain privacy:
1.) release of data for only a sample of the population
2.) limitation of detail
3.) top/bottom-coding
Here is an explanation from the census as to why they use these techniques:
“The Census Bureau currently uses several standard techniques to mask microdata sets. The first is a release of data for only a sample of the population. Intruders (i.e., those who query the file for the sole purpose of identifying particular individuals with unique traits) realize that there is only a small probability that the file actually contains the records for which they are looking. The Bureau currently releases three public use samples of the decennial census respondents. One is a 1 percent sample of the entire population, the second a 5 percent sample, and the third a sample of elderly residents. Each is a systematic sample chosen with a random start. None of these files overlap, so there is no danger of matching to each other. Most demographic surveys are 1-in-1000 and 1-in-1500 “random” samples. Generally the public use file for each survey contains records for each respondent. The second technique involves the limitation of detail. The Census Bureau releases no geographic identifiers which would restrict the record to a sub-population of less than 100,000. It also “recodes” some continuous values into intervals and combines sparse categories. Intruders must have extremely fine detail for other highly sensitive fields in order to positively identify targets. The third technique protects the detail in sensitive responses in continuous fields. It is referred to as top/bottom-coding. This method collapses extreme values of each sensitive field into a single value. For example, the record of an individual with an extremely high income would not contain his exact income but rather a code showing that the income was over $100,000. Similarly the low-income records would contain a code signifying the income was less than $0. In this example $0 is a bottom-code and $100,000 a top-code for the sensitive or high visibility field of income.” – Controlled Data-Swapping Techniques for Masking Public Use Microdata Sets, Richard A. Moore, Jr.
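Top/bottom-coding itself is simple enough to sketch in a few lines. Here is my own toy illustration (not code from the paper), using the $0 bottom-code and $100,000 top-code from the example above; a real release would typically store a flag like “over $100,000” rather than the clipped value itself.

```python
# Toy illustration of top/bottom-coding: extreme values of a sensitive field are
# collapsed to a single cutoff so individual extremes cannot be singled out.
def top_bottom_code(incomes, bottom=0, top=100_000):
    """Collapse values below `bottom` or above `top` to the cutoffs."""
    return [min(max(x, bottom), top) for x in incomes]

raw = [-5_000, 12_000, 48_000, 250_000, 1_750_000]
print(top_bottom_code(raw))  # [0, 12000, 48000, 100000, 100000]
```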
The rest of the paper goes on to discuss a more sophisticated method of maintaining privacy called data swapping (first proposed by Dalenius and Reiss (1980)). If you’re not bored to tears already, it’s probably worth reading about. I think it’s interesting.
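The controlled version in Moore’s paper has more machinery, but the basic idea of data swapping is easy to illustrate: exchange the values of a sensitive field between randomly paired records, so the column’s overall distribution is preserved while the link between any one record and its original value is broken. Here is a toy sketch of that idea (my own illustration, not the paper’s algorithm).

```python
# Toy illustration of plain (uncontrolled) data swapping: randomly pair up a
# fraction of the records and exchange their sensitive values.
import random

def swap_column(values, swap_rate=0.5, seed=0):
    """Return a copy of `values` with roughly swap_rate of entries swapped in pairs."""
    rng = random.Random(seed)
    values = list(values)
    idx = list(range(len(values)))
    rng.shuffle(idx)
    n_swap = int(len(values) * swap_rate) // 2 * 2  # need an even number to pair up
    for i in range(0, n_swap, 2):
        a, b = idx[i], idx[i + 1]
        values[a], values[b] = values[b], values[a]
    return values

incomes = [31_600, 57_500, 48_000, 95_000, 22_000, 130_000]
print(sorted(incomes) == sorted(swap_column(incomes)))  # True: the marginal distribution is unchanged
```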
Cheers.
R in the wild
Here is an interesting blog post about how Google and Facebook are using R (the free statistical software) at their companies.
Cheers.
The New Netflix Prize (in the wild)
I know it’s been a while since I’ve posted. I’ve been using Twitter (follow StatsInTheWild) a lot more for the smaller posts.
Anyway, I’ve been preparing and studying for my general exam (you might call it a comprehensive exam), and so I’ve been reading a lot about disclosure limitation, my dissertation topic. In putting together a presentation explaining why disclosure control is necessary, I’ve listed two examples of really bad disclosures. The first was presented by Latanya Sweeney in her paper on k-anonymity. She took supposedly anonymized data released by the Group Insurance Commission (GIC) and, using publicly available voting records, identified former Massachusetts Governor William Weld.
This is a huge problem in many areas of research. So many people rely on publicly released data for research, but organizations may be apprehensive about releasing their data due to privacy concerns. My second example involves the data from the Netflix Prize. Narayanan and Shmatikov (2008), in their paper “How To Break Anonymity of the Netflix Prize Dataset”, use Netflix Prize data along with Internet Movie Database data.
They say in their abstract: “We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”
So what is the problem with disclosing what movies someone is watching? It’s illegal. Check out the Video Privacy Protection Act of 1988 (18 U.S.C. 2710 (2002)). So someone decided to sue Netflix. The Privacy Law Blog has a great article summarizing the situation: Netflix Sued for “Largest Voluntary Privacy Breach To Date”.
From that article: “Plaintiffs argue this disclosure constitutes a severe invasion of their privacy by Netflix, which violates, among other things, the Video Privacy Protection Act of 1988 (18 U.S.C. 2710 (2002)). Additionally, the lead plaintiff in this case, Jane Doe, claims that Netflix’s disclosure of her movie rental history and ratings has and/or will ‘identify or permit inference of her sexual orientation… [which…] would negatively affect her ability to pursue her livelihood and support her family, and would hinder her and her children’ ability to live peaceful lives within Plaintiff Doe’s community.’”
If you’re a lawyer, I’d love some quick comments on this case. So anyway, that’s some background, and it brings me to my main point. I was going to suggest that Netflix run another Netflix prize. After checking, I found they have already decided to do that.
So instead, I’ll just suggest what the second Netflix contest should be. This contest would be a prize for figuring out a way of releasing Netflix data in such a way that valid inference can still be made while, at the same time, maintaining privacy. The format would be as follows: Netflix brings in experts in statistical disclosure limitation (for example, Jerry Reiter (his papers on the subject)) to create private versions of the Netflix data for release to the public (for example, with synthetic data).

Say there were 10 experts. Netflix would put $100,000 in escrow for each expert. Anyone who can demonstrate a privacy breach in any of the private data sets within 12 months gets the $100,000 of the expert who created that data set. If twelve months elapse, the expert keeps the $100,000 (let’s say they have to donate a portion to the charity of their choice). The other part of the contest would be similar to the first Netflix prize, but the private data would be used in the modelling efforts. One suggestion I read for the second Netflix prize was predicting churn.

Whoever comes up with the best (by some appropriate criterion) model for predicting churn using the private data wins $1,000,000. Further, whichever expert’s private data was used to create the best model wins another $100,000, so the experts have a financial incentive to create private data that is as useful as possible. Users have two incentives: 1) demonstrating a privacy breach or 2) improving churn models. They can work on either or both. If a privacy breach is ever demonstrated for a particular data set, all models for that data set are disqualified. One potential problem with this proposal is defining what exactly a privacy breach is. It’s up to Netflix to decide these details.

Framing the contest this way will accomplish several goals. First, showing that they are concerned with privacy may earn them points with customers who are worried about such things. Second, if they want to keep doing Netflix prizes, and I suspect they do, they are going to run into this privacy problem over and over again. By dealing with it now, they will be able to continue their Netflix prizes by releasing useful data to the public for research in a private way. Further, if they demonstrate a way to release data with high utility and high privacy, other potential data-releasing organizations could use the protocols that Netflix pioneered with their second Netflix prize.
Also, Netflix if you are reading this and you want to hire me, I could be lured away from grad school for the right price.
Cheers.
As a Netflix user, I recommend Big Fan, did not like The 39 Steps, and the next movie in my queue is Funny People.
College in the Wild
Here is an interesting article titled “The Great College Hoax” which was emailed to me by an avid reader.
Here is an excerpt from the article:
“College graduates will earn $1 million more than those with only a high school diploma, brags Mercy College radio ads running in the New York area. The $1 million shibboleth is a favorite of college barkers.
Like many good cons, this one contains a kernel of truth. Census figures show that college grads earn an average of $57,500 a year, which is 82% more than the $31,600 high school alumni make. Multiply the $25,900 difference by the 40 years the average person works and, sure enough, it comes to a tad over $1 million.
But anybody who has gotten a passing grade in statistics knows what’s wrong with this line of argument. A correlation between B.A.s and incomes is not proof of cause and effect. It may reflect nothing more than the fact that the economy rewards smart people and smart people are likely to go to college. To cite the extreme and obvious example: Bill Gates is rich because he knows how to run a business, not because he matriculated at Harvard. Finishing his degree wouldn’t have increased his income.” – The Great College Hoax, Kathy Kristof 02.02.09, 12:00 AM ET
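Just to verify the arithmetic in that excerpt before commenting on it (a quick check, not part of the article):

```python
# Checking the arithmetic quoted above: the earnings gap times a 40-year career.
college, high_school = 57_500, 31_600
gap = college - high_school   # 25,900
print(gap, gap * 40)          # 25900 1036000 -- "a tad over $1 million"
```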
This is a good point, but I wouldn’t say that the statement “College graduates will earn $1 million more than those with only a high school diploma” only contains “a kernel of truth”. I would say that it is a completely accurate statement. College graduates do earn more than those with a high school diploma. She herself cites the census as proof of this statement.
She also points out that just because Bachelor’s degrees are correlated with higher incomes doesn’t necessarily mean that Bachelor’s degrees cause higher income. That’s true. But isn’t it at least possible that Bachelor’s degrees do cause higher income?
And, finally, isn’t her argument about understanding the difference between correlation versus causation a subtle argument for more people to take statistics classes? Maybe in college? Because education is important? It’s likely that she, herself, learned of this concept in a statistics class at her alma mater, the University of Southern California.
The bottom line here is that statistics are important for navigating the modern world we live in, and a great place to learn about statistical concepts (as well as many other worthy academic pursuits) is in college. In learning about these, however, no student should have to go so far into debt that they are financially ruined for the next ten, twenty, or thirty years, or possibly the rest of their lives.
Cheers.
Good Jobs in the Wild
I was reading the Bernoulli Trial blog and came across this link to an article in the Wall Street Journal about the best and worst jobs in the United States.
Statistician comes in at number 3!
Cheers.
Outliers in the wild
So, I was at Barnes and Noble today with a few hours to kill. I sat down and started reading Malcolm Gladwell’s new book, Outliers: The Story of Success. The further I read, the more it became clear to me that he wasn’t really talking about outliers. Also, since I have my qualifying exam on January 19th, it can’t hurt to do a review on detecting outliers. (I started this post before my qualifier, which has since come and gone. I passed, by the way.) I go on to review what outliers are and then conclude by explaining how Gladwell’s book isn’t really about outliers. If the middle part bores you, just skip to the conclusion. (Note: I love Gladwell’s work. I have read all of his books and all of his New Yorker articles.)
Let’s try to answer this question: What is an outlier? One good answer can be found here at wolfram.com and another good explanation here. If you are interested in the wikipedia answer, that can be found here.
So how do we look for outliers? Say we only have one variable. A common way of defining outliers (suggested at wolfram and most intro stat classes) is to look for observations that are above Q3+1.5*IQR or below Q1-1.5*IQR. Here Q1 is the first quartile of the data (the point at which 25% of the data is below) and Q3 is the third quartile (the point at which 75% of the data is below). IQR is the interquartile range and is defined to be Q3-Q1. Below is a Box and Whisker plot of 99 random observations from a standard normal distribution and one observation with value 10. The box represents the IQR, while points outside of the whiskers are considered outliers.
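Here is a minimal sketch of that rule in Python on the same setup (99 standard normal draws plus one planted value of 10); I’m omitting the plot itself.

```python
# The 1.5*IQR rule on 99 standard normal draws plus one planted value of 10.
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.standard_normal(99), 10.0)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])  # the planted 10 gets flagged (a few normal draws might too)
```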
Now consider that we wish to relate two variables, X and Y, using simple linear regression with the model Y = B0 + B1*X + epsilon (where epsilon ~ N(0, sigma^2) and sigma^2 is fixed but unknown). It is very important to look for outliers in the X direction, as they may heavily impact the final estimates of B0 and B1. A measure called leverage is used to check for outliers in the X direction. The leverage for the i-th point is defined as the i-th diagonal of the projection matrix P = X*ginv(X'X)*X', where ginv denotes a generalized inverse. The first column in the data below is the response variable Y, the second column is a column of ones for the intercept, and the third column is the predictor variable. Therefore, in my formula for P, the X matrix I am talking about is the second and third columns of the data.
| -2 | 1 | 1 |
| -1 | 1 | 1 |
| 0 | 1 | 10 |
| 2 | 1 | 1 |
| 2 | 1 | 1 |
The corresponding measures of leverage are: 0.25 0.25 1.00 0.25 0.25.
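(Those values are just the diagonal of the hat matrix; here is a quick numpy check of the numbers, which wasn’t in the original post.)

```python
# Verifying the leverages: the diagonal of P = X * ginv(X'X) * X' for the data above.
import numpy as np

X = np.array([[1, 1],
              [1, 1],
              [1, 10],
              [1, 1],
              [1, 1]], dtype=float)   # intercept column and predictor column

P = X @ np.linalg.pinv(X.T @ X) @ X.T  # pinv plays the role of the generalized inverse
print(np.round(np.diag(P), 2))         # [0.25 0.25 1.   0.25 0.25]
```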
We see that observation 3 has a leverage of 1. This is the maximum leverage that can be achieved by a data point, and it occurs when the regression line passes through the observation. We consider an observation to be an outlier in X if the leverage is large. So, clearly, this value is an outlier in X. Such a large value of leverage should concern a good statistician, as the point may also have large influence. (Here, however, even though observation 3 has the maximum leverage, the point has no influence. If we removed it from the analysis, the ordinary least squares regression line would not change at all.) A very good (and more thorough) explanation of leverage and influence (Cook’s Distance) can be found here. (I am partial to DFFITS for measuring influence, myself.)

Anyway, back to outliers. Alright. So once I fit my model, I get predicted values of my response variable. We’ll call these y_hat. Using these, we can define a residual as the quantity y − y_hat. Now if this residual is large relative to the estimated value of sigma^2 (MSE is used to estimate sigma^2), then we consider that observation as a whole to be an outlier.

The Conclusion

If you have read this whole post, Cheers. If you skipped the middle part and came straight from the intro, welcome to the conclusion. I hope you enjoy your stay. So my big point is this: nothing that Gladwell talks about in his book is really an outlier. Consider this example. You go and collect data on 7 children’s heights: 48, 48, 49, 51, 52, 45, 67. When you do a boxplot, you see that the observation 67 is an outlier.
However, when you consider age as a predictor of height, you can see that the child who was 67 inches tall was older than the rest of the children. Surely, none of these observations can be considered outliers when age is factored in. A child would be an outlier only if they were significantly taller or shorter than their age would predict.
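To make that concrete, here is a small sketch using the heights from above. The ages are made up purely for illustration (the original example doesn’t give them): the 1.5*IQR rule flags the 67-inch child, but the residuals from a height-on-age regression don’t single that child out.

```python
# Heights from the example above; the ages are hypothetical, added only for illustration.
import numpy as np

heights = np.array([48, 48, 49, 51, 52, 45, 67], dtype=float)
ages    = np.array([ 7,  7,  8,  8,  9,  6, 13], dtype=float)  # made-up ages

# 1) One-variable check: the 1.5*IQR rule flags the 67-inch child.
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
print(heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)])  # [67.]

# 2) With age as a predictor, look at residuals from a simple linear fit instead.
slope, intercept = np.polyfit(ages, heights, 1)
print(np.round(heights - (intercept + slope * ages), 1))  # no residual stands far from the rest
```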
In Gladwell’s book he talks about Bill Gates and the Beatles as being outliers in terms of success. Considered by themselves, yes, Bill Gates and the Beatles are outliers on many scales, including success and income. However, he then goes on to look at what makes these “outliers”, and he concludes that in order to be an expert you need 10,000 hours of training. Well, if success is a function of training and it takes 10,000 hours to become an expert, then the likes of the Beatles and Bill Gates aren’t outliers at all. They are exactly as successful as they are predicted to be, since clearly both this band and this software engineer have put in well over 10,000 hours. An outlier in this case would be someone who practiced for 10,000 hours but was very unsuccessful or, on the other hand, someone who doesn’t practice at all but is wildly successful. Gladwell admits himself that he couldn’t find any examples of wild success without putting in the training. So his book isn’t really about outliers at all. He is just looking at the top one percent of the top one percent.

All this being said, I still think Outliers: The Story of Success is a very entertaining and interesting book. Also, be sure to check out his other books Blink and The Tipping Point (my favorite). And definitely be sure to read the Malcolm Gladwell archive of his old New Yorker articles.
Cheers.
Distorted Statistics in the Wild
Here is an article titled “Girls Gone Bad: Statistics Distort the Truth.” I think this is a misleading headline. Statistics don’t distort the truth; the misinterpretation of statistics distorts the truth.
Here are the two comments on the article from the site:
The first from xpan09: “That’s true. Statistics should wait before announcing a rise in violence until policies concerning them have leveled off. Also, what I’ve seen of girls being violent has not changed from my perspective and has been constant throughout my years.”
and the second from Ival: “Accurate statistics are just that and nothing more. Only liars and spinners distorth truth.”
Ival really sums up my point. Thanks Ival.
Cheers.