Outliers in the wild
So, I was at Barnes and Noble today with a few hours to kill. I sat down and started reading Malcolm Gladwell’s new book, Outliers: The Story of Success. The further I read, the more it became clear to me that he wasn’t really talking about outliers. Also, since I have my qualifying exam on January 19th, it can’t hurt to do a review of detecting outliers. (I started this post before my qualifier, which has since come and gone. I passed, by the way.) I go on to review what outliers are and then conclude by explaining how Gladwell’s book isn’t really about outliers. If the middle part bores you, just skip to the conclusion. (Note: I love Gladwell’s work. I have read all of his books and all of his New Yorker articles.)
Let’s try to answer this question: What is an outlier? One good answer can be found here at wolfram.com, and another good explanation here. If you are interested in the Wikipedia answer, that can be found here.
So how do we look for outliers? Say we only have one variable. A common way of defining outliers (suggested at wolfram and in most intro stat classes) is to look for observations that are above Q3+1.5*IQR or below Q1-1.5*IQR. Here Q1 is the first quartile of the data (the point below which 25% of the data falls) and Q3 is the third quartile (the point below which 75% of the data falls). IQR is the interquartile range and is defined to be Q3-Q1. Below is a Box and Whisker plot of 99 random observations from a standard normal distribution and one observation with value 10. The box represents the IQR, while points outside of the whiskers are considered outliers.
Now consider that we wish to relate two variables, X and Y, using simple linear regression with the model Y=B0+B1*X+epsilon (where epsilon ~ N(0,sigma^2) and sigma^2 is fixed but unknown). It is very important to look for outliers in the X direction, as they may heavily impact the final estimates of B0 and B1. A measure called leverage is used to check for outliers in the X direction. The leverage for the i-th point is defined as the i-th diagonal element of the projection matrix P=X*g-inv(X’X)*X’ (where g-inv denotes a generalized inverse). The first column in the data below is the response variable Y, the second column is a column of ones for the intercept, and the third column is the predictor variable. Therefore, in my formula for P, the X matrix I am talking about is the second and third columns of the data.
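As a rough illustration of the leverage formula, here is a minimal Python sketch that computes the diagonal of P for a model with an intercept and one predictor. The predictor values are hypothetical (they are not the data from this post); they were picked so that four observations share one X value and the third sits off on its own. The function name leverages is also just a label for this sketch, and it uses an ordinary inverse of X’X rather than a generalized inverse, so it assumes the X values are not all equal.

```python
def leverages(x):
    """Diagonal of P = X (X'X)^-1 X' for the design matrix X = [1, x]."""
    n = len(x)
    # Entries of the 2x2 matrix X'X = [[n, sx], [sx, sxx]].
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal, so X'X is not invertible")
    # (X'X)^-1 = (1/det) * [[sxx, -sx], [-sx, n]], and the i-th leverage is
    # h_i = [1, x_i] (X'X)^-1 [1, x_i]', which expands to the expression below.
    return [(sxx - 2 * sx * v + n * v * v) / det for v in x]


# Hypothetical predictor values, chosen for illustration only.
x = [1.0, 1.0, 5.0, 1.0, 1.0]
h = leverages(x)
print([round(v, 2) for v in h])  # [0.25, 0.25, 1.0, 0.25, 0.25]
print(sum(h))                    # 2.0, the number of regression parameters
```

In this made-up example the lone X value gets a leverage of 1, and the leverages sum to 2, the number of columns in X.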
The corresponding measures of leverage are: 0.25 0.25 1.00 0.25 0.25
We see that observation 3 has a leverage of 1. This is the maximum leverage that a data point can achieve, and it occurs when the fitted regression line is forced to pass through the observation. We consider an observation to be an outlier in X if its leverage is large. So, clearly, this value is an outlier in X. Such a large leverage should concern a good statistician, as the point may also have large influence. (Here, however, even though observation 3 has the max leverage, the point has no influence. If we removed it from the analysis, the ordinary least squares regression line would not change at all.) A very good (and more thorough) explanation of leverage and influence (Cook’s Distance) can be found here. (I am partial to DFFITS for measuring influence, myself.)
Anyway, back to outliers. Alright. So once I fit my model I get predicted values of my response variable. We’ll call these predicted values y_hat. Using these we can define a residual as the quantity y-y_hat. Now if this residual is large relative to the estimated value of sigma (MSE is used to estimate sigma^2, so its square root estimates sigma), then we consider that observation as a whole to be an outlier.
The Conclusion
If you have read this whole post, Cheers. If you skipped the middle part and came straight from the intro, welcome to the conclusion. I hope you enjoy your stay. So my big point is this: Nothing that Gladwell talks about in his book is really an outlier. Consider this example. You go and collect data on 7 children’s heights (in inches). You collect 48,48,49,51,52,45,67. When you do a boxplot you see that the observation 67 is an outlier. However, when you consider age as a predictor of height, you can see that the child who was 67 inches was older than the rest of the children. Surely, none of these observations can be considered outliers when age is factored in.
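The boxplot part of this example is easy to check. Below is a minimal Python sketch of the Q1-1.5*IQR / Q3+1.5*IQR rule from earlier, applied to the seven heights. The function name iqr_outliers and the choice of the "inclusive" quartile method are assumptions of this sketch; software packages use slightly different quartile conventions, so the exact fences can shift a little.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


heights = [48, 48, 49, 51, 52, 45, 67]
print(iqr_outliers(heights))  # [67]
```

With this quartile convention Q1 is 48 and Q3 is 51.5, so the fences sit at 42.75 and 56.75, and only the 67-inch child is flagged. Note that this one-variable check knows nothing about age.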
A child would be an outlier only if they were significantly taller or shorter than their age would predict. In Gladwell’s book he talks about Bill Gates and the Beatles as being outliers in terms of success. Considered by themselves, yes, Bill Gates and the Beatles are outliers on many scales, including success and income. However, he then goes on to look at what makes these “outliers,” and he concludes that in order to be an expert you need 10,000 hours of training. Well, if success is a function of training and it takes 10,000 hours to become an expert, then the likes of the Beatles and Bill Gates aren’t outliers at all. They are exactly as successful as they are predicted to be, since clearly both this band and this software engineer have put in well over 10,000 hours. An outlier in this case would be someone who practiced for 10,000 hours but was very unsuccessful or, on the other hand, someone who doesn’t practice at all but is wildly successful. Gladwell himself admits that he couldn’t find any examples of wild success without putting in the training. So his book isn’t really about outliers at all. He is just looking at the top one percent of the top one percent.
All this being said, I still think Outliers: The Story of Success is a very entertaining and interesting book. Also, be sure to check out his other books Blink and The Tipping Point (my favorite). And definitely be sure to read the Malcolm Gladwell archive of his old New Yorker articles.