# Outliers in the wild

So, I was at Barnes and Noble today with a few hours to kill. I sat down and started reading Malcolm Gladwell’s new book, Outliers: The Story of Success. The further I read, the clearer it became to me that he wasn’t really talking about outliers. Also, since I have my qualifying exam on January 19th, it can’t hurt to review how to detect outliers. (I started this post before my qualifier, which has since come and gone. I passed, by the way.) I go on to review what outliers are and then conclude by explaining how Gladwell’s book isn’t really about outliers. If the middle part bores you, just skip to the conclusion. (Note: I love Gladwell’s work. I have read all of his books and all of his New Yorker articles.)

Let’s try to answer this question: What is an outlier? One good answer can be found here at wolfram.com and another good explanation here. If you are interested in the Wikipedia answer, that can be found here.

So how do we look for outliers? Say we only have one variable. A common way of defining outliers (suggested at wolfram and in most intro stat classes) is to look for observations that are above Q3+1.5*IQR or below Q1-1.5*IQR. Here Q1 is the first quartile of the data (the point below which 25% of the data falls) and Q3 is the third quartile (the point below which 75% of the data falls). IQR is the interquartile range and is defined to be Q3-Q1. Below is a Box and Whisker plot of 99 random observations from a standard normal distribution and one observation with value 10. The box represents the IQR, while points outside of the whiskers are considered outliers.

Now consider that we wish to relate two variables, X and Y, with simple linear regression using the model Y=B0+B1*X+epsilon (where epsilon ~ N(0,sigma^2) and sigma^2 is fixed but unknown). It is very important to look for outliers in the X direction, as they may heavily impact the final estimates of B0 and B1. A measure called leverage is used to check for outliers in the X direction. The leverage for the i-th point is defined as the i-th diagonal element of the projection matrix P=X*g-inv(X’X)*X’, where g-inv denotes a generalized inverse.

The first column in the data below is the response variable Y, the second column is a column of ones for the intercept, and the third column is the predictor variable. Therefore, in my formula for P, the X matrix I am talking about is the second and third columns of the data.
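Before getting to the data, here is a minimal sketch of the one-variable 1.5*IQR rule described above. It is only an illustration: the function name `iqr_outliers`, the seed, and the simulated sample are my own assumptions, and the "inclusive" quartile method is just one of several conventions, so other software may put the fences in slightly different places.

```python
import random
import statistics


def iqr_outliers(values):
    """Return the values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Mimic the setup in the text: 99 standard normal draws plus one value of 10.
random.seed(0)  # arbitrary seed, chosen only for reproducibility
sample = [random.gauss(0, 1) for _ in range(99)] + [10.0]

outliers = iqr_outliers(sample)
print(outliers)
```

The value 10 should always be flagged here. A few of the normal draws may be flagged as well, which is expected: the rule marks points as unusual, not necessarily as wrong.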

| Y | Intercept | X |
|---|---|---|
| -2 | 1 | 1 |
| -1 | 1 | 1 |
| 0 | 1 | 10 |
| 2 | 1 | 1 |
| 2 | 1 | 1 |
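As a rough sketch, the leverages for the five rows of data above can be computed by hand in plain Python. I am assuming here that X’X is invertible (it is for this data), so the generalized inverse is just the ordinary inverse. The variable and function names are my own.

```python
# Rows of the data: (Y, intercept column, predictor X).
data = [
    (-2, 1, 1),
    (-1, 1, 1),
    (0, 1, 10),
    (2, 1, 1),
    (2, 1, 1),
]

# Design matrix X: the second and third columns of the data.
X = [(row[1], row[2]) for row in data]

# Entries of the symmetric 2x2 matrix X'X = [[a, b], [b, d]].
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)

# Ordinary 2x2 inverse (assumes the determinant is nonzero).
det = a * d - b * b
inv = ((d / det, -b / det), (-b / det, a / det))


def leverage(row):
    """i-th diagonal of P = X (X'X)^-1 X', i.e. x_i' (X'X)^-1 x_i."""
    u, v = row
    return (u * (inv[0][0] * u + inv[0][1] * v)
            + v * (inv[1][0] * u + inv[1][1] * v))


leverages = [leverage(r) for r in X]
print([round(h, 4) for h in leverages])  # → [0.25, 0.25, 1.0, 0.25, 0.25]
```

The third observation (X=10) has leverage 1, the largest value a leverage can take, while the other four points share the rest equally. The leverages sum to 2, the number of columns in X.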