Being wrong in the wild.
Well, it looks like I was wrong. In one of my posts from the beginning of the NFL playoffs, I spent quite some time complaining about how the Chargers and the Cardinals made the play-offs at 8-8 and 9-7, respectively. Looks like the Cardinals proved me wrong by making it all the way to the Super Bowl, making them the first 4 seed to reach the Super Bowl since the league switched to 4 divisions in 2002.
Why had a 4 seed never made it to the Super Bowl (until this year) in the era of 4 divisions? Because the 4 seed is usually the worst team in the play-offs. Since the switch to 4 divisions, at least one wild card team has had a better record than the 4 seed in 5 of the 7 seasons. No 4 seed has ever won a Super Bowl in the 4-division era, and Arizona is the first one even to reach it.
(Note: This is the first Super Bowl in the 4 division era not to feature a number 1 seed.)
And what happens when we look back to when the NFL had only three divisions? We see that the 3 seed never made it to a Super Bowl, but the 4 seed (the highest wild card team) made it to, and won, the Super Bowl twice in the six seasons between 1996 and 2001. (I'm eventually going to get around to going back as far as I can, but for now I only have data going back to 1996.)
So here is my suggestion: seed teams based on record. If you win your division at 8-8 (San Diego) or 9-7 (Arizona), you should be a lower seed than a wild card team that went 11-5 (Baltimore) or 12-4 (Indianapolis). If you are trying to have the fairest possible play-off, you should seed based on wins, not how it is currently done. Tennessee got royally screwed in the play-offs this year because of this system. Their reward for being the number one seed? A first-round bye and then a showdown with the 11-5 5th seed, which they lost, while the number 2 seed had a relatively easy ride to the conference championship by beating the lowly 8-8 4th-seeded San Diego Chargers. It just doesn't make sense to seed San Diego higher than Baltimore this season. Baltimore is the better team and had more wins. Make things fair and seed based on record.
Final note: I still think Arizona is terrible. But I have been wrong three weeks in a row. However, let me remind you that a 4 seed has never won a Super Bowl in the 4-division era. And I still think they are terrible. Thus, the statsinthewild blog's official pick for Super Bowl XLIII is the Pittsburgh Steelers, 24-13.
Cheers.
Losing money in the wild
This link has a good graphical representation of just how much value has been lost by some major banks.
Cheers.
Factor Analysis in the stock market (in the wild)
Well, I’m done with my qualifying exam. I’ll know if I passed by late this week/early next week.
Anyway, here is a short project that I did on factor analysis in November.
Cheers.
Introduction
A major market index in the United States is the Dow Jones Industrial Average. The stock prices of thirty large companies contribute to the calculation of the Dow Jones Industrial Average. These companies are Boeing, Caterpillar, Chevron, Citigroup, Coca Cola, DuPont, Exxon Mobil, General Electric, General Motors, Hewlett Packard, Home Depot, IBM, Intel, Johnson and Johnson, JP Morgan Chase, Kraft Foods, McDonalds, Merck, Microsoft, Pfizer, Procter and Gamble, United Technologies, Verizon, WalMart, Walt Disney, Bank of America, AT and T, American Express, Alcoa, and 3M Company.
The changes in the prices of these stocks will be highly correlated, as they are all part of the larger market. Factor analysis will be used to reduce the dimensionality of the 30 stocks in the Dow Jones average. This is being done because I am interested in seeing which stocks' prices move together.
Data
Data were collected from the website finance.yahoo.com. The data consist of the high, low, opening, and closing price of each of the thirty stocks, as well as the volume of each stock, for each day. Stocks vary in the length for which they have historical data, as some companies have been public longer than others. As such, only the last 1000 trading days are considered in the analysis. This includes all data dating back to November 19, 2004. Rather than consider the actual price of each stock (since some stock prices are much higher or lower than others), the change in stock price from one closing bell to the next is considered for all thirty stocks.
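(For anyone who wants to reproduce the preprocessing: the data were pulled by hand from finance.yahoo.com, but a rough Python sketch of the differencing step might look like the following. The file names and column names are assumptions for illustration, not the actual files used.)

```python
import pandas as pd

# The 30 Dow tickers used in the analysis
TICKERS = ["AA", "AXP", "BA", "BAC", "C", "CAT", "CVX", "DD", "DIS", "GE",
           "GM", "HD", "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT", "KO", "MCD",
           "MMM", "MRK", "MSFT", "PFE", "PG", "T", "UTX", "VZ", "WMT", "XOM"]

def closing_diffs(ticker, n_days=1000):
    """Day-over-day change in closing price over the last n_days trading days."""
    px = (pd.read_csv(f"{ticker}.csv", parse_dates=["Date"])  # hypothetical Yahoo-style CSV per ticker
            .sort_values("Date")
            .tail(n_days + 1))                                 # one extra day so the diff has n_days values
    return px["Close"].diff().dropna().reset_index(drop=True)

# 1000-by-30 matrix of closing-price changes, one column per stock
diffs = pd.DataFrame({t: closing_diffs(t) for t in TICKERS})
```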
Analysis
Using SAS 9.2, a factor analysis was implemented for the differences in closing prices for the 30 Dow Jones stocks over the last 1000 days. A sufficient number of factors was chosen using a scree plot \cite{scree} and by analyzing the eigenvalues of the correlation matrix. Upon finding the principal components, the varimax \cite{Johnson} method was used to find a final rotated factor solution.
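(The analysis itself was run in SAS; for readers without SAS, here is a rough Python sketch of the same steps, continuing from the diffs data frame above. It is only an approximation: scikit-learn's factor extraction is not identical to PROC FACTOR's defaults, so the loadings will not match the SAS output exactly.)

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Eigenvalues of the correlation matrix, used for the scree plot
corr = np.corrcoef(diffs.to_numpy(), rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(eigenvalues)  # look for the "elbow" to choose the number of factors

# Five-factor solution with a varimax rotation (requires scikit-learn >= 0.24)
z = (diffs - diffs.mean()) / diffs.std()       # standardize so loadings are on the correlation scale
fa = FactorAnalysis(n_components=5, rotation="varimax").fit(z)
loadings = pd.DataFrame(fa.components_.T, index=diffs.columns,
                        columns=[f"Factor {i + 1}" for i in range(5)])
print(loadings.round(3))
```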
| Stock | Factor 1 | Factor 2 | Factor 3 | Factor 4 | Factor 5 |
|-------|----------|----------|----------|----------|----------|
| AA | 0.23127 | 0.09893 | 0.23084 | 0.73216 | 0.04921 |
| AXP | 0.67590 | 0.27178 | 0.30486 | 0.21506 | 0.10536 |
| BA | 0.22446 | 0.24467 | 0.32912 | 0.32123 | 0.33817 |
| BAC | 0.82352 | 0.25896 | 0.14310 | 0.14884 | 0.13932 |
| C | 0.79975 | 0.24980 | 0.14993 | 0.15357 | 0.09632 |
| CAT | 0.19406 | -0.06364 | 0.12838 | 0.45805 | 0.38608 |
| CVX | 0.13066 | 0.40635 | 0.20625 | 0.76029 | 0.08143 |
| DD | 0.39500 | 0.30897 | 0.27833 | 0.43173 | 0.30426 |
| DIS | 0.37520 | 0.43313 | 0.42601 | 0.27117 | 0.13823 |
| GE | 0.61010 | 0.27886 | 0.31333 | 0.21191 | 0.14694 |
| GM | 0.50475 | 0.06398 | 0.15834 | 0.19736 | 0.00832 |
| HD | 0.53510 | 0.24265 | 0.37274 | 0.02793 | 0.25287 |
| HPQ | 0.24719 | 0.19816 | 0.70506 | 0.24809 | 0.02544 |
| IBM | 0.33248 | 0.19050 | 0.68158 | 0.21392 | 0.10399 |
| INTC | 0.32603 | 0.20892 | 0.61192 | 0.21961 | 0.11821 |
| JNJ | 0.15935 | 0.71031 | 0.23464 | 0.10390 | 0.17792 |
| JPM | 0.80200 | 0.25962 | 0.19675 | 0.08128 | 0.15698 |
| KFT | 0.28979 | 0.45453 | 0.18234 | 0.20696 | 0.14068 |
| KO | 0.10981 | 0.60928 | 0.39598 | 0.08037 | 0.18114 |
| MCD | 0.26140 | 0.40935 | 0.36831 | 0.15274 | 0.36240 |
| MMM | 0.35237 | 0.31052 | 0.31527 | 0.35182 | 0.22999 |
| MRK | 0.18019 | 0.67967 | 0.06238 | 0.17023 | -0.06161 |
| MSFT | 0.14925 | 0.35621 | 0.65338 | 0.20696 | 0.12790 |
| PFE | 0.37601 | 0.57298 | 0.08865 | 0.15824 | -0.00815 |
| PG | 0.20504 | 0.69431 | 0.20156 | 0.17265 | 0.25142 |
| T | 0.36670 | 0.53525 | 0.32948 | 0.27273 | 0.00919 |
| UTX | 0.13055 | 0.15316 | 0.07721 | 0.11207 | 0.79017 |
| VZ | 0.37186 | 0.52181 | 0.37188 | 0.19101 | 0.01283 |
| WMT | 0.37782 | 0.46428 | 0.35919 | 0.06918 | 0.23774 |
| XOM | 0.13787 | 0.44943 | 0.21108 | 0.74470 | 0.09553 |
Results
Keeping five factors, we can see which stocks load heavily onto which factors by looking at the table. The variables that load heavily onto the first factor include American Express (AXP), Bank of America (BAC), Citigroup (C), General Electric (GE), General Motors (GM), Home Depot (HD), and JP Morgan (JPM). With the exception of Home Depot and General Motors, all of these companies are financial institutions, and General Motors and Home Depot are heavily affected by the availability of credit from those institutions: GM sells big-ticket items (cars), and HD is heavily tied to people buying houses and is thus affected by the mortgage market. It appears that this first factor explains variation related to the financial sector.
The companies that are heavily loaded onto the second factor include Chevron (CVX), Disney (DIS), Johnson and Johnson (JNJ), Kraft Foods (KFT), Coca Cola (KO), McDonalds (MCD), Merck (MRK), Pfizer (PFE), Procter and Gamble (PG), AT and T (T), Verizon (VZ), Wal-Mart (WMT), and Exxon-Mobil (XOM). All of these companies sell items directly to consumers, and the cost involved in each of these consumer transactions is relatively small. So it appears this second factor is explaining the variation due to the individual consumer.
The third factor includes Disney, Hewlett-Packard, IBM, Intel, and Microsoft. These companies, with the glaring exception of Disney, are all tied to computers. Thus, it appears that the third factor explains variation due to the computer industry. Factor four includes companies such as Alcoa, Caterpillar, Chevron, DuPont, and Exxon-Mobil and appears to explain variation in the manufacturing market. Both Chevron and Exxon-Mobil are heavily loaded on both factor 2 and factor 4. This makes sense, since both companies can essentially break their earnings down into two components: individual consumer sales and sales to other businesses.
Factor five includes United Technologies by itself, which is interesting because UTX holds such a large variety of companies, including Carrier, Hamilton Sundstrand, Otis Elevator, Pratt and Whitney, and Sikorsky helicopters.
Conclusions
The movements in the prices of the 30 stocks that comprise the Dow Jones Industrial Average are highly correlated. As such, they are a prime candidate for factor analysis and dimensionality reduction. Using five factors, we can group the variability in the stock market into categories. Roughly speaking, the three categories that explain the most variation are financials, consumer goods, and technology. The fourth and fifth factors seem to represent approximately the same dimension, namely manufacturing and industry.
Using this factor analysis, we can now view fluctuations in the stock market based on groups rather than individual stocks. We have reduced the dimensionality of the stocks in the Dow Jones from 30 down to 5 while still explaining 60 percent of the variability, greatly simplifying analysis of this stock data.
Future work in this direction could include using more than the past 1000 days of data and possibly including more than 30 stocks in the factor analysis.
Boycotts in the Wild
Today on Slate, there is an article about Hal Stern's call for all quantitative analysts to boycott the BCS. I would like to let everyone know that in addition to Hal Stern and Bill James, the Stats in the Wild Blog is officially joining the boycott. That ought to bring the BCS to its knees.
A few notes about the article:
1. Statistical analysis is fantastic for ranking teams, but I don't think it should have any place in deciding who wins a national championship. I think it is always preferable to have a deterministic set of criteria to decide who goes to the post season, like in all professional sports. That way no one can complain when someone gets into the play-offs. (Except in the NFL, when the criteria let in 9-7 Arizona and 8-8 San Diego and leave out 11-5 New England.....) If you want to get in, meet the criteria and stop whining (like me).
In NCAA basketball there are both deterministic criteria (win your league) and at-large bids for getting into the tournament. Every year teams are left out of the tournament, but, by including 65 teams, I doubt that any team with a legitimate chance to win a national title has ever been left out. And with the tournament the way it is, it allows teams to make great, memorable post-season runs. (Remember in 1997 when Arizona was a four seed and they beat THREE number one seeds on their way to a national championship? If NCAA basketball were run like NCAA football, Arizona would have won some horseshit bowl game and no one would remember.)
If football went to an 8 (or 16) team play-off, sure, some good teams still wouldn't get in, but 8 is probably enough that you aren't leaving out anyone who has a real shot to win it.
(Note: These last two notes have nothing to do with statistics and ramble on for much longer than they should. Enjoy.)
2. I hate the NCAA. I hate that young athletes are risking serious injuries and making no money (except a scholarship), while the schools, the administrators, the head coaches, and the television stations are collectively making billions of dollars. The kids see none of this money. And I hate it even more when these scumbag college coaches try to get a kid to stay for a 3rd or 4th year of eligibility instead of entering the draft. Easy for a coach to say when he is making 3 million dollars a year. If someone had offered me a few million a year to do anything when I was a sophomore or junior in college, I would have left in an instant.
3. This is the last paragraph of the article:
“When it doesn’t, you can put the blame on the greedy small schools that wanted to milk money from the big football factories, on the greedy big schools that wanted to keep as much money as possible in the fewest possible hands, on the lunk-head football coaches who can’t program a computer to play tic-tac-toe but want to make all the rules, or on the Congress that sits idly by and watches it happen. You guys want to make a mess of this, you can make a mess of it without our help.”
Listen. I am all for a play-off in college football, but Congress doesn't need to get involved. If I were to make a list of things that are (or should be) more important to Congress than the BCS and college football, I would never stop writing. You realize we are in TWO wars AND the worst economic crisis since the Great Depression AND are facing a 1.2 trillion dollar deficit. And you want to fix the BCS? Have some perspective.
Examples of Congress and the Senate being ridiculous about football: Arlen Specter and T.O., Arlen Specter trying to punish the Patriots, Orrin Hatch whining about Utah's team, and Cliff Stearns asking Congress to postpone votes so he can attend the BCS championship game.
A short open letter:
Dear Orrin Hatch, Arlen Specter, and Cliff Stearns,
Grow up.
Sincerely,
The Statsinthewild Blog.
Stats in the NFL: Part 2 (in the wild)
In a previous post, I talked about parity (or the lack of it) in the NFL between teams. What about parity between divisions? What we want to do is test the null hypothesis that all divisions are the same (i.e., will end the season with the same number of wins). To do this we can use the Pearson chi-square test statistic.
In the table below, I have the year in the first column, the value of the test statistic in the second column, and the p-value in the third column. In all of the years, we never end up rejecting the null hypothesis of even strength between the divisions at the commonly used alpha = .05 level. However, we can use the p-values to look at the relative parity among the divisions. A small p-value (which in this case corresponds to a large test statistic) indicates that there is not parity among the divisions.
| Year | Test statistic | p-value |
|------|----------------|---------|
| 2008 | 12.765625 | 0.07802902 |
| 2007 | 8.937500 | 0.25717451 |
| 2006 | 1.250000 | 0.98972973 |
| 2005 | 2.437500 | 0.93172939 |
| 2004 | 3.312500 | 0.85466795 |
| 2003 | 1.062500 | 0.99375635 |
| 2002 | 4.078125 | 0.77073626 |
Let's take a look at some of the interesting years.
In 2003, the win totals of the divisions were 36, 29, 34, 31, 31, 31, 31, and 33. The first four are the AFC East, North, South, and West, respectively; the last four are the NFC divisions in the same order. The best division in the league won only 7 more games than the worst division.
Let's compare that with 2008. The win totals by division were 38, 31.5, 38, 23, 38.5, 25, 40, and 22 (again in the same order as before). The difference between the best division and the worst division here is 12 games. And look at the miserable AFC and NFC West: 23 and 22 wins, respectively. That should win those divisions a collective award for futility. Since 2002 (when the NFL moved to 8 divisions of 4 teams), the fewest wins for a division had been 25. Two divisions managed to break that record in ONE season, and a third division tied it.
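(For anyone who wants to check the arithmetic, here is a minimal sketch of the division-level test using the 2008 totals above; it reproduces the 12.77 statistic and 0.078 p-value in the 2008 row of the table.)

```python
import numpy as np
from scipy import stats

# 2008 division win totals: AFC East/North/South/West, then NFC in the same order
wins_2008 = np.array([38, 31.5, 38, 23, 38.5, 25, 40, 22])
expected = np.full(8, wins_2008.sum() / 8)            # 256 total wins spread evenly: 32 per division
x2 = ((wins_2008 - expected) ** 2 / expected).sum()   # Pearson chi-square statistic
p = stats.chi2.sf(x2, df=len(wins_2008) - 1)          # df = 7
print(x2, p)                                          # about 12.77 and 0.078
```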
Let's put into perspective just how bad the AFC and NFC West were this year. The Detroit Lions went 0-16. Zero wins. And their division still managed 25 wins this season, 2 more than the AFC West and 3 more than the atrocious NFC West. The west divisions are terrible.
That is why it makes me so angry that both the Chargers and the Cardinals won play-off games this past weekend. Will someone remind the Chargers that they started the season 4-8 and are champions of the woeful AFC West? And will someone remind the Cardinals that they are the Cardinals?
Not convinced the NFC West sucks? How about this. The net points for the teams in the NFC West this season were:
Rams: -233
Seahawks: -98
49ers: -42
and Cardinals: (drumroll please……………….) 1.
Let me repeat that. The Arizona Cardinals, winners of an actual NFL division and winners of a play-off game this year, outscored their opponents in the regular season 427 to 426.
Actually, upon further inspection, in 2006, two teams made the play-offs with negative net points. The Giants were outscored by their opponents by 7 points, and the Seahawks won the NFC West with a net total of -6 points on the season. In fact, every team in the NFC West that year had negative net points. That is fairly impressive.
So congratulations to the Arizona Cardinals and the San Diego Chargers. Champions of futility. And also play-off winners.
Cheers.
Unhappiness in the wild
Happy New Year! Here is an article from the Freakonomics blog tracking Americans' (un)happiness.
Cheers
Stats in the NFL: Part 1 (in the wild)
So the Jets blew it. They were 8-3 and they finished the season 1-4 to miss the playoffs at 9-7.
I was reading this article on the Freakonomics blog, which included this stat line for the Jets' quarterback:
Attempts/completions: 175/98
Passing yards: 1,011
Touchdowns: 2
Interceptions: 9
Sacks: 9
Passer rating: 55.4
This is terrible. Make the guy retire. He WAS great. WAS.
So anyway, the Jets probably should have made the play-offs, but they just lost too many games. The Patriots, on the other hand, probably deserved to make the play-offs and didn't. I know there are rules and criteria, but do the Cardinals (who got massacred by the Patriots) or the Chargers (who started the season 4-8) really deserve to be in the play-offs over the Patriots?
Probably not, but the rules are what they are. There just aren't any good teams (even mediocre teams) in either the AFC or NFC West this year.
This got me thinking. Don't they always talk about parity in the NFL? How many commentators every week do you hear throwing around parity this and parity that? Well, where was the parity this year? Is there less parity in the league now than in past years? How can we measure parity?
Let's start with what a good measure of parity would be. If there were perfect parity in the league, every team would finish 8-8. This would be like flipping a coin to decide every game. (Not exactly the most compelling sports league.) The opposite (teams deviating from an 8-8 record), a lack of parity, can be thought of as entropy. Most notably, over the last two seasons we have had a 16-win team and a 0-win team. Clearly, these teams were significantly better and worse, respectively, than the other teams in the league. So how can we measure parity? Let's put all 32 teams into a 32-by-2 contingency table: 32 rows, one for each team, and 2 columns, one for wins and one for losses. (This leads to fixed row and fixed column totals.)
We wish to test the null hypothesis that there is no association between the rows and columns (i.e., the team you play for has nothing to do with the number of wins you get). Clearly this is not true, and we will always reject the null hypothesis of no association, but we can compare by how much we reject it, namely via the p-value. The smaller the p-value, the less parity in the league. While this is not a perfect measure, it's an interesting start.
Since we have fixed row and column totals, normally this would lead to using Fisher's exact test. However, with a 32-by-2 table this is computationally very intensive. Thus, as an alternative, we can use the Pearson chi-square test statistic and the G-squared test statistic. Here I report both. I am going to rely more heavily on the G-squared statistic because of its close relationship with entropy.
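(Here is a minimal sketch of how both statistics can be computed on the 32-by-2 table; it is my own illustration of the approach, not the code that produced the output below.)

```python
import numpy as np
from scipy import special, stats

def parity_tests(wins, games=16):
    """Pearson chi-square and G-squared for a (teams x 2) wins/losses table."""
    wins = np.asarray(wins, dtype=float)
    table = np.column_stack([wins, games - wins])                    # one row per team: wins, losses
    expected = np.outer(table.sum(1), table.sum(0)) / table.sum()    # every cell is 8 when all teams play 16
    x2 = ((table - expected) ** 2 / expected).sum()                  # Pearson chi-square
    g2 = 2 * special.xlogy(table, table / expected).sum()            # G-squared; zero cells contribute 0
    df = (table.shape[0] - 1) * (table.shape[1] - 1)                 # 31 for a 32-team season
    return x2, g2, stats.chi2.sf(g2, df)

# Usage: pass a list of the 32 regular-season win totals for a given year,
# e.g. x2, g2, p = parity_tests(wins_2008_by_team)
```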
| G-squared | p-value | Year | Teams | SDs above mean |
|-----------|---------|------|-------|----------------|
| 95.62900 | 1.640269e-08 | 2008 | 32 | 7.953620 |
| 97.11005 | 9.714734e-09 | 2007 | 32 | 8.138756 |
| 69.63735 | 8.525854e-05 | 2006 | 32 | 4.704668 |
| 94.40966 | 2.518507e-08 | 2005 | 32 | 7.801207 |
| 79.79948 | 3.535543e-06 | 2004 | 32 | 5.974935 |
| 76.59059 | 9.912913e-06 | 2003 | 32 | 5.573824 |
| 56.98146 | 3.010793e-03 | 2002 | 32 | 3.122683 |
| 86.44988 | 2.244115e-07 | 2001 | 31 | 7.042142 |
| 79.80431 | 2.107774e-06 | 2000 | 31 | 6.198153 |
| 71.86353 | 2.721014e-05 | 1999 | 31 | 5.189674 |
This table summarizes the results of the G-squared test. Columns one and two are the G-squared test statistic and its p-value, and column three is the year in which the regular season took place. (Note that the 2009 Super Bowl corresponds to the 2008 regular season.) Column four is the number of teams in the league that year, and the last column is the number of standard deviations above the mean of the test statistic's null distribution. (The p-value would be the best way to compare seasons, but the scale of a p-value is difficult to visualize, so the graphs below use standard deviations. Also, the test statistics cannot be directly compared to one another because they have distributions with differing degrees of freedom.)
Over the past ten years, using the p-values of the G-squared statistic, the years with the most entropy were 2005, 2007, and 2008. The years with the most parity were 2002, 1999, and 2006.
Time for some pictures.
The first graph is a plot of NFL season versus what I am calling entropy (the number of standard deviations above the mean of the test statistic's null distribution). I have also labeled each year with the Super Bowl champion and the number of wins they had in the regular season. Notice that over the last four years we observe the three highest amounts of entropy.

Note 2002 in particular. Let's take a closer look at this year. There were no teams with 13 wins, and every team had at least two wins. Compare this to 2007, when there were FOUR teams with at least THIRTEEN wins (Dallas, Green Bay, Indianapolis, and undefeated New England) and one team (Miami) with only one win. The histograms of wins in the 2002 and 2007 seasons are below.


Look how tightly bunched the 2002 teams are in the middle and compare that with the 2007 season.
One last picture. The histogram of the 2008 NFL season.

Conclusions:
According to my measure of entropy, the level of parity in the NFL over three of the last 4 seasons has been very low. There does not, however, appear to be any upward trend in the amount of parity; rather, it seems as if the level of parity in the league varies trendlessly from year to year.
2002 was a season with unusually high parity with many teams finishing with similar records.
One final thought: I would argue that parity is bad for the league. When there is no standout team, it is difficult to market exciting games. Isn't it more compelling to watch a play-off game featuring teams who absolutely dominated the regular season (think Green Bay, Dallas, New England, and Indianapolis from 2007) than a slugfest of mediocrity between two teams that made it into the play-offs by default (I'm looking at you, Arizona and San Diego)? When parity is high, everyone is mediocre, but someone has to win by default. When there is high entropy, good teams exist. I'll take the latter any day of the week.
Stats in golf (in the wild)
Within the past few years I've started golfing fairly regularly. Last year, a few friends and I started tracking our progress on Oobgolf. We enter our scores, and it automatically tracks our handicaps. (I'm a 20.4, by the way.)
Anyway, at the end of the season, we had a big single elimination tournament with our handicaps. At some point during the tournament we got to talking about how two people could have the same handicap and be entirely different players.
Here is an extreme example:
Handicap is calculated using the best 10 scores from your last 20 rounds. Golfer one could play 20 rounds and shoot 90 in all of them. Golfer two could shoot 90 ten times and then 110 the other ten times. Both of these golfers would have the same handicap, but if you were going to play for money (or in our big end-of-the-season tournament), you'd rather play against golfer two, even though golfer one and golfer two have the same handicap.
We had a brief discussion about how you could quantify this disparity. Apparently, however, some other people have thought a lot harder about this.
This article in Chance from 2001 discusses how a "Steady Eddie" has an advantage over a "Wild Willie". (The Chance article uses order statistics, so you might want to check out this link for a brief description.)
Then this article proposes a measure called the anti-handicap, which is based on your worst ten scores from your last twenty rounds. By comparing a golfer's handicap and anti-handicap, some measure of the variability in a golfer's game can be assessed. As in our extreme example above, golfer one would have a handicap of 18 and an anti-handicap of 18. However, golfer two would have a handicap of 18, but an anti-handicap of 38.
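(A toy version of the two calculations, under the simplifying assumption that each number is just the average of the relevant ten scores relative to a par of 72, ignoring course rating and slope:)

```python
def handicap_pair(scores, par=72):
    """Simplified handicap and anti-handicap from the last 20 rounds:
    the averages of the best ten and worst ten scores relative to par."""
    last20 = sorted(scores[-20:])
    handicap = sum(last20[:10]) / 10 - par        # best ten rounds
    anti_handicap = sum(last20[-10:]) / 10 - par  # worst ten rounds
    return handicap, anti_handicap

steady = [90] * 20                # golfer one: 90 every round
wild = [90] * 10 + [110] * 10     # golfer two: half 90s, half 110s
print(handicap_pair(steady))      # (18.0, 18.0)
print(handicap_pair(wild))        # (18.0, 38.0)
```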
Cheers.
Best Statistical Graph ever drawn in the Wild
Finals are over and I hope to post more regularly again. Here is a quick picture.

This picture, by the French engineer Charles Joseph Minard, graphically depicts Napoleon's fateful march to Russia. The width of the line represents how many troops Napoleon had at each point on his way to Russia, and what makes this graphic so great is just how many different variables are displayed at once.
Edward Tufte says in his book, The Visual Display of Quantitative Information, "Minard's graphic tells a rich, coherent story with its multivariate data, far more enlightening than just a single number bouncing along over time. Six variables are plotted: the size of the army, its location on a two-dimensional surface, direction of the army's movement, and temperature on various dates during the retreat from Moscow."
In the last line of the description below the graph, Tufte says, "It may well be the best statistical graphic ever drawn," which, in my opinion, may be the best claim ever made about a statistical graphic.
Cheers.
For Stanley:
Complete caption of the graphic:
"This classic of Charles Joseph Minard (1781-1870), the French engineer, shows the terrible fate of Napoleon's army in Russia. Described by E. J. Marey as seeming to defy the pen of the historian by its brutal eloquence, this combination of data map and time-series, drawn in 1861, portrays the devastating losses suffered in Napoleon's Russian campaign of 1812. Beginning at the left on the Polish-Russian border near the Niemen River, the thick band shows the size of the army (422,000 men) as it invaded Russia in June 1812. The width of the band indicates the size of the army at each place on the map. In September, the army reached Moscow, which was by then sacked and deserted, with 100,000 men. The path of Napoleon's retreat from Moscow is depicted by the darker, lower band, which is linked to a temperature scale and dates at the bottom of the chart. It was a bitterly cold winter, and many froze on the march out of Russia. As the graphic shows, the crossing of the Berezina River was a disaster, and the army finally struggled back into Poland with only 10,000 men remaining. Also shown are the movements of auxiliary troops, as they sought to protect the right flank of the advancing army. Minard's graphic tells a rich, coherent story with its multivariate data, far more enlightening than just a single number bouncing along over time. Six variables are plotted: the size of the army, its location on a two-dimensional surface, direction of the army's movement, and temperature on various dates during the retreat from Moscow. It may well be the best statistical graphic ever drawn."
Stats in Baseball: Runs per game (in the wild)
Here is a good ESPN article from 2002 about a baseball statistic called runs per game developed by Harvard professor of statistics Carl Morris.
Cheers.