Factor Analysis in the stock market (in the wild)
Well, I’m done with my qualifying exam. I’ll know if I passed by late this week/early next week.
Anyway, here is a short project that I did on factor analysis in November.
Cheers.
Introduction
A major market index in the United States is the Dow Jones Industrial Average. The stock prices of thirty large companies contribute to the calculation of the Dow Jones Industrial Average. These companies are Boeing, Caterpillar, Chevron, Citigroup, Coca Cola, DuPont, Exxon Mobil, General Electric, General Motors, Hewlett Packard, Home Depot, IBM, Intel, Johnson and Johnson, JP Morgan Chase, Kraft Foods, McDonalds, Merck, Microsoft, Pfizer, Procter and Gamble, United Tech, Verizon, WalMart, Walt Disney, Bank of America, AT and T, American Express, Alcoa, and 3M Company.
The changes in the prices of these stocks will be highly correlated, as they are all part of the larger market. Factor analysis will be used to reduce the dimensionality of the 30 stocks in the Dow Jones average. This is being done because I am interested in seeing which stocks' prices move together.
Data
Data was collected from the website finance.yahoo.com. The data consists of the high, low, opening, and closing price of each of the thirty stocks, as well as the volume of each stock, for each day. Stocks vary in the length of their available price history, as some companies have been public longer than others. As such, only the last 1000 trading days are considered in the analysis. This includes all data dating back to November 19, 2004. Rather than consider the actual price of the stock (since some stock prices are much higher or lower than others), the change in stock price from one closing bell to the next is considered for all thirty stocks.
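As a rough illustration of the differencing step described above, here is a minimal Python sketch (the original analysis was done in SAS, and this is not the code that was used). The function name `daily_changes` and the toy prices are made up for the example; the only assumption is that closing prices are arranged with one row per trading day (oldest first) and one column per stock.

```python
import numpy as np

def daily_changes(closes):
    """Close-to-close changes: rows are days (oldest first), columns are stocks."""
    closes = np.asarray(closes, dtype=float)
    # Subtract each day's close from the next day's close, column by column.
    return np.diff(closes, axis=0)

# Hypothetical closing prices for 4 days and 2 stocks (not the post's data).
closes = [[10.00, 100.0],
          [10.50,  99.0],
          [10.25, 101.0],
          [11.00, 101.5]]

changes = daily_changes(closes)
print(changes)
```

Note that differencing n days of closing prices yields n - 1 changes, so the number of rows going into the factor analysis is one less than the number of closing prices.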
Analysis
Using SAS 9.2, a factor analysis was performed on the differences in closing prices for the 30 Dow Jones stocks over the last 1000 trading days. The number of factors was chosen using a scree plot \cite{scree} and by examining the eigenvalues of the correlation matrix. After extracting the principal components, the varimax \cite{Johnson} method was used to find a final rotated factor solution.
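For readers without SAS, the steps in this section can be sketched in Python with NumPy. This is only a sketch under stated assumptions, not a reproduction of the SAS run: the function names are hypothetical, the data is simulated, and the varimax routine is the common SVD-based iteration on raw loadings (SAS may apply Kaiser normalization before rotating, so numbers from a real run could differ slightly).

```python
import numpy as np

def principal_factor_loadings(X, n_factors):
    """Eigenvalues of the correlation matrix and principal-component loadings."""
    R = np.corrcoef(X, rowvar=False)          # columns of X are variables
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings are eigenvectors scaled by the square root of their eigenvalue.
    loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
    return eigvals, loadings

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation (raw loadings, no Kaiser normalization)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        d_old = d
        Lr = L @ rotation
        col_ss = np.diag((Lr ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(L.T @ (Lr ** 3 - (Lr @ col_ss) / p))
        rotation = u @ vt                      # product of orthogonal matrices
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return L @ rotation

# Simulated data: 500 observations of 6 variables driven by 2 hidden factors.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 2))
X = np.hstack([common[:, [0]] + 0.5 * rng.normal(size=(500, 3)),
               common[:, [1]] + 0.5 * rng.normal(size=(500, 3))])

eigvals, raw = principal_factor_loadings(X, n_factors=2)
rotated = varimax(raw)
print(eigvals)   # plotting these in order gives the scree plot
print(rotated)
```

The eigenvalues of a correlation matrix sum to the number of variables, which is why they are convenient for a scree plot, and an orthogonal rotation such as varimax leaves each variable's communality (the row sum of squared loadings) unchanged.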
Stock | Factor 1 | Factor 2 | Factor 3 | Factor 4 | Factor 5 |
AA | 0.23127 | 0.09893 | 0.23084 | 0.73216 | 0.04921 |
AXP | 0.67590 | 0.27178 | 0.30486 | 0.21506 | 0.10536 |
BA | 0.22446 | 0.24467 | 0.32912 | 0.32123 | 0.33817 |
BAC | 0.82352 | 0.25896 | 0.14310 | 0.14884 | 0.13932 |
C | 0.79975 | 0.24980 | 0.14993 | 0.15357 | 0.09632 |
CAT | 0.19406 | -0.06364 | 0.12838 | 0.45805 | 0.38608 |
CVX | 0.13066 | 0.40635 | 0.20625 | 0.76029 | 0.08143 |
DD | 0.39500 | 0.30897 | 0.27833 | 0.43173 | 0.30426 |
DIS | 0.37520 | 0.43313 | 0.42601 | 0.27117 | 0.13823 |
GE | 0.61010 | 0.27886 | 0.31333 | 0.21191 | 0.14694 |
GM | 0.50475 | 0.06398 | 0.15834 | 0.19736 | 0.00832 |
HD | 0.53510 | 0.24265 | 0.37274 | 0.02793 | 0.25287 |
HPQ | 0.24719 | 0.19816 | 0.70506 | 0.24809 | 0.02544 |
IBM | 0.33248 | 0.19050 | 0.68158 | 0.21392 | 0.10399 |
INTC | 0.32603 | 0.20892 | 0.61192 | 0.21961 | 0.11821 |
JNJ | 0.15935 | 0.71031 | 0.23464 | 0.10390 | 0.17792 |
JPM | 0.80200 | 0.25962 | 0.19675 | 0.08128 | 0.15698 |
KFT | 0.28979 | 0.45453 | 0.18234 | 0.20696 | 0.14068 |
KO | 0.10981 | 0.60928 | 0.39598 | 0.08037 | 0.18114 |
MCD | 0.26140 | 0.40935 | 0.36831 | 0.15274 | 0.36240 |
MMM | 0.35237 | 0.31052 | 0.31527 | 0.35182 | 0.22999 |
MRK | 0.18019 | 0.67967 | 0.06238 | 0.17023 | -0.06161 |
MSFT | 0.14925 | 0.35621 | 0.65338 | 0.20696 | 0.12790 |
PFE | 0.37601 | 0.57298 | 0.08865 | 0.15824 | -0.00815 |
PG | 0.20504 | 0.69431 | 0.20156 | 0.17265 | 0.25142 |
T | 0.36670 | 0.53525 | 0.32948 | 0.27273 | 0.00919 |
UTX | 0.13055 | 0.15316 | 0.07721 | 0.11207 | 0.79017 |
VZ | 0.37186 | 0.52181 | 0.37188 | 0.19101 | 0.01283 |
WMT | 0.37782 | 0.46428 | 0.35919 | 0.06918 | 0.23774 |
XOM | 0.13787 | 0.44943 | 0.21108 | 0.74470 | 0.09553 |
Results
Keeping five factors, we can see which stocks load heavily onto which factors by looking at the table. The variables that load heavily onto the first factor include American Express (AXP), Bank of America (BAC), Citigroup (C), General Electric (GE), General Motors (GM), Home Depot (HD), and JP Morgan (JPM). With the exception of Home Depot and General Motors, all of these companies are financial institutions. General Motors and Home Depot are heavily affected by the availability of credit from these institutions, as GM sells large ticket items (cars) and HD is heavily tied to people buying houses, and thus affected by the mortgage market. It appears that this first factor explains variation related to the financial sector.
The companies that are heavily loaded onto the second factor include Chevron (CVX), Disney (DIS), Johnson and Johnson (JNJ), Kraft Foods (KFT), Coca Cola (KO), McDonalds (MCD), Merck (MRK), Pfizer (PFE), Procter and Gamble (PG), AT and T (T), Verizon (VZ), Wal-Mart (WMT), and Exxon-Mobil (XOM). All of these companies sell items directly to consumers, and the costs involved in each of these transactions with consumers are relatively small. So, it appears this second factor is explaining the variation due to the individual consumer.
The third factor includes Disney, Hewlett-Packard, IBM, Intel, and Microsoft. With the glaring exception of Disney, these are all companies tied to computers. Thus, it appears that the third factor explains variation due to the computer industry. Factor four includes companies such as Alcoa, Caterpillar, Chevron, DuPont, and Exxon-Mobil, and appears to explain variation in the manufacturing market. Both Chevron and Exxon-Mobil load heavily on both factor 2 and factor 4. This makes sense, since both companies can essentially break down their earnings into two components: individual consumer sales and sales to other businesses.
Factor five includes United Technologies by itself, which is interesting because UTX holds such a large variety of companies, including Carrier, Hamilton-Sundstrand, Otis elevators, Pratt and Whitney, and Sikorsky Helicopter.
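The groupings above can be read off the loading table mechanically. The sketch below does this for a handful of rows copied from the table, treating a loading of at least 0.4 as "heavy". That cutoff is an assumption for illustration (the post does not state one), although it is consistent with the groups listed above; the names `CUTOFF` and `heavy_factors` are made up for the example.

```python
CUTOFF = 0.4  # assumed threshold for a "heavy" loading; not stated in the post

# A few rows copied from the loading table (Factor 1 through Factor 5).
loadings = {
    "AA":  [0.23127, 0.09893, 0.23084, 0.73216, 0.04921],
    "BAC": [0.82352, 0.25896, 0.14310, 0.14884, 0.13932],
    "CVX": [0.13066, 0.40635, 0.20625, 0.76029, 0.08143],
    "DIS": [0.37520, 0.43313, 0.42601, 0.27117, 0.13823],
    "IBM": [0.33248, 0.19050, 0.68158, 0.21392, 0.10399],
    "JNJ": [0.15935, 0.71031, 0.23464, 0.10390, 0.17792],
    "UTX": [0.13055, 0.15316, 0.07721, 0.11207, 0.79017],
    "BA":  [0.22446, 0.24467, 0.32912, 0.32123, 0.33817],
}

def heavy_factors(row, cutoff=CUTOFF):
    """Return the 1-based factor numbers whose absolute loading meets the cutoff."""
    return [i + 1 for i, value in enumerate(row) if abs(value) >= cutoff]

# Collect tickers under every factor they load heavily on.
groups = {factor: [] for factor in range(1, 6)}
for ticker, row in loadings.items():
    for factor in heavy_factors(row):
        groups[factor].append(ticker)

for factor, tickers in groups.items():
    print(f"Factor {factor}: {tickers}")
# Factor 1: ['BAC']
# Factor 2: ['CVX', 'DIS', 'JNJ']
# Factor 3: ['DIS', 'IBM']
# Factor 4: ['AA', 'CVX']
# Factor 5: ['UTX']
```

A stock can land in more than one group (Chevron and Disney here) or in none (Boeing, whose largest loading is below the cutoff).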
Conclusions
The movements in stock price of the 30 stocks that comprise the Dow Jones Industrial Average are highly correlated. As such, they are a prime candidate for factor analysis and dimensionality reduction. Using five factors, we can group the variability in the stock market into categories. Roughly speaking, the three categories that explain the most variation are financial, consumer goods, and technology. The fourth and fifth factors seem to represent approximately the same dimension, namely manufacturing and industry.
Using this factor analysis, we can now view fluctuations in the stock market based on groups rather than individual stocks. We have reduced the dimensionality of the stocks in the Dow Jones from 30 down to 5, while still explaining 60 percent of the variability, greatly simplifying analysis of this stock data.
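For context on where a figure like "60 percent of the variability" comes from: when the analysis is run on the correlation matrix, each variable has variance 1, so the share of total variance captured by the retained factors is the sum of all squared loadings divided by the number of variables. A minimal sketch, using a made-up 3-by-2 loading matrix rather than the table above (the function name `variance_explained` is hypothetical):

```python
import numpy as np

def variance_explained(loadings):
    """Share of total variance per factor, and overall, for standardized variables."""
    L = np.asarray(loadings, dtype=float)
    n_variables = L.shape[0]
    # Column sums of squared loadings, divided by the number of variables.
    per_factor = (L ** 2).sum(axis=0) / n_variables
    return per_factor, per_factor.sum()

# Hypothetical loadings for 3 variables on 2 factors (not the post's data).
toy = [[0.8, 0.1],
       [0.7, 0.2],
       [0.1, 0.9]]

per_factor, total = variance_explained(toy)
print(per_factor, total)
```

Applying the same function to the full 30-by-5 loading table would give the corresponding shares for this analysis.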
Future work in this direction could include using more than the past 1000 days of data and possibly including more than 30 stocks in the factor analysis.
Posted on January 20, 2009, in Uncategorized. 11 Comments.
This analysis is interesting. Now, on to prediction, heh heh!
If I could predict the stock market, this blog would be called Straight Cash in the Wild.
Strangely the browser I have does not show your page as it should… It appears that a whole chunk of it is not showing and the skin of the article does not appear to be right. Are you sure this page has been set up for Google Chrome?
I never thought of it that way, well put!
Hey, could you post the data? I would like to do some analysis on it of my own…
I don’t have the data anymore. That post was from 5 years ago. But Yahoo or Google should be easy to scrape to get the data that you are interested in. Check out the R package XML or the Python package BeautifulSoup.
Cheers,
Greg
Since factor analysis is cross-sectional, do the results posted above come from the latest data, meaning the latest day in the 1000 days you studied? Or are these an average of the 1000 days?
I guess I don’t understand the question. The results are based on all 1000 days.
Cheers,
Greg
I mean did you calculate the percentage return for these stocks on a daily basis for the past 1000 days?
I believe I used closing price for each day.