Major League Baseball Hall of Fame Voting and Dimension Reduction
The Major League Baseball Hall of Fame votes came out yesterday. No one got in. The big issue that everyone is talking about is how to vote for players from the steroids era. Some voters don’t care, and other voters are taking moral stands against players like Clemens and Bonds. Essentially, a vote for Bonds was a vote for Clemens, which can be visualized nicely in this heatmap: just about everyone who voted for Clemens voted for Bonds. I wanted to look into this a little more. Using data from the ballots that are public (collected here; thank you for collecting this), I used some dimension reduction techniques to visualize the variability between both players and voters.
First, treating the votes received by each player as a vector of ones and zeroes (one entry per public ballot), the distance between each pair of players can be calculated using a distance metric. Multidimensional scaling then projects these distances onto a two-dimensional surface so that the distances are preserved as well as possible. The image below shows the results, with the size of each player’s name scaled to the number of votes he received on public ballots. The axes here don’t really have any clear interpretation; all we are interested in is how far apart two names are. You’ll immediately notice that Clemens and Bonds are right next to each other. In fact, I had to manually move Clemens’s name up slightly, since it was literally on top of Bonds’s. The other manual adjustment I had to make was for the Lee Smith, Alan Trammell, Edgar Martinez cluster. These names were all right on top of each other, too. Overall, the plot seems to display two separate issues in voting this year: how good you were, and whether you are tainted by steroids. The closer a player is to the Bonds/Clemens group, the greater the smell of steroids and PEDs. The closest players to Bonds/Clemens are Piazza and Bagwell to the left and McGwire, Sosa, and Palmeiro on the right. The difference between the two groups is that Bagwell and Piazza received more votes than McGwire, Sosa, and Palmeiro. So, generally speaking, it looks like better players are further to the left and players associated with steroids and PEDs are closer to the bottom. Also, totally unrelated to anything, doesn’t this plot sort of resemble a heart? I’m not sure what that means. But I definitely see it.
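For anyone who wants to try the player map themselves, here is a minimal sketch of the idea in Python with NumPy. It is not the code behind the plot above: the ballots are a made-up toy matrix, the distance is assumed to be Euclidean, and classical (Torgerson) multidimensional scaling stands in for whichever MDS variant was actually used. The function name classical_mds is just a label for this example.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: embed n points in k dimensions from an n x n distance matrix."""
    n = dist.shape[0]
    # Double-center the squared distances to recover an inner-product (Gram) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the k largest eigenvalues; clip tiny negatives caused by rounding.
    order = np.argsort(eigvals)[::-1][:k]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical toy ballots (NOT the real data): rows are players, columns are voters.
players = ["Bonds", "Clemens", "Biggio", "Morris"]
votes = np.array([
    [1, 1, 0, 0, 1, 0],  # Bonds
    [1, 1, 0, 0, 1, 0],  # Clemens (identical to Bonds in this toy example)
    [1, 0, 1, 1, 1, 1],  # Biggio
    [0, 0, 1, 1, 0, 1],  # Morris
])

# Assumed metric: Euclidean distance between the 0/1 vote vectors.
diff = votes[:, None, :] - votes[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

coords = classical_mds(dist, k=2)
for name, (x, y) in zip(players, coords):
    print(f"{name:8s} {x: .3f} {y: .3f}")
```

Two players with identical vote vectors sit at distance zero and therefore land on the same point, which is exactly why Clemens and Bonds end up printed on top of each other.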
Probably more interesting than looking at the players is breaking down the voters. To do this, I used principal component analysis and then plotted one PC against another. It’s immediately evident that the voters fall into two main groups, and I have added some colors to highlight these differences. The red group in this first plot consists of the voters who voted for neither Bonds nor Clemens, and the blue group voted for both Clemens and Bonds. Only three voters on public ballots, highlighted in green, voted for Bonds and not Clemens, and no public ballot had a vote for Clemens and not Bonds. Some interesting ballots stand out on this plot immediately, like Bob Padecky’s. He seems to be very far away from the other voters, and a quick look at his ballot tells you why: he voted for Clemens, Bonds, McGwire, Sosa, and Palmeiro and no one else. Not many other voters voted this way. At the other end of the steroids spectrum you have guys like Peter Gammons and Kirby Arnold. Gammons and Arnold both voted for Bagwell, Biggio, Edgar Martinez, Morris, Piazza, Raines, Schilling, and Trammell, while Arnold additionally voted for McGriff and Lee Smith and Gammons added Larry Walker.
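The voter analysis can be sketched the same way. The snippet below is only an illustration under stated assumptions: the ballots are invented, the columns are centered but not rescaled, and the principal components come from a plain SVD in NumPy rather than from whatever PCA routine produced the plots in this post.

```python
import numpy as np

# Hypothetical toy ballots (NOT the real data): rows are voters, columns are players.
players = ["Bonds", "Clemens", "Schilling", "Biggio"]
ballots = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])

# Center each player's column so the components describe disagreement
# between voters rather than overall vote totals.
centered = ballots - ballots.mean(axis=0)

# SVD of the centered matrix: rows of vt are the principal directions,
# and u * sing gives each voter's coordinates (scores) on them.
u, sing, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * sing

# Groups one might color in a PC1-vs-PC2 scatter plot.
bonds_voters = ballots[:, 0] == 1
schilling_voters = ballots[:, 2] == 1

for i, (pc1, pc2) in enumerate(scores[:, :2]):
    print(f"voter {i}: PC1={pc1: .3f} PC2={pc2: .3f} "
          f"bonds={bonds_voters[i]} schilling={schilling_voters[i]}")
```

Note that the sign of each component is arbitrary, so which group lands on the left or the right of a plot like this carries no meaning; only the separation does.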
While the first principal component is mostly driven by voters’ opinions on steroids, the second principal component is not so obvious. It turns out that Curt Schilling does a pretty good job of explaining it, though. The plot below is the same as the previous one, but the Schilling voters are highlighted in orange and purple. You can see that voters towards the bottom of the graph tended to vote for Schilling and the voters towards the top did not. So what we have here is the upper right, in red, consisting of voters who voted for neither Bonds nor Schilling, and the lower left, in purple, consisting of voters who voted for both Schilling and Bonds. Voters in the upper left tended to vote for Bonds and not Schilling, whereas voters in the lower right voted for Schilling and not Bonds. This is why I have labeled the x-axis “Bonds/Clemens” and the y-axis “Schilling”.
The first and second principal components explain about 34% of the variability of the voters, and the first four explain almost exactly 50% of the variability. The plot below shows the third principal component against the fourth. The divisions aren’t as clear as in the previous plot, but there are still some groups that can be discerned. It seems that the third and fourth PCs are being driven by Jack Morris and Alan Trammell, respectively. Finally, remember what we are looking at. Principal components seek to explain variability, in this case between voters, so the principal components are going to align with the dimensions that explain the most variability. In other words, the PCs are highlighting the controversial candidates where there is substantial disagreement, and most of the disagreement, seen in the first PC, can be explained by players associated with steroids, such as Bonds and Clemens. These are highly polarizing characters in general, and especially when it comes to Hall of Fame voting. Following these guys, the next largest sources of variability are loosely associated with Schilling, Morris, and Trammell, all of whom seem to be highly debated candidates for the Hall of Fame.
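To make the “percent of variability explained” and “driven by” statements concrete, here is a rough sketch of how both can be read off a PCA. Again, this is a toy example with invented ballots, using centered columns and NumPy’s SVD; the numbers it prints have nothing to do with the 34% and 50% figures above.

```python
import numpy as np

# Hypothetical toy ballots (NOT the real data): rows are voters, columns are players.
players = ["Bonds", "Clemens", "Schilling", "Morris"]
ballots = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

centered = ballots - ballots.mean(axis=0)
_, sing, vt = np.linalg.svd(centered, full_matrices=False)

# Each squared singular value is proportional to the variance along that
# component, so normalizing gives the share of variability explained.
explained = sing ** 2 / np.sum(sing ** 2)
cumulative = np.cumsum(explained)

# The player with the largest absolute loading is a crude label for what
# "drives" a component.
top_player = [players[int(np.argmax(np.abs(row)))] for row in vt[:3]]

for i in range(3):
    print(f"PC{i + 1}: {explained[i]:.1%} explained "
          f"({cumulative[i]:.1%} cumulative), largest loading: {top_player[i]}")
```

Because Bonds and Clemens have identical columns in this toy matrix, they share the first component equally and either name may be reported as its largest loading.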