Stats in Baseball: Part 1 (in the wild)

I like stats. I also like baseball. So what could I love more than baseball statistics?

Bill James is widely considered to be the father of baseball statistics, or sabermetrics.

Bill James came up with a whole bunch of very clever ways to analyze baseball using statistics, including runs created, range factor, and win shares.

Another one of his stats is Pythagorean expectation:
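
In its classic form, assuming the exponent of 2 that James originally proposed (later refinements use a slightly smaller exponent), a team’s expected winning percentage is:

```latex
\[
\text{Win\%} = \frac{RS^{2}}{RS^{2} + RA^{2}}
\]
```

Here RS is runs scored and RA is runs allowed (opponent runs scored). Multiplying the result by the number of games played gives expected wins.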

This statistic works very well for predicting wins, but it doesn’t really make any sense. Why does it work? I was wondering how this statistic would compare with a multiple regression based on the same data.

Using 2008 MLB team data on wins, runs scored, and opponent runs scored, I compared the expected wins predicted by Pythagorean expectation with those from a simple multiple regression model of the form Wins = b0 + b1*Runs scored + b2*Opponent runs scored + error. The root mean squared error for Pythagorean expectation was 7.37 wins. The root mean squared error of prediction for the regression model was 4.18 wins. So while Pythagorean expectation does a very good job predicting win percentage, a multiple linear regression does a better job on this data. The regression model also has coefficients that can be interpreted practically, whereas Pythagorean expectation works well but offers very little reasoning as to why.
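
The comparison comes down to turning runs into expected wins and then scoring the predictions with a root mean squared error. Here is a minimal sketch of those two pieces, assuming an exponent of 2, a 162-game season, and an RMSE that divides by the number of teams (the exact RMSE variant behind the 7.37 and 4.18 figures isn’t stated). The function names are illustrative, and the 800/700 team is made up, not 2008 data.

```python
import math


def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins under Pythagorean expectation (exponent 2 is an assumption)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical team that scores 800 runs and allows 700.
example = pythagorean_wins(800, 700)
print(f"Pythagorean expected wins: {example:.2f}")
```

To reproduce the comparison, you would pass each team’s actual wins and each method’s predicted wins to `rmse` and compare the two numbers.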

The fitted model was: Predicted wins = 79.7416 + 0.1025*Runs scored - 0.1088*Opponent runs scored. This model has R-squared = 0.8528. So on average, every extra ten runs a team scores is worth approximately one win, and every extra ten runs given up costs approximately one win.
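
The “ten runs is about one win” reading falls straight out of the coefficients. Here is a small sketch that plugs in the fitted values quoted above; the function name and the 800/700 team are illustrative, not part of the original analysis.

```python
# Coefficients as quoted for the fitted 2008 model.
INTERCEPT = 79.7416
PER_RUN_SCORED = 0.1025
PER_RUN_ALLOWED = -0.1088


def predicted_wins(runs_scored, runs_allowed):
    """Predicted season wins from the two-variable regression model."""
    return (INTERCEPT
            + PER_RUN_SCORED * runs_scored
            + PER_RUN_ALLOWED * runs_allowed)


# Hypothetical team: 800 runs scored, 700 runs allowed.
base = predicted_wins(800, 700)

# Effect of ten extra runs scored, and of ten extra runs allowed.
gain = predicted_wins(810, 700) - base
loss = predicted_wins(800, 710) - base

print(f"baseline: {base:.2f} wins")
print(f"ten more runs scored: {gain:+.3f} wins")
print(f"ten more runs allowed: {loss:+.3f} wins")
```

Because the model is linear, those ten-run effects are the same for every team, whatever its baseline.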

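What happens if you build a regression model with more than just runs and opponent runs as predictor variables? Using the same 2008 data, a model for predicting wins from WHIP (walks plus hits per inning pitched), SB (stolen bases), OBP (on-base percentage), and HR (home runs) is: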

Predicted wins = 60.75 - 71.15*WHIP + 0.11768*SB + 271.133*OBP + 0.10984*HR
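
The coefficients in this model sit on very different scales (OBP is a fraction near 0.330, while SB and HR are season counts), so their raw sizes can’t be compared directly. Here is a rough sketch that evaluates the model and shows what a realistic change in each input is worth. The function name and the sample team stats are hypothetical, chosen only for illustration.

```python
def predicted_wins_four_factor(whip, sb, obp, hr):
    """Predicted wins from the four-variable model quoted above."""
    return 60.75 - 71.15 * whip + 0.11768 * sb + 271.133 * obp + 0.10984 * hr


# Hypothetical team: WHIP 1.30, 100 stolen bases, .330 OBP, 160 home runs.
team = {"whip": 1.30, "sb": 100, "obp": 0.330, "hr": 160}
base = predicted_wins_four_factor(**team)
print(f"baseline: {base:.2f} wins")

# Change one input at a time by a plausible amount and report the difference.
changes = {"whip": -0.05, "sb": 10, "obp": 0.010, "hr": 10}
effects = {}
for stat, delta in changes.items():
    adjusted = dict(team)
    adjusted[stat] += delta
    effects[stat] = predicted_wins_four_factor(**adjusted) - base
    print(f"{stat} {delta:+}: {effects[stat]:+.2f} wins")
```

Read this way, ten extra steals are worth a bit more than one predicted win in this model, about the same as ten extra home runs.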

It seems that you can break down winning baseball games into four factors:
1.) Pitching (WHIP)
2.) Speed (SB)
3.) Contact hitting (OBP)
4.) Power hitting (HR)

I realize that’s not a shocking revelation, but it’s neat to see it even with this small data set.

I was a little bit surprised to see that SB shows up because a common theory is that stealing bases is not worth the risk, but it shows up very strongly in this model.

So I looked it up:
Top 5 teams in SB for 2008
1.) Tampa Bay Rays
2.) Colorado Rockies
3.) New York Mets
4.) Philadelphia Phillies
5.) Los Angeles Angels

Bottom 5 in SB in 2008
30.) San Diego Padres
29.) Pittsburgh Pirates
28.) Arizona Diamondbacks
27.) Atlanta Braves
26.) Detroit Tigers

3 of the top 5 and 5 of the top 7 teams made the playoffs. Interesting.

I’ll end with a quote from the greatest base stealer of all time: “It took a long time, huh? [Pause for cheers] First of all, I would like to thank God for giving me the opportunity. I want to thank the Haas family, the Oakland organization, the city of Oakland, and all you beautiful fans for supporting me. [Pause for cheers] Most of all, I’d like to thank my mom, my friends, and loved ones for their support. I want to give my appreciation to Tom Trebelhorn and the late Billy Martin. Billy Martin was a great manager. He was a great friend to me. I love you, Billy. I wish you were here. [Pause for cheers] Lou Brock was the symbol of great base stealing. But today, I’m the greatest of all time. Thank you.”
—Rickey Henderson’s full speech after breaking Lou Brock’s record


Posted on November 12, 2008, in Baseball.