Category Archives: R
Olympics Box Plots: Part 2 / ggplot2 Shoutout
So, last Monday I posted some Olympics boxplots, and then I left for the week to go golfing with my father (we play 7 rounds in 4 days, you take your best score on each hole, add them up, and declare a winner. I won 66-68 this year. Pops now leads the all-time series 3-2). When I came back, the blog had thousands of hits, which I initially assumed to be a mistake. It turns out, however, that those boxplots ended up on the front page of FlowingData, which has way more readers than I do.
While I’m excited to be mentioned on FlowingData, I don’t think the basic R graphics boxplots live up to the standard set over there (Nathan Yau described my plots as “barebones”, and he’s right). I mean, this post of “Every Idea in History” is just straight-up impressive. So, since I’ve been trying to learn a new graphing package in R, I decided to update the Olympics plots using ggplot2 to try to make them look a bit more professional.
In my very limited use of ggplot2, I have found it a little harder to use than base graphing and plotting in R. However, I suspect this is only because I am used to one way of plotting in R, and that as I use ggplot2 more often I won’t believe I ever used the old way (it’s like when I switched from Windows to Unix: a little bit harder to learn, but much better). While it’s taking me a bit of time to learn, I also have to say that, once you figure out the code, it’s much easier to get exactly what you want. For instance, when I originally did my Olympics plots I wanted to order the sports by median age and then split each sport by gender. I had quite a difficult time doing this in base graphing in R (and just ended up not doing it altogether), but ggplot2 handled this very easily (the code is at the end).
Below are side-by-side box plots of the ages of Olympic athletes, sorted by the median age of the competitors in each sport, first for the years 2000-2008 and then for all years. Within each sport, the competitors are split by gender into separate box plots, so the age distributions of the genders within each sport can be easily compared.
Finally, two things I should note that I didn’t mention in the first post:
- I’ve added a small bit of noise to each of the ages so that the outliers can be seen more clearly.
- The data that I am using is not the complete set of Olympians of all time, though it is the vast, vast majority of them. When I was scraping, some athletes’ names contained non-standard characters (e.g. é or ü), and these had to be converted to the English alphabet equivalent (e.g. e or u). While I manually corrected many of these, I do not believe that I corrected all of them. So, there are probably a few Olympians missing from my data set, though I believe it is a very, very small number relative to the total number of athletes.
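The post does not say how the noise was added or how large it was. As a rough sketch, assuming base R's `jitter()` and a made-up vector of ages (both the `ages` values and the `amount` are illustrative, not the ones used for the plots), it could look like this:

```r
# Made-up ages with many ties, NOT the scraped Olympic data.
ages <- c(16, 16, 16, 17, 17, 18)

set.seed(1)  # only so the sketch is reproducible

# jitter() adds uniform noise in [-amount, amount] to each value,
# so tied outliers no longer sit exactly on top of each other.
noisy_ages <- jitter(ages, amount = 0.25)

# Every value moves by at most a quarter of a year.
max(abs(noisy_ages - ages)) <= 0.25
```

A small `amount` keeps the points visually separated without changing which whole-year age each athlete appears to have.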
Cheers.
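library(ggplot2)
# Box plots of Age for each Sport (ordered by median age), filled by Sex
p <- qplot(reorder(factor(Sport), Age.median), Age, fill = factor(Sex), data = dat.summer, geom = "boxplot")
# Manual fill colours and legend labels, vertical x-axis labels, and a title
p + scale_fill_manual(name = "", values = c("green", "pink", "blue"), labels = c("B" = "Both", "F" = "Female", "M" = "Male")) + xlab("Sport") + theme(axis.text.x = element_text(angle = -90)) + ggtitle("Age Distribution of Olympic Athletes by Age and Gender: 2000-2008")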
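The plotting call relies on an `Age.median` column whose construction is not shown. Here is a minimal sketch of one way to build it in base R, using a small made-up data frame (the name `toy` and all of its values are purely illustrative, not the scraped data):

```r
# Toy data: made-up ages, NOT the scraped Olympic data set.
toy <- data.frame(
  Sport = c("Gymnastics", "Gymnastics", "Gymnastics",
            "Equestrianism", "Equestrianism", "Equestrianism",
            "Swimming", "Swimming", "Swimming"),
  Age = c(16, 18, 20, 33, 36, 41, 19, 22, 24),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the median age of that row's sport.
toy$Age.median <- ave(toy$Age, toy$Sport, FUN = median)

# reorder() sorts the factor levels by the per-sport value (it takes the
# mean within each level, which here is just the constant median).
sport_by_median <- reorder(factor(toy$Sport), toy$Age.median)
levels(sport_by_median)  # youngest median first, oldest last
```

Because the levels are ordered by median age, any plot that uses `sport_by_median` on the x-axis will show the sports from youngest to oldest.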
Olympics Boxplot
[7/23/2012 Addition: I’ve updated these plots using ggplot2 to look nicer. They can be found here.]
Recently, I saw this pretty cool chart at the Washington Post (I originally saw the chart at this wonderful blog here) about the ages of Olympians from the past three Olympics. I thought it would be more interesting with boxplots of the data, rather than simple ranges, and I also wondered what it would look like if we used data from all of the past Olympics.
So, I wrote some R code and began scraping sports-reference.com/olympics to get a data set with all of the Olympic athletes from all of the games. This took me quite some time (and work kept getting in the way), but I eventually got it right and collected the data.
Here are some of the resulting graphs:
Below is a graph of side-by-side boxplots of age for each sport by gender, with blue for male, pink for female, and green for mixed competition. And no, the 11-year-old female swimmer is not a typo, as I originally thought.
The previous graph was kind of messy, so I’ve sorted this one by median age. Not surprisingly, female gymnastics and rhythmic gymnastics have the lowest median ages of competitors, while equestrianism has the highest median age at over 35 years.
The previous two graphs were only for the years 2000-2008, so I re-did the previous graph using data from all of the Olympics. Since the obvious question arising from this graph is “What is roque?”, I have saved you the trouble of googling it by providing a Wikipedia link for roque.
This graph is boxplots of age by year with the color representing the host continent.
[7/15/2012 Correction: The original post had the 1956 box colored blue for Europe. However, commenter Mules points out that 1956 should actually be yellow for Australia. They are correct and the correction has been made. However, as I point out in response, I’m not totally wrong: The equestrian events had to be held in Stockholm, Sweden due to quarantine restrictions.]
[7/23/2012 Correction: The graph below had some mistakes in it, including an olympian who was over 90. This was pointed out by Kate, and has been corrected.]

And finally, we have overall age by gender.
Cheers.
MLB rankings – 7/9/2012
StatsInTheWild MLB rankings as of July 9, 2012 at 1:22pm. SOS=strength of schedule
| Team | Rank | Change | Record | ESPN | TeamRankings.com | SOS | Run Diff |
| NYY | 1 | ↑1 | 52-33 | 1 | 1 | 4 | +65 |
| Texas | 2 | ↓1 | 52-34 | 2 | 2 | 13 | +79 |
| LA Angels | 3 | ↑1 | 48-38 | 4 | 4 | 11 | +44 |
| ChiSox | 4 | ↑2 | 47-38 | 6 | 5 | 14 | +63 |
| Boston | 5 | ↓2 | 43-43 | 17 | 13 | 6 | +43 |
| Toronto | 6 | ↓1 | 43-43 | 18 | 12 | 2 | +22 |
| Washington | 7 | – | 49-34 | 3 | 3 | 21 | +58 |
| TampaBay | 8 | – | 45-41 | 14 | 8 | 3 | +4 |
| Detroit | 9 | ↑2 | 44-42 | 16 | 11 | 10 | +6 |
| Baltimore | 10 | ↓1 | 45-40 | 11 | 7 | 1 | -36 |
| Oakland | 11 | ↓1 | 43-43 | 19 | 16 | 7 | +3 |
| Cincinnati | 12 | – | 47-38 | 8 | 9 | 25 | +42 |
| St. Louis | 13 | ↑2 | 46-40 | 13 | 17 | 26 | +70 |
| Atlanta | 14 | – | 46-39 | 12 | 10 | 20 | +34 |
| Pittsburgh | 15 | ↑4 | 48-37 | 5 | 6 | 27 | +32 |
| Cleveland | 16 | ↑1 | 44-41 | 15 | 14 | 12 | -29 |
| NY Mets | 17 | ↓4 | 46-40 | 10 | 15 | 17 | +20 |
| LA Dodgers | 18 | – | 47-40 | 7 | 19 | 28 | +10 |
| Seattle | 19 | ↑2 | 36-51 | 26 | 24 | 5 | -28 |
| SF | 20 | ↓4 | 46-40 | 9 | 18 | 30 | -8 |
| Kansas City | 21 | ↑1 | 37-47 | 24 | 21 | 9 | -41 |
| Arizona | 22 | ↓2 | 42-43 | 20 | 20 | 29 | +10 |
| Miami | 23 | – | 41-44 | 21 | 22 | 15 | -56 |
| Milwaukee | 24 | ↑1 | 40-45 | 22 | 25 | 24 | -9 |
| Minnesota | 25 | ↑1 | 36-49 | 25 | 23 | 8 | -87 |
| Philadelphia | 26 | ↓2 | 37-50 | 23 | 26 | 16 | -28 |
| Chic Cubs | 27 | ↑2 | 33-52 | 29 | 27 | 18 | -69 |
| Houston | 28 | ↓1 | 33-53 | 28 | 30 | 19 | -72 |
| Colorado | 29 | ↓1 | 33-52 | 30 | 28 | 22 | -66 |
| San Diego | 30 | – | 34-53 | 27 | 29 | 23 | -76 |
Cheers.
Coke, Pop, or Soda
I found this nice picture below over at Edwin Chen’s blog post about Soda vs Pop with Twitter. Enjoy!
Cheers.
State Capitals, Baseball, and an R scrape
Slate has an awesome sports podcast called “Hang Up and Listen” featuring Stefan Fatsis, Josh Levin, and Mike Pesca. Every week they do a trivia question, and this week’s question was:
What current major league player’s first and last names are both state capitals?
(The answer is one of my top all-star snubs, if you need a hint.)
Anyway, I got to wondering whether any other players had state capital names. So I used R and the XML package to scrape baseball-almanac.com for a list of all major league players and then searched each of their names for state capitals.
In alphabetical order of state I found:
- Montgomery (Al, Bob, Jeff, Monty, Ray, and Steve)
- Steve Phoenix
- There are 6 Little’s (Bryan, Harry, Jack, Jeff, Mark, and Scott) and 1 Rock (Les). So I’m going to count that for Arkansas. (I also found Rocky Stone who has been added to the Rip Torn hall of fame….)
- Denver Grigsby
- Bruce Hartford
- Gene and Mike Lansing
- There are 34 players with the last name Jackson, and one (Jackson Todd) with the first name Jackson.
- Jefferson (Jesse, Reggie, and Stan)
- Carson (Al, Kit, Matt, and Robert)
- Raleigh Aitchison and John Raleigh
- Juan Pierre, Pierre Roy, and Max St. Pierre
- There are 10 Austin’s. 6 with first name Austin (Jackson(!), Kearns, Walsh, Knickerbocker, McHenry, and Romine) and 4 with last name Austin (Jeff, Jim, Jimmy, and Rick)
- For Utah you’d have to go with Jarrod Saltalamacchia, Salty Parker, or Jack Saltzgaver and one of the 4 Lake’s (Eddie, Fred, Joe, or Steve), although that is stretching it a bit.
- Richmond (Beryl, Don, John, Lee, Ray, and Scott)
- Madison (Art, Dave and Scotti). And of course Madison Bumgarner.
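The scraping code is not shown in the post, so the sketch below skips the XML step and only illustrates the name-matching part. It assumes the player names are already in a character vector; the function name `find_capital_names` is hypothetical, and the handful of players and single-word capitals are taken from the list above rather than from the full scraped data.

```r
# A few single-word state capitals and a few players mentioned in these posts.
capitals <- c("Montgomery", "Phoenix", "Denver", "Hartford", "Lansing",
              "Jackson", "Raleigh", "Pierre", "Austin", "Richmond", "Madison")

players <- c("Steve Phoenix", "Denver Grigsby", "Bruce Hartford",
             "Juan Pierre", "Austin Jackson", "Madison Bumgarner",
             "R.A. Dickey")

# Hypothetical helper: flag whether the first and last word of each
# name exactly match a capital.
find_capital_names <- function(players, capitals) {
  parts <- strsplit(players, " ", fixed = TRUE)
  first <- vapply(parts, function(p) p[1], character(1))
  last  <- vapply(parts, function(p) p[length(p)], character(1))
  data.frame(player = players,
             first_is_capital = first %in% capitals,
             last_is_capital  = last %in% capitals,
             stringsAsFactors = FALSE)
}

hits <- find_capital_names(players, capitals)

# Players with at least one capital in their name
hits[hits$first_is_capital | hits$last_is_capital, ]

# Players whose first AND last names are both capitals
both <- hits$player[hits$first_is_capital & hits$last_is_capital]
```

Exact matching like this will miss the stretches in the list (Little plus Rock, or Salt plus Lake), which would need partial matching or a manual look.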
Cheers.
MLB Rankings – 7/2/2012
StatsInTheWild MLB rankings as of July 2, 2012 at 10:37am. SOS=strength of schedule
The last rankings came out on June 25, and since then the Dodgers have gone 1-6. Going back two weeks, to June 19, they are 2-11. As a result of this they fall 8 spots this week, landing them at number 18 in my rankings. TeamRankings.com has them even lower at 19, and, for some reason, ESPN has them ranked number 4 in their June 25 power rankings. (I’ll update the ESPN rankings as soon as they are posted.)
| Team | Rank | Change | Record | ESPN | TeamRankings.com | SOS | Run Diff |
| Texas | 1 | – | 50-30 | 1 | 2 | 13 | +100 |
| NYY | 2 | – | 48-30 | 2 | 1 | 5 | +61 |
| Boston | 3 | ↑2 | 42-37 | 10 | 6 | 6 | +57 |
| LA Angels | 4 | ↑3 | 44-35 | 6 | 4 | 8 | +44 |
| Toronto | 5 | ↓1 | 40-39 | 17 | 8 | 3 | +29 |
| ChiSox | 6 | ↑2 | 42-37 | 13 | 10 | 14 | +42 |
| Washington | 7 | ↑2 | 45-32 | 3 | 3 | 21 | +48 |
| TampaBay | 8 | ↓5 | 41-38 | 7 | 9 | 2 | -1 |
| Baltimore | 9 | ↓3 | 42-36 | 8 | 5 | 1 | -26 |
| Oakland | 10 | ↑1 | 38-42 | 20 | 18 | 7 | -2 |
| Detroit | 11 | ↑4 | 39-40 | 18 | 16 | 11 | -7 |
| Cincinnati | 12 | ↑1 | 43-32 | 5 | 7 | 30 | +33 |
| NY Mets | 13 | ↑3 | 43-37 | 11 | 12 | 17 | +22 |
| Atlanta | 14 | – | 41-37 | 12 | 11 | 16 | +21 |
| St. Louis | 15 | ↓3 | 41-38 | 16 | 17 | 28 | +57 |
| SF | 16 | ↑3 | 45-35 | 9 | 14 | 29 | +16 |
| Cleveland | 17 | ↑1 | 40-38 | 15 | 15 | 12 | -37 |
| LA Dodgers | 18 | ↓8 | 44-36 | 4 | 19 | 26 | +18 |
| Pittsburgh | 19 | ↑2 | 42-36 | 14 | 13 | 27 | +6 |
| Arizona | 20 | ↓3 | 39-39 | 19 | 20 | 25 | +13 |
| Seattle | 21 | ↓1 | 34-47 | 25 | 23 | 4 | -30 |
| Kansas City | 22 | – | 35-42 | 24 | 21 | 10 | -37 |
| Miami | 23 | ↑1 | 38-40 | 21 | 22 | 15 | -58 |
| Philadelphia | 24 | ↓1 | 36-45 | 22 | 26 | 18 | -15 |
| Milwaukee | 25 | – | 36-42 | 23 | 25 | 24 | -11 |
| Minnesota | 26 | – | 33-45 | 27 | 24 | 9 | -85 |
| Houston | 27 | – | 32-47 | 26 | 27 | 20 | -53 |
| Colorado | 28 | – | 30-48 | 28 | 28 | 23 | -56 |
| Chic Cubs | 29 | – | 29-49 | 29 | 29 | 19 | -71 |
| San Diego | 30 | – | 30-50 | 30 | 30 | 22 | -78 |
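The Change column is just last week's rank minus this week's rank. A small sketch of that arithmetic, using four teams' ranks from the June 25 and July 2 tables (the variable names are mine, and this is not necessarily how the table was actually produced):

```r
# Ranks for a few teams, copied from the June 25 and July 2 tables.
prev <- c("Texas" = 1, "NYY" = 2, "Tampa Bay" = 3, "LA Dodgers" = 10)
curr <- c("Texas" = 1, "NYY" = 2, "LA Dodgers" = 18, "Tampa Bay" = 8)

# Align last week's ranks to this week's order by team name, then subtract.
# Positive = moved up the rankings, negative = moved down.
change <- prev[names(curr)] - curr

# Format like the table: up arrow, down arrow, or a dash for no change.
# "\u2191" is the up arrow, "\u2193" the down arrow, "\u2013" the dash.
label <- ifelse(change > 0, paste0("\u2191", change),
         ifelse(change < 0, paste0("\u2193", -change), "\u2013"))

data.frame(Team = names(curr), Rank = unname(curr), Change = unname(label))
```

Matching by name rather than by position matters because the teams appear in a different order each week.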
Cheers.
MLB rankings – 6/25/2012
StatsInTheWild MLB rankings as of June 25, 2012 at 4:37pm. SOS=strength of schedule
| Team | Rank | Change | Record | ESPN | TeamRankings.com | SOS |
| Texas | 1 | ↑1 | 45-28 | 1 | 2 | 13 |
| NYY | 2 | ↓1 | 43-28 | 2 | 1 | 4 |
| Tampa Bay | 3 | – | 40-32 | 7 | 4 | 3 |
| Toronto | 4 | ↑1 | 37-35 | 17 | 8 | 1 |
| Boston | 5 | ↑1 | 38-34 | 10 | 7 | 5 |
| Baltimore | 6 | ↓2 | 41-31 | 8 | 3 | 2 |
| LA Angels | 7 | – | 40-33 | 6 | 6 | 8 |
| ChiSox | 8 | ↑2 | 38-34 | 13 | 14 | 14 |
| Washington | 9 | – | 41-29 | 3 | 5 | 21 |
| LA Dodgers | 10 | ↓2 | 43-30 | 4 | 12 | 28 |
| Oakland | 11 | ↑1 | 35-38 | 20 | 18 | 7 |
| St. Louis | 12 | ↑3 | 38-35 | 16 | 17 | 29 |
| Cincinnati | 13 | ↓2 | 39-32 | 5 | 9 | 30 |
| Atlanta | 14 | – | 38-34 | 12 | 10 | 17 |
| Detroit | 15 | ↓2 | 35-37 | 18 | 20 | 11 |
| NY Mets | 16 | ↑1 | 39-34 | 11 | 11 | 16 |
| Arizona | 17 | ↑3 | 37-35 | 19 | 19 | 26 |
| Cleveland | 18 | – | 37-34 | 15 | 16 | 12 |
| SF | 19 | ↓3 | 40-33 | 9 | 15 | 25 |
| Seattle | 20 | ↓1 | 31-43 | 25 | 24 | 6 |
| Pittsburgh | 21 | ↑2 | 38-33 | 14 | 13 | 27 |
| Kansas City | 22 | ↓1 | 31-39 | 24 | 21 | 10 |
| Philadelphia | 23 | ↓1 | 34-40 | 22 | 23 | 18 |
| Miami | 24 | – | 34-38 | 21 | 22 | 15 |
| Milwaukee | 25 | – | 33-39 | 23 | 26 | 24 |
| Minnesota | 26 | – | 29-42 | 27 | 25 | 9 |
| Houston | 27 | – | 30-42 | 26 | 27 | 23 |
| Colorado | 28 | – | 27-44 | 28 | 28 | 22 |
| Chic Cubs | 29 | – | 24-48 | 29 | 30 | 20 |
| San Diego | 30 | – | 26-47 | 30 | 29 | 19 |
Cheers.
Strikeouts on the rise
Someone recently mentioned to me that strikeouts in the major leagues were at an all-time high. So I did what anyone would naturally do: wrote some R code to scrape baseball-reference.com, collected team data for every team over the past 112 years, and plotted it. The results are below:
Things to notice:
- The strikeout rate for the first 15 years of the 20th century was relatively flat at around 10%.
- There have been two major drops in strikeout rates. The first was from about 1915 through 1920; the second was from the late 1970s through 1980.
- The first drop in strikeout rates came around the beginning of Babe Ruth’s career, when power hitting became a more prominent part of the game.
- The second major drop followed a rule change that lowered the mound from 15 inches to 10 inches for the 1969 season.
- These two short periods of rapid decline were both followed by long stretches of slowly increasing strikeout rates. Strikeout rates climbed steadily from about 1920 through the mid-1960s, and then again from 1980 to the present.
- In 1973 the American League introduced the designated hitter (DH). Before 1973, the American and National Leagues had very similar strikeout rates. After 1973, one can see a clear separation of the leagues, as the National League, not surprisingly, has had a higher strikeout rate than the American League every year since the beginning of the DH era.
- The team with the highest strikeout rate in the last 112 years was the 2010 Arizona Diamondbacks, who finished 65-97 with a strikeout rate of nearly 25%. Before 1980, the team with the highest strikeout rate was the 1968 New York Mets.
- The team with the lowest strikeout rate of the last 20 years was the 2002 Anaheim Angels, who won the World Series that year. Since the mound was lowered, the 1980 Texas Rangers have the distinction of having the lowest strikeout rate for a season. The lowest strikeout rate for any team since 1901 belongs to the 1901 Boston Americans, with a strikeout rate just over 5%.
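The post does not spell out how the strikeout rate was computed. As a hedged sketch, assuming it is strikeouts divided by plate appearances for each team-season, and using made-up totals (the data frame `teams` and every number in it are hypothetical, not scraped values), the per-team and per-league rates could be calculated like this:

```r
# Hypothetical team-season batting totals: illustrative numbers only.
teams <- data.frame(
  Year   = c(2011, 2011, 2011, 2011),
  League = c("AL", "AL", "NL", "NL"),
  SO     = c(1000, 1100, 1200, 1300),  # strikeouts
  PA     = c(6200, 6100, 6150, 6250)   # plate appearances (assumed denominator)
)

# Strikeout rate for each team-season.
teams$K.rate <- teams$SO / teams$PA

# League-wide rate: pool strikeouts and plate appearances within each league,
# rather than averaging the team rates.
league_rate <- with(teams, tapply(SO, League, sum) / tapply(PA, League, sum))
round(league_rate, 3)
```

Pooling the totals weights each team by how many plate appearances it had, which is one reasonable way to compare the two leagues in a given year.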
Cheers.
More one-hitters
R.A. Dickey just pitched his second consecutive one-hitter, which is just the latest feat in what is rapidly becoming the season of the pitchers. So far this season, 9 guys other than Dickey have pitched a one-hitter (E. Santana, Cain, Verlander, Moore, Hammel, CJ Wilson, Vogelsong, Duffy, Felix Hernandez), which brings the total to 11 one-hitters as of June 19. This season has also featured 3 no-hitters (Jered Weaver, Johan Santana, and Millwood et al.) and 2 perfect games (Phil Humber and Matt Cain). In total, there have been 16 games with one or fewer hits so far. (Let me remind you that it’s June 19.) To compare, here is a list of complete seasons that had fewer than 16 low-hit games: 2008 (15), 2007 (12), 2005 (10), 2004 (11), 2003 (10), 2002 (15), 2000 (11), 1999 (11), 1998 (10), 1996 (10), 1995 (13). That’s 11 out of the last 17 seasons. To reiterate, in 7 of the last 11 completed seasons, there were fewer games with one hit or fewer pitched during the entire season than have already been pitched so far in 2012.
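For anyone who wants to check the tally, here is a quick sketch of the arithmetic in R, using only the counts quoted above (the variable names are mine):

```r
# 2012 low-hit games as of June 19, from the paragraph above.
low_hit_2012 <- c(one_hitters = 11, no_hitters = 3, perfect_games = 2)
total_2012 <- sum(low_hit_2012)

# Full-season totals for the seasons listed above.
season_totals <- c("2008" = 15, "2007" = 12, "2005" = 10, "2004" = 11,
                   "2003" = 10, "2002" = 15, "2000" = 11, "1999" = 11,
                   "1998" = 10, "1996" = 10, "1995" = 13)

total_2012                         # total low-hit games so far in 2012
all(season_totals < total_2012)    # every listed season finished below that
length(season_totals)              # number of seasons in the list
```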
Here is an updated graph for June 19th to include the latest one-hitters by Hammel and Dickey.
I’ve also created a graph looking at all games with fewer than 4 hits since 1918 (note that this graph is on a log scale).
Cheers.










