Olympics Box Plots: Part 2 / ggplot2 Shoutout
So, last Monday I posted some Olympics boxplots, and then I left for the week to go golfing with my father (we play 7 rounds in 4 days; you take your best score on each hole, add them up, and declare a winner. I won 66-68 this year. Pops now leads the all-time series 3-2). When I came back, the blog had thousands of hits, which I initially assumed was a mistake. It turns out, however, that those boxplots ended up on the front page of FlowingData, which has way more readers than I do.
While I’m excited to be mentioned on FlowingData, I don’t think the basic R graphics boxplots live up to the standard set over there (Nathan Yau described my plots as “barebones”, and he’s right). I mean, this post of “Every Idea in History” is just straight up impressive. So, since I’ve been trying to learn a new graphing package in R, I decided to update the Olympics plots using ggplot2 to try to make them look a bit more professional.
In my very limited usage of ggplot2, I have found that it is a little bit harder to use than base graphing and plotting in R. However, I suspect this is only because I am used to one way of plotting in R, and that as I use ggplot2 more often I won’t believe I ever used the old way (it’s like when I switched from Windows to Unix: a little bit harder to learn, but much better). While it’s taking me a bit of time to learn, I also have to say that, once you figure out the code, it’s much easier to get exactly what you want. For instance, when I originally did my Olympics plots I wanted to order the sports by median age and then split each sport by gender. I had quite a difficult time doing this in base graphing in R (and just ended up not doing it altogether), but ggplot2 handled this very easily (the code is at the end).
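To give a rough idea of the "order by median, then split by gender" step, here is a minimal sketch. It is not my actual plotting code: the `athletes` data frame, its column names, and its values are made up for illustration, and I'm assuming a reasonably recent ggplot2.

```r
library(ggplot2)

# Hypothetical toy data standing in for the scraped Olympics data
athletes <- data.frame(
  sport  = rep(c("Gymnastics", "Shooting", "Swimming"), each = 40),
  gender = rep(c("F", "M"), times = 60),
  age    = c(seq(15, 25, length.out = 40),   # Gymnastics
             seq(22, 45, length.out = 40),   # Shooting
             seq(18, 30, length.out = 40))   # Swimming
)

# reorder() sorts the levels of sport by the median age within each sport
athletes$sport <- reorder(athletes$sport, athletes$age, FUN = median)

# Mapping fill to gender draws one box per gender within each sport
p <- ggplot(athletes, aes(x = sport, y = age, fill = gender)) +
  geom_boxplot()
```

The idea is that the ordering lives in the factor levels rather than in the plotting call, so the same ordered data can be reused for other plots.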
Below are side-by-side box plots of the ages of Olympic athletes, with sports sorted by the median age of their competitors: first for the years 2000-2008, then for all years. Within each sport, the competitors are split by gender into separate box plots, so the age distributions of the genders within each sport can be easily compared.
Finally, two things I should note that I didn’t mention in the first post:
- I’ve added a small bit of noise to each of the ages so that the outliers can be seen more clearly.
- The data I am using is not the complete set of all-time Olympians, though it is the vast, vast majority of them. When I was scraping, some athletes’ names contained non-standard characters (e.g. é or ü), and these had to be converted to the English alphabet equivalent (e.g. e or u). While I manually corrected many of these, I do not believe that I corrected all of them. So, there are probably a few Olympians missing from my data set, though I believe it is a very, very small number relative to the total number of athletes.
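For the first note, here is a small sketch of what adding noise to the ages can look like. The ages and the amount of noise are made-up values for illustration, and base R's `jitter()` is just one way to do it, not necessarily the exact call I used.

```r
set.seed(42)  # only so the noise is reproducible

# Hypothetical whole-number ages; ties would otherwise overplot as outliers
ages <- c(19, 19, 19, 24, 24, 31)

# jitter() with a positive amount adds uniform noise in [-amount, amount]
ages_noisy <- jitter(ages, amount = 0.3)
```

Because ages are recorded as whole numbers, athletes of the same age land on exactly the same point; a little noise spreads those points out without changing the shape of the box plots much.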
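# p is the boxplot object built earlier; theme(), element_text() and ggtitle() replace the removed opts() and theme_text()
p + scale_fill_manual(name = "", values = c("green", "pink", "blue"),
                      labels = c("B" = "Both", "F" = "Female", "M" = "Male")) +
  xlab("Sport") +
  theme(axis.text.x = element_text(angle = -90)) +
  ggtitle("Age Distribution of Olympic Athletes by Age and Gender: 2000-2008")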