Category Archives: Math Pictures

Olympics Box Plots: Part 2 / ggplot2 Shoutout

So, last Monday I posted some Olympics boxplots, and then I left for the week to go golfing with my father (we play 7 rounds in 4 days, you take your best score on each hole, add them up and declare a winner.  I won 66-68 this year.  Pops now leads the all-time series 3-2).  When I came back, the blog had thousands of hits, which I initially assumed to be a mistake.  Turns out, however, those boxplots ended up on the front page of FlowingData, which has way more readers than I do.

While I’m excited to be mentioned on FlowingData, I don’t think the basic R graphics boxplots really live up to the standard set over there (Nathan Yau described my plots as “barebones”, and he’s right).  I mean, this post of “Every Idea in History” is just straight up impressive.  So since I’ve been trying to learn a new graphing package in R, I decided to update the Olympics plots using ggplot2 to try and look a bit more professional.

In my very limited usage of ggplot2, I have found that it is a little bit harder to use than base graphing and plotting in R.  However, I suspect this is only because I am used to one way of plotting in R, and that as I use ggplot2 more often I won’t believe that I ever used the old way of plotting (it’s like when I switched from Windows to Unix: a little bit harder to learn, but much better).  While it’s taking me a bit of time to learn, I also have to say that, once you figure out the code, it’s much easier to get exactly what you want.  For instance, when I originally did my Olympics plots I wanted to order the sports by median age and then split each sport by gender.  I had quite a difficult time doing this in base graphing in R (and just ended up not doing it altogether), but ggplot2 handled this very easily (the code is at the end).

Below are side-by-side box plots of the ages of Olympic athletes, sorted by the median age of the competitors in each sport, first for the years 2000-2008 and then for all years.  Within each sport, the competitors are split by gender into separate box plots, so the age distributions of the genders within each sport can be easily compared to one another.

Finally, two things I should note that I didn’t mention in the first post:

  • I’ve added a small bit of noise to each of the ages so that the outliers can be seen more clearly.
  • The data that I am using is not the complete set of Olympians all time, though it is the vast, vast majority of them.  When I was scraping, some athletes’ names contained non-standard characters (e.g. é or ü), and these had to be converted to the English alphabet equivalent (e.g. e or u).  While I manually corrected many of these, I do not believe that I corrected all of them.  So, there are probably a few Olympians missing from my data set, though I believe it is a very, very small number relative to the total number of athletes.
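For anyone who wants to try the noise step mentioned in the first bullet, base R's jitter() is one way to do it.  This is only a sketch: the ages are invented, and the amount of 0.25 years is an assumption, not a value taken from the plots.

```r
# Invented ages with ties, so that points would overlap without noise.
ages <- c(16, 16, 16, 21, 21, 35)

set.seed(2012)  # fix the seed so the noise is reproducible

# jitter() adds uniform noise in [-amount, amount] to each value.
noisy_ages <- jitter(ages, amount = 0.25)

# Each noisy age stays within a quarter of a year of the recorded age.
print(round(noisy_ages - ages, 3))
```

With noise this small, tied outliers get separated on the plot while the medians and quartiles of the boxes barely move.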


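library(ggplot2)

# Boxplots of Age by Sport, ordered by median age and filled by Sex
# (written against the ggplot2 API as of 2012: qplot, opts, theme_text)
p <- qplot(reorder(factor(Sport), Age.median), Age, fill = factor(Sex), data = dat.summer, geom = "boxplot")

p + scale_fill_manual(name = "", values = c("green", "pink", "blue"), labels = c("B" = "Both", "F" = "Female", "M" = "Male")) + xlab("Sport") + opts(axis.text.x = theme_text(angle = -90)) + opts(title = "Age Distribution of Olympic Athletes by Age and Gender: 2000-2008")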
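
The code above uses opts() and theme_text(), which later ggplot2 releases replaced with theme() and element_text().  Here is a rough sketch of the same plot in the newer syntax.  It runs on a tiny invented data frame standing in for dat.summer, and it assumes Age.median is simply the per-sport median age stored on every row (the post does not show how that column was built).

```r
# Invented stand-in for dat.summer; the real data set is not shown in the post.
dat <- data.frame(
  Sport = c("Gymnastics", "Gymnastics", "Gymnastics",
            "Equestrian", "Equestrian", "Equestrian",
            "Swimming", "Swimming", "Swimming"),
  Sex   = c("F", "F", "M", "B", "B", "B", "F", "M", "M"),
  Age   = c(16, 18, 22, 34, 38, 41, 19, 21, 24)
)

# Assumption: Age.median holds each sport's median age, repeated per row.
dat$Age.median <- ave(dat$Age, dat$Sport, FUN = median)

# Sports ordered from youngest to oldest median age.
sport_by_median <- reorder(factor(dat$Sport), dat$Age.median)
print(levels(sport_by_median))

# Build the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = reorder(factor(Sport), Age.median),
                       y = Age, fill = factor(Sex))) +
    geom_boxplot() +
    scale_fill_manual(name = "",
                      values = c(B = "green", F = "pink", M = "blue"),
                      labels = c(B = "Both", F = "Female", M = "Male")) +
    xlab("Sport") +
    theme(axis.text.x = element_text(angle = -90)) +
    ggtitle("Age Distribution of Olympic Athletes by Age and Gender: 2000-2008")
  # print(p) would draw the plot on the active graphics device.
}
```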

The next start after a No-Hitter


Olympics Boxplot

[7/23/2012 Addition: I’ve updated these plots using ggplot2 to look nicer.  They can be found here.]

Recently, I saw this pretty cool chart at the Washington Post (I originally saw the chart at this wonderful blog here) about the ages of Olympians from the past three Olympics.  I commented to myself that I thought it would be more interesting with boxplots of the data, rather than simple ranges, and I also wondered what it would look like if we used data from all of the past Olympics.

So, I wrote some R code and began scraping to get a data set with all of the Olympic athletes from all of the Games.  This took me quite some time (and work kept getting in the way), but I eventually got it right and collected the data.

Here are some of the resulting graphs:

Below is a graph of side-by-side boxplots of age for each sport by gender, with blue for male, pink for female, and green for mixed competition.  And no, the 11-year-old female swimmer is not a typo, as I originally thought.

The previous graph was kind of messy, so I’ve sorted this one by median age.  Not surprisingly, female gymnastics and rhythmic gymnastics have the lowest median ages of competitors, while equestrianism has the highest median age, at over 35 years.
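
Sorting boxes by median in base R graphics can be done by reordering the factor levels before plotting.  The sketch below is one way to do that, not necessarily the code behind the graph; the data frame is invented.

```r
# Invented ages for three sports (not the scraped Olympic data).
dat <- data.frame(
  Sport = rep(c("Swimming", "Equestrian", "Gymnastics"), each = 3),
  Age   = c(19, 21, 24, 34, 38, 41, 16, 18, 22)
)

# Reorder the Sport levels by each sport's median age.
dat$Sport <- reorder(factor(dat$Sport), dat$Age, FUN = median)

# plot = FALSE returns the box statistics without drawing anything;
# drop it to draw the sorted boxplots.
bp <- boxplot(Age ~ Sport, data = dat, plot = FALSE)

print(bp$names)       # sports in plotting order
print(bp$stats[3, ])  # row 3 of the stats matrix holds the medians
```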

The previous two graphs were only for the years 2000-2008, so I re-did the previous graph using data from all of the Olympics.  Since the obvious question arising from this graph is “What is roque?”, I have saved you the trouble of googling it by providing a Wikipedia link for roque.

This graph shows boxplots of age by year, with the color representing the host continent.

[7/15/2012 Correction: The original post had the 1956 box colored blue for Europe.  However, commenter Mules points out that 1956 should actually be yellow for Australia.  They are correct and the correction has been made.  However, as I point out in response, I’m not totally wrong: The equestrian events had to be held in Stockholm, Sweden due to quarantine restrictions.]

[7/23/2012 Correction: The graph below had some mistakes in it, including an olympian who was over 90.  This was pointed out by Kate, and has been corrected.]

And finally, we have overall age by gender.


Coke, Pop, or Soda

I found this nice picture below over at Edwin Chen’s blog post about Soda vs Pop with Twitter.  Enjoy!



So I had lunch today with two stats post-docs, and the R package ggplot2 came up.  So I started dabbling and here is my first graph.  It’s the same as my one-hitter graph, but instead it was made with ggplot2.  I think it looks a bit more professional.


Via Slate: A Map That Should Panic the Obama Campaign

Looks like we should all move to North Dakota: A Map That Should Panic the Obama Campaign.


Obama’s 262 Drone Strikes in Pakistan

Obama’s 262 Drone Strikes in Pakistan


Which Nations Consume the Most Water?

Which Nations Consume the Most Water? via


U.S. Breweries: A Map

Map of 1000 US Breweries via


Tornado Tracks

One year ago today, a tornado hit Springfield, MA and, luckily, no one in Springfield was killed.  It missed my house by about a quarter mile.  Some of my friends weren’t so fortunate and their houses were damaged to varying degrees.  My sister was leaving work right before it hit, and she had to run to her car to avoid being out in the open as it passed.  The tornado smashed most of the windows in her parked car while she was in it.  Pretty traumatic stuff.

In honor of the one-year anniversary of the Springfield tornado, I present a fantastic visualization of tornadoes: Tornado Tracks: 56 years of tornado tracks, by F-scale.