Olympics Boxplot

[7/23/2012 Addition: I’ve updated these plots using ggplot2 to look nicer.  They can be found here.]

Recently, I saw this pretty cool chart at the Washington Post (I originally saw the chart at this wonderful blog here) about the ages of olympians from the past three olympics.  I commented to myself that I thought it would be more interesting with boxplots of the data, rather than simple ranges, and I also wondered what it would look like if we used data from all of the past olympics.

So, I wrote some R code and began scraping sports-reference.com/olympics to get a data set with all of the olympic athletes from all of the games.  This took me quite some time (and work kept getting in the way), but I eventually got it right and collected the data.

Here are some of the resulting graphs:

Below is a graph of side-by-side boxplots of age for each sport by gender, with blue for male, pink for female, and green for mixed competition.  And no, the 11-year-old female swimmer is not a typo, as I originally thought.

The previous graph was kind of messy, so I’ve sorted this one by median age.  Not surprisingly, female gymnastics and rhythmic gymnastics have the lowest median ages of competitors, while equestrianism has the highest median age at over 35 years.
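For anyone wanting to try the same sorting, here is a minimal sketch in base R. It is not the code behind the graph: the `athletes` data frame and its ages are simulated purely for illustration, and only the idea of ordering the boxes by median comes from the post.

```r
# Simulated ages for illustration only; NOT the scraped Olympic data.
set.seed(1)
athletes <- data.frame(
  sport = rep(c("Swimming", "Equestrianism", "Gymnastics"), each = 50),
  age   = c(rnorm(50, mean = 23, sd = 3),
            rnorm(50, mean = 36, sd = 6),
            rnorm(50, mean = 19, sd = 2))
)

# Reorder the levels of sport by the median age within each sport.
athletes$sport <- reorder(factor(athletes$sport), athletes$age, FUN = median)

# boxplot() draws one box per factor level, so the boxes now run
# from the lowest median age to the highest.
boxplot(age ~ sport, data = athletes, las = 2, xlab = "", ylab = "Age")
```

`reorder()` only changes the order of the factor levels, so the same trick should carry over to other plotting functions that respect level order.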

The previous two graphs were only for the years 2000-2008, so I re-did the previous graph using data from all of the olympics.  Since the obvious question arising from this graph is “What is roque?”, I have saved you the trouble of googling it by providing a wikipedia link for roque.

This graph is boxplots of age by year with the color representing the host continent.
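Coloring each box by a grouping variable can be sketched in base R roughly as follows. This is an assumption about one way to do it, not the original code. The three host cities are real, but the ages are simulated and the names `games`, `continent_cols`, and `box_cols` are made up for the example.

```r
# Host continents are real (Helsinki, Melbourne, Rome);
# the ages are simulated for illustration only.
set.seed(3)
games <- data.frame(
  year      = c(1952, 1956, 1960),
  continent = c("Europe", "Australia", "Europe")
)

# Hypothetical lookup table: blue for Europe and yellow for Australia,
# matching the colors described in the post.
continent_cols <- c(Europe = "blue", Australia = "yellow")

ages <- data.frame(
  year = rep(games$year, each = 100),
  age  = rnorm(300, mean = 26, sd = 5)
)

# One color per box, in the order the boxes are drawn (years ascending).
box_cols <- unname(continent_cols[games$continent[order(games$year)]])

boxplot(age ~ year, data = ages, col = box_cols, xlab = "Year", ylab = "Age")
```

Keeping the year-to-continent mapping in its own small table makes a mistake like the 1956 one a one-line fix.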

[7/15/2012 Correction: The original post had the 1956 box colored blue for Europe.  However, commenter Mules points out that 1956 should actually be yellow for Australia.  They are correct and the correction has been made.  However, as I point out in response, I’m not totally wrong: The equestrian events had to be held in Stockholm, Sweden due to quarantine restrictions.]

[7/23/2012 Correction: The graph below had some mistakes in it, including an olympian who was over 90.  This was pointed out by Kate, and has been corrected.]

And finally, we have overall age by gender.



Posted on July 9, 2012, in Math Pictures, Olympics, R, Sports. 27 Comments.

  1. Very cool. Thanks for sharing. There are some old equestrians (septuagenarians)!

  2. isomorphismes

    So this is very nitpicky, but I like my boxplots to have closed small circles pch=19 with an alpha col=rgb(.1,.1,.1,.5) so I can see when the outliers are thick. Could you post your code to github so I can be a nitpicking “perfectionist” with the charts? 😉

  3. Neil Martinsen-Burrell

    I agree that it would be very fun to see the code that made these. A nice educational resource and a jumping off point for other ways to slice the same data. The CRAPL (http://matt.might.net/articles/crapl/) is an entertaining way to put this sort of work out in public.

    • I think I finally got github working. Here (should) be my Olympics code. Feedback (politely) is welcome.

      I really like the idea of CRAPL. And this quote from here sums up my code pretty well:

      I kept telling myself that I’d clean it all up and release it some day.

      I have to be honest with myself: this clean-up is never going to happen.

  4. Just a small point, but on the “age by year with the color representing the host continent” graph, 1956 should be yellow.

  5. I’d like to see a graph of sport v age v sex v result (say for the top 5-8 in each event, but if it’s only the medalists [top 3], that’s ok). That would reveal what age range is needed to be competitive in which sports, and exclude the (many) ceremonial entries from countries that don’t really have a serious program in that particular sport (e.g. think Jamaica in the bobsled).

    This would help illuminate which events truly require youth over experience/training (and perhaps point out which events don’t require much athleticism).

    I’d also like to see a graph adding in muscle v fat measures (e.g. BMI). I think there might be some surprises there (for US folk, anyway ;).


    • I’ll post the boxplot with medal winners, as that’s easy to do.

      As for the BMI suggestion, I’d like to see that plot, too. But I don’t have that data. If you’d like to send me that data, I’d be happy to make that graph.

      • Thanks – should be interesting.

        I’ve never run across a reasonable body fat percentage data set for any sport, let alone all olympians. I just tried once again to track one down, and didn’t find anything other than the usual numbers for a few, selected athletes. My initial interest in that stat came from a report that the US did this during a combined training camp in Colo Springs many years ago (and there were some surprises), but I was never able to find that data set, nor even confirm the story.

        Upon review, BMI would be a meaningless measure for this purpose (many olympic athletes rate as obese on that scale). Accurate body fat measurements are complex to get (e.g. an MRI), and I’m sure many countries are very protective of their athletes, which makes me fairly doubtful that anyone has this data in one place. Perhaps each country has this data though — however it would be measured by different methodologies, complicating the comparison.

        If I find such a data set though, I’ll certainly let you know.


  6. Here’s the boxplot for only medal winners from 2000-2008: http://bit.ly/O1mOtt

    • Very nice. I’m not sure of the sort order – avg of men’s and women’s ages?

      Who are the 40+ swimmers? Seems odd.

      How are the teams sports represented (again possible masking effects – men’s soccer/football seems a very narrow range to me)?

      Women’s cycling age is a surprise to me.

      • Men’s soccer (or “football” as the rest of the world refers to it “incorrectly”).
        According to Wikipedia:

        Since 1992 male competitors must be under 23 years old, with three over-23 players allowed per squad. The new format allows teams from around the world to compete equally, and African countries have taken particular advantage of this, with Nigeria and Cameroon winning in 1996 and 2000 respectively.

        Dara Torres was 41 in 2008, and she won three silver medals.

        The graphs are sorted by median.


  7. Note: as a little tweak, you can use FUN=paste0 instead of FUN=paste, sep=””.

  8. I had read before that the oldest Olympian ever was a Swedish shooter named Oscar Swahn and that he was in his 70s. But in the 4th graph it looks like there is someone who is 92 years old from 1932! Is that right? Am I missing something?

    ps I have really been enjoying these.

  9. That doesn’t look right. Must be a mistake. I’ll take a look at it tomorrow.

  10. Hi, great plots. Would it be possible to get a copy of the data (or your scraping scripts)? I would love to use this data set for my class.
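Two of the R tips from the comments above (semi-transparent solid outlier points, and `paste0` in place of `paste` with `sep=""`) can be tried together in a short sketch. Everything here is an assumption for illustration: the ages are simulated, the `sports` and `genders` vectors are made up, and the sketch uses `outpch` and `outcol` rather than the bare `pch` and `col` from the comment, because in base `boxplot()` the `col` argument fills the boxes rather than coloring the outliers.

```r
# Simulated data for illustration only; NOT the scraped Olympic data.
set.seed(2)
sports  <- c("Swimming", "Cycling")
genders <- c("M", "F")

# paste0 tip: both calls build the same 2 x 2 matrix of group labels.
labels_paste  <- outer(sports, genders, FUN = paste, sep = "")
labels_paste0 <- outer(sports, genders, FUN = paste0)
same_labels   <- identical(labels_paste, labels_paste0)

# 200 typical ages plus 10 identical high values per group,
# so that several outliers overlap at the same spot.
groups <- as.vector(labels_paste0)
one_group <- function() c(rnorm(200, mean = 25, sd = 4), rep(45, 10))
ages <- data.frame(
  group = rep(groups, each = 210),
  age   = unlist(lapply(groups, function(g) one_group()))
)

# Outlier tip: solid circles at 50% opacity, so stacked outliers look darker.
bp <- boxplot(age ~ group, data = ages,
              outpch = 19,
              outcol = rgb(0.1, 0.1, 0.1, 0.5),
              las = 2, xlab = "", ylab = "Age")
```

With the transparency in place, a single stray point looks light grey while a pile of points at the same age looks nearly black, which is the "thick" effect the commenter was after.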

  1. Pingback: Olympics Boxplot | Probabilidades, Estadística y... | Scoop.it

  2. Pingback: Gender Wars • see things differently

  3. Pingback: Olympics Box Plots: Part 2 / ggplot2 Shoutout « Stats in the Wild

  4. Pingback: Outer Product of Character Vectors in R « Stats in the Wild

  5. Pingback: Example: Boxplots of Olympic Athletes’ Age Distributions | Math 1272/5196 – Statistics (Fall 2013)

  6. Pingback: World Cup ages | StatsbyLopez
