JSM Data Art show history
In the summer of 2016, JSM was held in Chicago. I live in Chicago, so I had the idea to put on a data art show somewhere in the city to coincide with JSM. So I tweeted out the idea, asking if anyone knew of a venue that would be appropriate for hosting it. Well, through the power of Twitter, the good people at the ASA suggested I have the data art show at JSM itself. So we sent out a call for art and ended up with a nice little data art show featuring Alisa Singer, Craig Miller, Elizabeth Pirraglia, Gregory J. Matthews, Marcus Volz, and Jillian Pelto.
In 2017, we did the show again, this time in Baltimore featuring work by Lucy D’Agostino McGowan and Maëlle Salmon, Gregory J. Matthews, and Elizabeth Pirraglia. It was a little bit smaller in terms of participation, but I blame myself mostly. I had a baby at the end of 2016 so I spent a lot less time publicizing the 2017 show, and we had far fewer applicants.
So for the 2018 show in Vancouver, I want to get the word out that there will be another show and to encourage all of you to apply. (Yes you!) If you want to apply, you have until May 15 to submit your work for consideration to email@example.com. Full details of how and where to submit your work can be found here. And if you don’t want to apply yourself, please send this to someone who you think might be interested in submitting work.
Also, a few more favors to ask of you:
- I will be unable to attend JSM this year for the first time in TEN years because I am having baby number 2 in July. So I'm looking for someone who will be attending who can act as a sort of coordinator for the event. This is minimal effort and basically requires you to check that everything gets set up. I'd also like you to take some pictures of the event and send them to me.
- Would anyone be willing to set my work up at JSM if I ship it to the convention center? And then ship it back to me? I will, of course, cover all the costs of shipping.
- If anyone reading this knows someone connected to the local art world in Vancouver, I would appreciate you forwarding this along to them.
Maybe I should move to netlify too?
To make sharing code and figures easier, I've started an R Markdown blog, which you will find at http://statsbylopez.netlify.com/. All new blog posts will be shared at the new site.
I’m going to keep the WordPress site active for the time being, so past articles aren’t going anywhere. In the meantime, thanks for four years of reading and fun! Hopefully the next site will be a success.
Round of 64
I think the committee generally did a pretty good job this year, at least in terms of first-round games. The only lower-seeded teams I have favored in the first round are Florida St. over Missouri and Butler over Arkansas (though I have Houston as only a tiny favorite over San Diego State). As for the most likely upsets, here are the double-digit seeds I think are most likely to win in round 1 (in order of likelihood):
(11) Loyola-Chicago over (6) Miami (That’s not what my model says, but I’m contractually obligated to say this)
(11) San Diego State over (6) Houston
(10) Texas over (7) Nevada
(10) Providence over (7) Texas A&M
(12) Davidson over (5) Kentucky
(12) New Mexico St over (5) Clemson
(11) St. Bonaventure over (6) Florida
(12) Murray State over (5) West Virginia
(12) South Dakota State over (5) Ohio State
Then if you want to get crazy and go for some big-time first-round upsets, I would pick these (in order of likelihood):
(14) Montana over (3) Michigan
(13) Marshall over (4) Wichita State
(13) Charleston over (4) Auburn
(14) Wright State over (3) Tennessee
(14) S.F. Austin over (3) Texas Tech
(15) Georgia State over (2) Cincinnati
Round of 32
Nothing really interesting here. I have all the 1-4 seeds favored to make this round with the exception of Wichita State, which I have as an underdog to West Virginia.
Looking to pick an upset? Most likely 5 seed or higher to make the Sweet Sixteen:
(5) West Virginia
(8) Seton Hall
(11) San Diego State
(7) Texas A&M
Want to get real crazy with it?
(12) New Mexico State
(11) St. Bonaventure
(12) Murray State
Sweet Sixteen
Here is where things start to get a bit interesting. I have Villanova, Purdue, Kansas, Duke/Michigan St*, North Carolina, Cincinnati, Virginia, and Gonzaga.
I think the two most potentially interesting games in this round are Gonzaga vs Xavier and Duke vs Michigan St. I think Xavier is way overrated and Gonzaga is underrated, so it will be interesting to see whether Xavier lives up to its one seed here. The other game, Duke vs Michigan St, would make a good Final Four matchup, and I'm taking whoever wins it to go all the way to the finals. I just have no idea who is going to win this game, so I'm not picking a winner; I'm advancing them in the bracket as a /. It's my blog and I can do what I want.
Looking to pick a double digit seed to the Elite 8? How about these teams:
(11) San Diego State
(12) New Mexico State
(11) St. Bonaventure
(15) Georgia State
Elite Eight
Alright. I've got Virginia over Cincinnati, Villanova over Purdue, Duke/Michigan State over Kansas, and North Carolina over Gonzaga. That's 3 ACC teams. Ugh. And Duke. The most Ugh.
I think Butler and Texas A&M are interesting Final Four picks, as are Seton Hall and Miami.
Final Four
I’m taking Virginia over North Carolina and Duke/Michigan State over Villanova.
Championship
I’m taking Virginia over Duke.
Want some cray picks to win the championship?
(5) West Virginia
(5) Ohio State
(9) Florida State
(9) Seton Hall
If you need me Friday morning, I’ll be crying in a corner next to the remains of my bracket.
Oh. And for god’s sake NCAA, pay the players!
Of the four number 1 seeds, Virginia, Villanova, Kansas, and Xavier, Xavier is far and away the weakest number 1 seed in this tournament (I have them ranked 15th overall).
Estimated chances of making the Sweet Sixteen
- Villanova(1) – 81.14%
- Virginia(1) – 77.89%
- Purdue(2) – 76.24%
- Kansas(1) – 74.40%
- Duke(2) – 74.03%
- Michigan St(3) – 72.59%
- North Carolina(2) – 72.02%
- Cincinnati(2) – 66.60%
- Tennessee(3) – 59.16%
- Auburn(4) – 59.08%
- Gonzaga(4) – 58.10%
- Texas Tech(3) – 56.45%
- Xavier(1) – 56.06%
- Michigan(3) – 53.65%
- West Virginia(5) – 52.09%
- Arizona(4) – 49.81%
- Wichita St(4) – 43.58%
- Ohio St(5) – 39.19%
- Florida(6) – 38.18%
- Clemson(5) – 31.58%
- Kentucky(5) – 30.56%
- Miami FL(6) – 29.12%
- Florida St(9) – 29.12%
- Houston(6) – 28.67%
- Texas A&M(7) – 22.14%
- Oklahoma(10) – 19.01%
- TCU(6) – 17.83%
- Texas(10) – 16.31%
- Nevada(7) – 15.39%
- Butler(10) – 15.32%
- Missouri(8) – 14.65%
- Seton Hall(8) – 14.26%
- San Diego St(11) – 13.49%
- Creighton(8) – 12.19%
- Virginia Tech (8)- 12.05%
- Davidson(12) – 11.82%
- NC State(9) – 10.83%
- Loyola Chicago(11) – 9.83%
- Kansas St(9) – 9.77%
- Arizona St/Syracuse(11) – 8.38%
- Arkansas(7) – 8.32%
- Buffalo(13) – 7.81%
- Alabama(9) – 6.78%
- Rhode Island(7) – 6.59%
- New Mexico St(12) – 6.42%
- Providence(10) – 5.44%
- St. Bonaventure/UCLA(11) – 4.30%
- Montana(14) – 4.19%
- Murray St(12) – 3.64%
- Charleston(13) – 2.92%
- Wright St(14) – 1.89%
- Georgia St(15) – 1.70%
- S Dakota St(12) – 1.55%
- Bucknell(14) – 1.20%
- Greensboro(13) – 1.16%
- SF Austin(14) – 1.07%
>0% and <1%: Marshall(13), Penn(16), Lipscomb(15), Iona(15), Texas Southern(16), UMBC(16), CS Fullerton(15), Radford(16), LIU Brooklyn(16), NC Central(16).
Estimated chances of making the Final Four
- Virginia – 41.49%
- Villanova – 32.07%
- Purdue – 30.79%
- North Carolina – 30.33%
- Michigan St – 29.11%
- Duke – 27.19%
- Cincinnati – 22.78%
- Kansas – 21.91%
- Gonzaga – 21.27%
- West Virginia – 11.88%
- Xavier – 11.67%
- Tennessee – 10.91%
- Michigan – 10.79%
- Auburn – 10.38%
- Ohio St – 10.24%
- Texas Tech – 9.08%
- Arizona – 8.51%
- Wichita St – 7.31%
- Florida St – 4.71%
- Texas A&M – 4.68%
- Florida – 3.94%
- Houston – 3.54%
- Miami FL – 3.43%
- Kentucky – 3.42%
- Clemson – 2.99%
- TCU – 2.70%
- Oklahoma – 2.66%
- Creighton – 2.37%
- Texas – 2.27%
- Butler – 2.19%
- Nevada – 1.89%
- Kansas St – 1.64%
- Missouri – 1.51%
- Virginia Tech – 1.24%
>0% and <1%: Seton Hall, Arkansas, NC State, Arizona St, San Diego St, Loyola Chicago, Davidson, Alabama, Providence, Rhode Island, Buffalo, New Mexico St, Murray St, St Bonaventure, Montana, Georgia St, Charleston, Bucknell, Wright St, South Dakota St, Greensboro, SF Austin
Estimated chances of winning NCAA tournament
- Virginia – 13.85%
- Villanova – 12.50%
- Purdue – 11.94%
- Michigan St – 10.23%
- Duke – 9.03%
- North Carolina – 7.15%
- Kansas – 5.50%
- Cincinnati – 5.30%
- Xavier – 5.13%
- Gonzaga – 4.49%
- West Virginia – 3.33%
- Texas Tech – 1.84%
- Auburn – 1.84%
- Wichita St – 1.56%
- Tennessee – 1.38%
- Ohio St – 1.34%
- Michigan – 1.25%
- Arizona – 1.06%
- Florida St – 0.63%
- Florida – 0.60%
- Texas A&M – 0.47%
- TCU – 0.34%
- Oklahoma – 0.28%
- Houston – 0.26%
- Clemson – 0.25%
- Kentucky – 0.24%
- Creighton – 0.24%
- Butler – 0.24%
- Miami FL – 0.23%
- Missouri – 0.14%
- Virginia Tech – 0.13%
- Texas – 0.13%
- Kansas St – 0.13%
- Arizona St – 0.08%
- Seton Hall – 0.07%
- San Diego St – 0.06%
- Nevada – 0.06%
- NC State – 0.03%
- Arkansas – 0.03%
- Providence – 0.02%
- Alabama – 0.02%
- Rhode Island – 0.01%
- Loyola Chicago – 0.01%
- Davidson – 0.01%
Highest Seed Remaining in Conference Tournament
1 seeds: Virginia, Xavier, Michigan St, Villanova
2 seeds: Duke, Kansas, Purdue, Tennessee
3 seeds: Cincinnati, Wichita St, Auburn, North Carolina
4 seeds: Michigan, Kentucky, Ohio St, Clemson
5 seeds: Nevada, Arkansas, Miami (FL), Virginia Tech
6 seeds: Texas A&M, Gonzaga, West Virginia, Houston
7 seeds: Texas Tech, Missouri, TCU, New Mexico St
8 seeds: Kansas St, Arizona, Creighton, Mississippi St
9 seeds: Florida, Nebraska, Butler, Florida St
10 seeds: Louisville, NC St, Baylor, MTSU
11 seeds: St. Bonaventure, Seton Hall, Oklahoma/Oklahoma St, Texas/Marquette
12 seeds: Rhode Island, SF Austin, Loyola-Chicago, Murray St
13 seeds: San Diego St, Montana, Buffalo, Charleston
14 seeds: Bucknell, Penn, Marshall, Wright St
15 seeds: CS Fullerton, Georgia St, Texas Southern, Iona
16 seeds: NC Central, Lipscomb, UMBC/UNCG, Radford/LIU Brooklyn
First Four out: Alabama, USC, Providence, St. John’s
Next Four out: Notre Dame, ULL, Syracuse, LSU
Every couple of years in February I get around to writing about Super Bowl squares. It's been a few years, so I decided to update the post. So here is the updated two-dimensional histogram of how often certain number combinations occur in Super Bowl squares. Nothing new here: you want some combination of 7-0 or 0-7, followed by 7-7, 7-4, and 4-7. 3-0, 4-0, and 0-0 are also good. Try not to get 2-2 (though it does happen).
Next, rather than treating 7-0 and 0-7 as different outcomes, I let them count as the same outcome, giving the following two-dimensional histogram. It carries basically the same information: you want 7-0 and you don't want 2-2.
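The tallying behind these histograms is simple. Here's a minimal Python sketch (using made-up final scores, not the real historical data) showing how each game maps to a square, and how treating 7-0 and 0-7 as the same outcome just means sorting the digit pair:

```python
from collections import Counter

def digit_pairs(scores, ordered=True):
    """Tally last-digit pairs of final scores.

    scores: list of (points_a, points_b) tuples.
    If ordered is False, (7, 0) and (0, 7) count as the same square.
    """
    pairs = []
    for a, b in scores:
        pair = (a % 10, b % 10)
        if not ordered:
            pair = tuple(sorted(pair))  # collapse 7-0 and 0-7 together
        pairs.append(pair)
    return Counter(pairs)

# Hypothetical final scores, for illustration only
scores = [(27, 20), (17, 10), (24, 21), (20, 17), (30, 27)]
print(digit_pairs(scores, ordered=False).most_common(1))  # [((0, 7), 4)]
```

The real plots are built the same way, just over the full history of NFL games.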
Next, I was wondering how this has changed over time. Here is a plot of each end digit for all games played, by season. The most notable feature of this graph is that 0 dropped very rapidly from 1920 to 1960, stemming from far fewer games ending with one team getting shut out. You can also see some smaller trends over this period, such as 1, 7, 4, and 8 increasing while 6 and 3 decrease. But this plot is kind of a mess and there are way too many lines on it. Let's use facet_wrap().
Ahh! Much easier to see the trends over time. Let's go through these number by number. 0 dropped rapidly from 1920 to 1960, then increased slightly until about 1980, when it began another small decline. 1 increased quickly and has been basically flat since the 1970s. 2 has been flat forever. 3 shows a small decline through 1940 but has been slowly increasing ever since. 4 looks like it peaked in 1950 and has been slowly dropping since then. 5 is basically 2: flat. 6 follows roughly the same pattern as 3, a small decrease until 1950 and then a slow increase. 7 peaked in 1940 and has been slowly decreasing since then. 8 peaked in 1950, came back down, and has been flat since 1970. Finally, 9 has been basically flat.
Lastly, another way to look at this is with a heat map over time. The plot below shows the relative frequency of last digits over time, with dark red indicating high frequencies and dark blue indicating low frequencies.
All the code for generating these plots can be found here.
I was looking at some recent Super Bowl lines (this year's line is 5, and the eight before that were 3, 4.5, 1, 2, 4.5, 2.5, 3, 5) and thought to myself that they seemed very small. I remember Super Bowls from when I was in middle/junior/high school being much larger. The largest of these was when the 49ers played the Chargers, led by legendary quarterback Stan Humphries, in Super Bowl XXIX: the 49ers were favored by 18.5 (which they covered!). So I found historical Super Bowl lines at www.oddsshark.com and checked to see if the lines have in fact been smaller than when I was a youth. The plot below shows historical lines, with the color indicating whether the underdog or favorite won (or if there was a push). It's very clear that the lines in the 1990s were actually much larger than in recent Super Bowls. Anyone got any ideas why this might be? Is it just totally random, or is there something different about the NFL that makes the lines smaller?
Also, I've marked the Patriots Super Bowls with black dots. What you'll notice is that in the Brady era, the underdog covered in each of the first 6 Patriots Super Bowls. The first time the favorite won in a Brady-era Patriots Super Bowl was last year, when they beat the Falcons.
As an added bonus, I also looked at Super Bowl totals along with results. The total generally increased from the 1970s through the beginning of the 1990s, when it leveled off in the high 40s. It sort of looks like the totals may be trending up again, but it's hard to say whether it's actually a trend. Though it is interesting to note that the highest Super Bowl total ever was last year's 57.5, and the second highest was 8 years ago in Super Bowl XLIV at 57. (Note: this year's total is 48.)
Obviously, I have to end this post with my pick.
Based on absolutely no analysis, I’m taking the Eagles +5 and under 48 with the final score Patriots 24-21.
I heard some ridiculous stat in passing that the Cleveland Browns have started some huge number of quarterbacks in the last 10 years. So I wanted to check how often NFL teams are switching quarterbacks. I spent quite a bit of time trying to figure out how to scrape starting QB information off of Pro Football Reference, until I remembered that some smart people had written an R package called nflscrapR (there is also nhlscrapr if hockey is what you are interested in). Thanks to them, I was able to put together what I wanted in like 20 minutes. (Well, almost what I wanted. I ended up using the quarterback who had the most pass attempts in a given game rather than the starting QB, but the vast majority of the time these are the same.)
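The most-attempts proxy is easy to sketch. Here's a minimal Python version with hypothetical rows; the column names and data shapes in the real play-by-play data differ, so treat this purely as an illustration of the idea:

```python
from collections import Counter, defaultdict

def primary_passers(pass_attempts):
    """Pick each game's 'starter' by the most-attempts proxy.

    pass_attempts: list of (game_id, passer) rows, one per pass attempt.
    Returns {game_id: passer with the most attempts in that game}.
    """
    per_game = defaultdict(Counter)
    for game_id, passer in pass_attempts:
        per_game[game_id][passer] += 1
    return {g: c.most_common(1)[0][0] for g, c in per_game.items()}

# Hypothetical attempts: one main passer, plus a mop-up QB in game 2
rows = [(1, "P.Rivers")] * 30 + [(2, "P.Rivers")] * 25 + [(2, "K.Clemens")] * 3
print(primary_passers(rows))  # {1: 'P.Rivers', 2: 'P.Rivers'}
```

Counting the distinct values per team then gives the J used in the plot below.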
So here is what I made:
Each row is a team drawn in J colors, where J is the number of quarterbacks who have started for that team from 2009-2017. The rows are sorted from the team with the largest maximum number of starts by a single quarterback (the Chargers and Philip Rivers) down to... the Browns. The J colors per team are chosen using rainbow(J), so that the quarterback with the most starts for a team is reddish and the quarterbacks with the fewest starts are bluish. Click on the image and zoom in and see if you can name all the quarterbacks by their initials. As an added bonus, try to find the thing that looks like a mistake but isn't actually a mistake.
The code I used to make this is here.
Live Blogging Tango’s Live blog of my Live Blog Live Blogging His Live Blog of the WAR Podcast on Stolen Signs.
use a highly ad hoc methods to reach their replacement level
That’s a FEATURE not a bug!
… still reading
What? How is an ad hoc method better than a definition of a replacement player and then letting the data estimate quantities for us?
and a replacement pool of players and from that pool we are estimating a replacement player for each position
Still a problem.
At least we are trying. We define what a replacement level player is and then try to estimate it based on data. That is how statistics works.
You can criticize it all you want, but what's a better way to do it? I haven't seen one.
The way I’m doing it is better. It’s what BRef and Fangraphs have implemented.
There’s art to this. It’s not all science. Well, it’s Bayesian, which is artistic science. Or scientific art?
This is truly astonishing. Tom Tango is a statistical analyst, and I believe he is arguing that his completely ad hoc definition of replacement players, which he himself refers to as a feature and not a bug, is better than our attempt at defining a replacement player and then estimating it from the data. That is truly astonishing.
It’s Bayesian in the sense that you seem to have set an ad hoc prior on what replacement level should be, but then you don’t use any data to update what the replacement level is. Tango’s definition of replacement player is Bayesian if you consider a model with no data updating the prior (i.e. the posterior is the same as the prior) to be a Bayesian model.
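To make that concrete, here's a tiny sketch (my own toy numbers, nothing Tango has published) of a conjugate Beta-binomial update: if no data ever enters the model, the "posterior" is exactly the prior.

```python
def beta_update(a, b, successes, trials):
    """Conjugate Beta update for a binomial rate: prior Beta(a, b)
    plus (successes, trials) of data gives posterior Beta parameters."""
    return a + successes, b + (trials - successes)

prior = (2.0, 8.0)  # some ad hoc prior with mean 0.2
posterior = beta_update(*prior, successes=0, trials=0)  # zero observations
print(posterior == prior)  # True: with no data, nothing moves
```

That is the sense in which "Bayesian with no updating" is indistinguishable from just asserting the prior.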
Does anyone know exactly where these numbers came from?
Yes, me! I have dozens of threads on the topic on my old blog. I don’t have a single best source, but there’s a few easy to find sources.
I’m pretty sure the Fangraphs Library points to one or two of these. And I’m almost positive that Dave at Fangraphs in introducing WAR to Fangraphs talked about it at length in his N-part series.
The internet is the wild west. That’s a feature. It’s not cited like an academic paper, and that’s a feature too.
Being difficult to find is a feature? I disagree.
We think this is important because if two players have the exact same stats and one is a short stop and the other is a first baseman, we believe that the shortstop is providing more offensive value to that team because they are filling a difficult fielding position.
And if the CF outhit the LF, he’d say the same thing. Which is the problem that Sean noted.
If CFs were collectively outhitting left fielders, then I would argue they are more valuable. But this doesn't happen in practice. You can argue all you want about how theoretically valuable a position is, but a better way to do it is to let the data tell you what the answer is. For example, in 2017, using 1B as a baseline, the CF was shifted -0.02221 runs per plate appearance and the RF was shifted -0.00801 runs per plate appearance. This indicates that the CF will get more credit than the RF for the same offensive play, but only by 0.0142 runs per plate appearance. So while I think the argument that offensive players shouldn't be adjusted for what position they play is a weak one, even if I do accept it, the complaint is still meaningless because it's not an issue in practice. The collective RF will outhit the collective CF over the course of a season in practice. It's possible that someday the CF will outhit the RF, but over the course of TENS OF THOUSANDS of plate appearances this seems unlikely. And further, if that's what the data is telling us, then we should listen!!!
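The arithmetic behind that 0.0142 figure is just the difference between the two shifts quoted above:

```python
# 2017 positional shifts from the post (runs per plate appearance,
# relative to a 1B baseline)
cf_shift = -0.02221
rf_shift = -0.00801

# Extra credit a CF receives over an RF for the identical offensive play
diff_per_pa = rf_shift - cf_shift
print(round(diff_per_pa, 4))  # 0.0142
```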
Ultimately, I don’t think I am right or Tango is right (though Tango thinks openWAR is wrong in this aspect), I just think this particular issue is a matter of choice. We’ve made one choice and Tango has made another and I think they are both valid. I prefer the openWAR way of doing it, but I certainly understand different points of view on this particular choice of adjustments.
I’m just actually wondering if he has.
I did at the time. I think I even had a thread about it. It’s possible I didn’t understand all of it, but I’m pretty sure I understood just about all of it.
We aren't doing this adjustment at the player level; we are doing it at the PLATE APPEARANCE level. This means that the average plate appearance involving a CF at the plate will be valued the same as the average plate appearance involving an RF at the plate. Once all of the plate appearances are aggregated, there is no implication that the average CF has to equal the average LF. It COULD happen, but it would be due to chance.
At the level for the league, and over a period of years, it’ll work out to the same thing just about. Certainly won’t be chance. Or, I have no idea what I’m talking about.
Who cares if it won’t be interpreted properly? It’s the right thing to do statistically!
That in a nutshell is the difference between what Forman and Appelman do, and what everyone else does.
Again, truly astonishing. It sounds to me an awful lot like Tango is saying that what Forman and Appelman do is some data manipulation, and that what everyone else is doing is valid statistics. I don't really think he means it that way, but it sounds like that. I think (and I'm sure he will tell me if I am wrong) what Tango means is that Forman and Appelman are producing advanced statistics for a general audience and everyone else is making things too complicated. I disagree with this entirely. I don't think we, as statisticians, should dumb things down for a general audience. Instead we should get better at communicating what we are doing. Again, the solution isn't to dumb it down; it's to explain our work better.
would casually dismiss a discussion of uncertainty. Maybe I’m misinterpreting this though.
Eh… it’s more that I didn’t think anything would come of it. Compared to the rest of the podcast, where someone driving by could get something out of it, the talk of uncertainty was just theoretical really. We need to see it in action. In other words, it was too early.
That said, I have talked about it for PECOTA, for those who remember when Colin joined BPro, and the “percentile” presentation. I had a few good threads on it, probably around 2009 or 2010.
It’s literally happening as we speak at baseball prospectus.
Here’s one thread on OpenWAR, with Ben Baumer stopping by:
This was an interesting thread, but I'm most interested in the first comment, which you can read here:
This is so lazy. And as for doubting “it will get any traction”, I’ll just leave this here: Baumer, Brudnicki, McMurray win 2016 SABR Analytics Conference Research Awards #sickburn
You’ll see all the comments in there center around the positional issue.
And more OpenWAR discussion here:
There are some valid (and other not so valid) points that are brought up in this discussion. I’ll highlight one of the not so valid points here:
Tango says: “There is no question that fWAR and rWAR have far more ‘open source’ ideals behind them than openWAR.” What? openWAR is fully reproducible using only publicly available data, and the ENTIRE source code is publicly available. This is not true of ANY of the other major implementations of WAR. Argue all you want that openWAR sucks. I'm fine with that (I disagree, but I'm fine with it), but to say that “fWAR and rWAR have far more ‘open source’ ideals behind them than openWAR” seems to me to be a complete misunderstanding of what open source means and what the ideals of open source are.