Author Archives: statsinthewild
Every couple of years in February I get around to writing about Super Bowl squares. It’s been a few years, so I decided to update the post. So here is the updated 2 dimensional histogram of how often certain numbers occur in Super Bowl squares. Nothing new here. You want to get some combination of 7-0 or 0-7 followed by 7-7, 7-4, and 4-7. 3-0, 4-0, and 0-0 are also good. Try not to get 2-2. (Though it does happen).
Next, rather than looking at 7-0 and 0-7 as different, I let those count as the same outcome giving the following 2 dimensional histogram. Basically the same amount of information — You want 7-0 and you don’t want 2-2.
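For anyone who wants to see the difference between the two tallies, here is a minimal Python sketch (the plots in this post were made in R, so this is not the code behind them). The scores are made up for illustration and the function names are hypothetical; the only point is how counting 7-0 and 0-7 separately differs from folding them into one outcome.

```python
from collections import Counter

def ordered_pairs(scores):
    """Count (team A last digit, team B last digit) pairs, keeping order."""
    return Counter((a % 10, b % 10) for a, b in scores)

def unordered_pairs(scores):
    """Count last-digit pairs with 7-0 and 0-7 treated as the same outcome."""
    return Counter(tuple(sorted((a % 10, b % 10))) for a, b in scores)

# Made-up final scores (team A, team B), for illustration only.
sample_scores = [(27, 10), (20, 17), (24, 21), (17, 10), (30, 27)]

ordered = ordered_pairs(sample_scores)
unordered = unordered_pairs(sample_scores)
```

In the folded version every pair is stored with the smaller digit first, so 7-0 and 0-7 both land in the `(0, 7)` cell.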
Next, what I was wondering about was how this changed over time. Here is a plot of each end digit for all games played by season. The most notable part of this graph is that 0 dropped very rapidly from 1920-1960 stemming from far fewer games ending with one team getting shut out. You can also see some other smaller trends over this time period such as 1, 7, 4, and 8 increasing with 6 and 3 decreasing. But this plot is kind of a mess and there are way too many lines on it. Let’s use facet_wrap().
Ahh! Much easier to see trends in the numbers over time. Let’s go through these number by number. 0 dropped rapidly from 1920 – 1960 then increased slightly until about 1980 when it began another small decline. 1 increased quickly and has basically been flat since the 1970s. 2 has been flat forever. 3 has a small decline through 1940, but has been slowly increasing ever since. 4 looks like it peaked in 1950 and has been slowly dropping since then. 5 is basically 2: flat. 6 follows roughly the same pattern as 3, a small decrease until 1950 and then a slow increase. 7 peaked in 1940 and has been slowly decreasing since then. 8 peaked in 1950, came back down, and has been flat since 1970. Finally, 9 has been basically flat.
Lastly, another way to look at this is with a heat map over time. The plot below shows the relative frequency of last digits over time with dark red indicating large numbers and dark blue indicating low numbers.
All the code for generating these plots can be found here.
I was looking at some recent Super Bowl lines (this year’s line is 5 and the last seven previously were 3, 4.5, 1, 2, 4.5, 2.5, 3, 5) and thought to myself that they seemed very small. I remember Super Bowl lines from when I was in middle/junior/high school being much larger. The largest of these was when the 49ers played the Chargers, led by legendary quarterback Stan Humphries, in Super Bowl XXIX and the 49ers were favored by 18.5 (which they covered!). So I found historical Super Bowl lines at www.oddsshark.com and checked to see if the lines have in fact been smaller than when I was a youth. The plot below shows historical lines with the color indicating whether the underdog or favorite won (or if there was a push). It’s very clear that in the 1990s the lines were actually much larger than in the recent Super Bowls. Anyone got any ideas why this might be? Is it just totally random or is there something different about the NFL that makes the lines smaller?
Also, I’ve marked the Patriots Super Bowls with the black dots. What you’ll notice is that in the Brady era, the underdog covered in all of the first 6 Patriots Super Bowls. The first time the favorite won in a Brady Patriots Super Bowl was last year when they beat the Falcons.
As an added bonus, I also looked at Super Bowl totals with results. The general trend of the total was increasing from the 1970s through the beginning of the 1990s, when it leveled off in the high 40s. It sort of looks like the totals may be trending up, but it’s hard to say if it’s actually a trend. Though it is interesting to note that the highest Super Bowl total ever was last year (57.5) and the second highest was 8 years ago in Super Bowl XLIV (57). (Note: This year’s total is 48.)
Obviously, I have to end this post with my pick.
Based on absolutely no analysis, I’m taking the Eagles +5 and under 48 with the final score Patriots 24-21.
I heard some ridiculous stat in passing that the Cleveland Browns have started some huge number of quarterbacks in the last 10 years. So what I wanted to do was check to see how often teams in the NFL are switching quarterbacks. I spent quite a bit of time trying to figure out how to scrape starting QB information off of pro football reference, until I remembered that some smart people had written an R package called nflscrapr (there is also nhlscrapr if hockey is what you are interested in). Thanks to them, I was able to put together what I wanted in like 20 minutes. (Well, almost what I wanted. I ended up using the quarterback who had the most pass attempts in a given game rather than the starting QB, but the vast majority of the time these are the same.)
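The "most pass attempts in a game" shortcut can be sketched roughly like this in Python. This is not the R code behind the plot: the rows, quarterback labels, and function names below are all made up for illustration, and the real play-by-play data has many more columns.

```python
from collections import Counter, defaultdict

# Hypothetical play-level rows: (game_id, team, passer), one row per pass attempt.
pass_attempts = [
    ("g1", "CLE", "QB A"), ("g1", "CLE", "QB A"), ("g1", "CLE", "QB B"),
    ("g2", "CLE", "QB B"), ("g2", "CLE", "QB B"),
    ("g1", "LAC", "QB C"), ("g2", "LAC", "QB C"),
]

def primary_passers(rows):
    """Map (game_id, team) to the passer with the most attempts in that game."""
    counts = defaultdict(Counter)
    for game_id, team, passer in rows:
        counts[(game_id, team)][passer] += 1
    # On a tie, most_common keeps whichever passer appeared first in the rows.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def starts_by_team(primary):
    """Count how many games each passer led in attempts, per team."""
    starts = defaultdict(Counter)
    for (_, team), passer in primary.items():
        starts[team][passer] += 1
    return starts

primary = primary_passers(pass_attempts)
starts = starts_by_team(primary)
```

The number of distinct passers in `starts[team]` is the stand-in for how many quarterbacks "started" for that team.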
So here is what I made:
Each row is a team with J colors, where J is the number of quarterbacks that have started for each team from 2009-2017. The rows are sorted by the team with the largest maximum number of starts by a quarterback (Chargers and Philip Rivers) down to…..the Browns. The J colors per team are chosen using rainbow(J) so that the quarterback with the most starts for a team is reddish and the quarterbacks with the fewest starts are bluish. Click on the image and zoom in and see if you can name all the quarterbacks by their initials. As an added bonus, try to find the thing that looks like a mistake, but isn’t actually a mistake.
The code I used to make this is here.
Live Blogging Tango’s Live blog of my Live Blog Live Blogging His Live Blog of the WAR Podcast on Stolen Signs.
use highly ad hoc methods to reach their replacement level
That’s a FEATURE not a bug!
… still reading
What? How is an ad hoc method better than a definition of a replacement player and then letting the data estimate quantities for us?
and a replacement pool of players and from that pool we are estimating a replacement player for each position
Still a problem.
At least we are trying. We define what a replacement level player is and then try to estimate it based on data. That is how statistics works.
You can criticize it all you want, but what’s a better way to do it? I haven’t seen one.
The way I’m doing it is better. It’s what BRef and Fangraphs have implemented.
There’s art to this. It’s not all science. Well, it’s Bayesian, which is artistic science. Or scientific art?
This is truly astonishing. Tom Tango is a statistical analyst and I believe he is arguing that his completely ad hoc definition of replacement players, which he himself refers to as a feature and not a bug, is better than our attempt at defining a replacement player and then estimating it from the data. That is truly astonishing.
It’s Bayesian in the sense that you seem to have set an ad hoc prior on what replacement level should be, but then you don’t use any data to update what the replacement level is. Tango’s definition of replacement player is Bayesian if you consider a model with no data updating the prior (i.e. the posterior is the same as the prior) to be a Bayesian model.
Does anyone know exactly where these numbers came from?
Yes, me! I have dozens of threads on the topic on my old blog. I don’t have a single best source, but there’s a few easy to find sources.
I’m pretty sure the Fangraphs Library points to one or two of these. And I’m almost positive that Dave at Fangraphs in introducing WAR to Fangraphs talked about it at length in his N-part series.
The internet is the wild west. That’s a feature. It’s not cited like an academic paper, and that’s a feature too.
Being difficult to find is a feature? I disagree.
We think this is important because if two players have the exact same stats and one is a shortstop and the other is a first baseman, we believe that the shortstop is providing more offensive value to that team because they are filling a difficult fielding position.
And if the CF outhit the LF, he’d say the same thing. Which is the problem that Sean noted.
If CF are collectively outhitting left fielders then I would argue they are more valuable. But this doesn’t happen in practice. You can argue all you want about theoretically how valuable a position is, but a better way to do it is to let the data tell you what the answer is. For example, in 2017, using 1B as a baseline, the CF was shifted -0.02221 runs per plate appearance and the RF was shifted -0.00801 runs per plate appearance. This indicates that the CF will get more credit than the RF for the same offensive play, but only by 0.0142 runs per plate appearance. So while I think the argument that offensive players shouldn’t be adjusted for what position they play is a weak argument, even if I do accept it, your complaint is still meaningless because it’s not an issue in practice. The collective RF over the course of a season will outhit the collective CF over the course of a season in practice. It’s possible that someday the CF will outhit the RF, but over the course of TENS OF THOUSANDS of plate appearances this seems unlikely. And further, if that’s what the data is telling us, then we should listen!!!
Ultimately, I don’t think I am right or Tango is right (though Tango thinks openWAR is wrong in this aspect), I just think this particular issue is a matter of choice. We’ve made one choice and Tango has made another and I think they are both valid. I prefer the openWAR way of doing it, but I certainly understand different points of view on this particular choice of adjustments.
I’m just actually wondering if he has.
I did at the time. I think I even had a thread about it. It’s possible I didn’t understand all of it, but I’m pretty sure I understood just about all of it.
We aren’t doing this adjustment at the player level, we are doing this adjustment at the PLATE APPEARANCE level. This means that the average plate appearance involving a CF at the plate will be the same as the average plate appearance involving a RF at the plate. Once all of the plate appearances are aggregated, there is no implication that the average CF has to equal the average LF. It COULD happen, but it would be due to chance.
At the level for the league, and over a period of years, it’ll work out to the same thing just about. Certainly won’t be chance. Or, I have no idea what I’m talking about.
Who cares if it won’t be interpreted properly? It’s the right thing to do statistically!
That in a nutshell is the difference between what Forman and Appelman do, and what everyone else does.
Again, truly astonishing. It sounds to me an awful lot like Tango is saying that what Forman and Appelman do is some data manipulation, and that what everyone else is doing is valid statistics. I don’t really think he means it that way, but it sounds like that. I think (and I’m sure he will tell me if I am wrong) what Tango means is that Forman and Appelman are producing advanced statistics for a general audience and everyone else is making things too complicated. I disagree with this entirely. I don’t think we, as statisticians, should dumb it down for a general audience. Instead we should get better at communicating what we are doing. Again, the solution isn’t to dumb it down, it’s to explain our work better.
would casually dismiss a discussion of uncertainty. Maybe I’m misinterpreting this though.
Eh… it’s more that I didn’t think anything would come of it. Compared to the rest of the podcast, where someone driving by could get something out of it, the talk of uncertainty was just theoretical really. We need to see it in action. In other words, it was too early.
That said, I have talked about it for PECOTA, for those who remember when Colin joined BPro, and the “percentile” presentation. I had a few good threads on it, probably around 2009 or 2010.
It’s literally happening as we speak at baseball prospectus.
Here’s one thread on OpenWAR, with Ben Baumer stopping by:
This was an interesting thread, but I’m most interested in the first comment, which you can read here:
This is so lazy. And as for doubting “it will get any traction”, I’ll just leave this here: Baumer, Brudnicki, McMurray win 2016 SABR Analytics Conference Research Awards #sickburn
You’ll see all the comments in there center around the positional issue.
And more OpenWAR discussion here:
There are some valid (and other not so valid) points that are brought up in this discussion. I’ll highlight one of the not so valid points here:
Tango says: “There is no question that fWAR and rWAR have far more “open source” ideals behind them than openWAR.” What? openWAR is fully reproducible using only publicly available data and the ENTIRE source code is publicly available. This is not true of ANY of the other major implementations of WAR. Argue all you want that openWAR sucks. I’m fine with that (I disagree, but I’m fine with it), but to say that “fWAR and rWAR have far more ‘open source’ ideals behind them than openWAR” seems to me to be a complete misunderstanding of what open source means and what the ideals of open source are.
I was recently on the Stolen Signs podcast to discuss WAR with Sean Forman, David Cameron, Rob McQuown, and Jonathan Judge. You should listen to it because it is interesting. And I’m on it, so you know it’s good. Anyway, Tom Tango recently wrote a blog post live blogging the podcast. It appears he does not like openWAR. I will now respond to his live blog with a live blog of his live blog. (Tango’s words are in bold; my responses are in normal font.)
- 1. So far, I agree with everything Sean has said up to this point.
Sean is very smart.
- 2. The “replacement pool” concept for OpenWAR is… questionable. I’ve talked about this in the past. As I keep listening to the discussion as I type, you can see why it is questionable.
Every single other definition of replacement player is “questionable”. I’d also argue that all of the other WAR implementations use highly ad hoc methods to reach their replacement level. As far as I can tell, replacement level for the other methods sets a team of replacement level players to win 48 or 52 games and that’s the level. Maybe it’s more sophisticated than that, but I can’t tell (again, that’s one of the problems with these methods not being transparent and open-source). openWAR actually goes about defining what a replacement level player should be and then ESTIMATES the replacement level based on that definition. We aren’t defining the replacement level, we are letting the data estimate one.
- 3. The OpenWAR guy, Greg, is talking about how his source code is available, and anyone can make the change to just the replacement component. Which is great. Why however is no one doing this?
There are actually three openWAR guys. I’m one of them. The other two are Ben Baumer, a professor at Smith College, and Shane Jensen, a professor at the University of Pennsylvania.
About our source code being open-source, I agree, it is great. Why is nobody doing this? I don’t know. Maybe they are. I have no idea what people are doing with the code. I would guess though that there are a lot of people in the SABR community who just don’t have the technical R skills to work with our code.
- 4. And there you go. At the 21 minute mark. THAT is the questionable choice of OpenWAR. Because the replacement pool by position will get you into trouble. Alot. ALOT.
In practice it doesn’t though. Our replacement level is very close to the other ad hoc choices of replacement level. I believe our replacement level for 2017 ended up being a team that won like 44 games.
We also aren’t defining a replacement pool by position. We are defining a replacement pool of pitchers and a replacement pool of players and from that pool we are estimating a replacement player for each position.
I would also argue that at least we are trying to define what a replacement player is and then estimate it. You can criticize it all you want, but what’s a better way to do it? I haven’t seen one. The definition of replacement level is incredibly vague. It’s a concept and trying to define it formally is very tricky. But at least we are trying.
- 5. At the 24-26 minute mark, discussion on the positional adjustments. This is more ripe for aspiring saberists. What Dave said is mostly correct, but what Sean said about using offense is also applied to “round out” how I did it. The basic point, that they all seem to agree to, is that the fielding spectrum from Bill James is essentially the basis.
I have no idea where these numbers come from. They were referred to by Dave as “not most rigorous calculation”, which has me a bit worried from the point of view of a statistician. Does anyone know exactly where these numbers came from?
- 6. We’re now at the 28 minute mark. Position-adjusted OFFENSE in WAR. The pitcher is definitely a separate issue. I again completely disagree with OpenWAR on the issue overall. Just the pitcher is the exception and OpenWAR agrees that most people don’t like it.
As I said in the podcast, a lot of people don’t like this. Here is my argument FOR it: A baseball player on offense doesn’t exist in a vacuum. You can’t just get the nine best hitters and put them in your lineup. Every hitter (minus the DH) has to play somewhere in the field. If you have a shortstop who hits 50 HR’s that is more valuable than a 1B who hits 50 HR. You can’t just throw 9 batters out there, they also need to play a position. I don’t think that should be ignored. The pitcher is just the most extreme case of the need for a positional adjustment, but if you are going to do it for the pitcher (and you do need to do it for the pitcher) you should also be doing it for the other players. But ultimately, I don’t think this is a matter of right and wrong, it’s just a difference of opinion. And that’s ok.
Note about what is said around 27:30: I also think there is a misunderstanding about what I, openWAR Guy, mean by “offensive position adjustments”. Sean says: “As Tango has pointed out, there are years where the center fielder out hits the right fielder. You just happen to have Mike Trout….and center fielders are awesome right now. And if we use the offensive difference we end up with the conclusion that right fielders are better defensively than center fielders. Which is clearly wrong.” I totally agree with this. We shouldn’t conclude anything about the defensive ability of any player based on their offensive performance. And openWAR isn’t doing that. Our offensive positional adjustment will have NO impact on that player defensively. They are completely separate from each other. We are merely adjusting a player’s offensive performance based on where they are playing in the field. We think this is important because if two players have the exact same stats and one is a shortstop and the other is a first baseman, we believe that the shortstop is providing more offensive value to that team because they are filling a difficult fielding position. We are NOT, however, drawing any conclusions about their fielding ability from their offensive performance.
- 7. So, this is why they are wrong. It implies the average offensive CF = average offensive LF. And this is true even if the average CF hits better than the average LF. This was the issue Pete Palmer faced with the Hidden Game. He’s wrong then, and OpenWAR is wrong now.
I really wonder if Tango has actually read the entire openWAR paper? I’m not saying that to be a jerk, I’m just actually wondering if he has.
Anyway, I think Tango is misunderstanding something here. We are not implying that “the average offensive CF = average offensive LF”. We aren’t doing this adjustment at the player level, we are doing this adjustment at the PLATE APPEARANCE level. This means that the average plate appearance involving a CF at the plate will be the same as the average plate appearance involving a RF at the plate. Once all of the plate appearances are aggregated, there is no implication that the average CF has to equal the average LF. It COULD happen, but it would be due to chance.
Also, as a note, I don’t think anything we did was “wrong”. Nothing that we are doing is “wrong”. It’s just a different take on WAR, and you are free to like or dislike any or all of it. But it’s not “wrong”. Just as fWAR, and bWAR, and WARP aren’t “wrong” they are just different.
- 8. OpenWAR guy, Greg, is correct that at the 31 minute mark that the pitcher is an exception, but he is wrong in applying that to EVERY position. Pete was wrong 35 years ago, and that’s why we are where we are. This is why OpenWAR is being held back. They can argue all they want, but this is an argument from the 1980s that has been argued and rejected. Let’s move on.
I think being called openWAR guy is meant to be disrespectful and I will not put up with it. In the future, please refer to me by my actual title: Baseball Prospectus Intern Guy. (I’m JOKING. I don’t actually think it’s disrespectful. I kind of like the title……).
Again, I’ll refer to my answer to the previous question. We are doing the adjustment at the plate appearance level and it has nothing to do with defensive value.
Also again, we aren’t “wrong” we just have a different opinion. I also don’t think openWAR is being held back. It just takes a long time for these things to catch on. And with a mention from someone as important and influential as Tom Tango Guy, I think openWAR is ready to explode. On a serious note, I think there are a lot of people in the SABR world who simply don’t possess the technical skills to wade through all of our code, even if they wanted to.
- 9. Sean at the 33 minute mark starts speaking, and I know he is going to say the right thing. And he is. I’m just going to call it now, I agree with whatever Sean will say for the rest of the podcast. I’ll let you know if I change my mind.
Sean is smart.
- 10. 36 minute mark talking about uncertainty. I’m as boring as Jonathan. I love chicken caesar.
My most important contribution to the podcast episode was to ask what kind of salad Judge was eating. I agree with Tango here. Everyone loves chicken Caesar.
- 11. My comment: part of the issue of uncertainty in WAR is not the measurement, but also the CHOICE. For example, using seasonal numbers rather than RE24.
- 12. As for the actual measurement issue Jonathan is talking about, at the 39 minute mark, Jonathan nails the issue: I think people will ignore the confidence interval. It’s really an issue of presentation.
People might ignore the interval. They probably will. But who cares? What is your goal? To just cater to what a reader wants? Or to do statistically rigorous and valid work? Those are not always the same thing and that’s ok.
- 13. Sean’s talking about this now at the 40 minute mark. My earlier comment applies.
But what is your goal?
- 14. At the 44 minute mark. Yes, this is a presentation issue. This is what the issue is. That, and ultimately, interpretation. No way that 1 SD will be interpreted properly. Nor will 2 SD if that’s the range they will show.
Who cares if it won’t be interpreted properly? It’s the right thing to do statistically!
- 15. Still waiting on the runs to wins conversion. This was almost the entire issue that Poz and James brought up. Doesn’t look they are going to get into it.
We did not get into it. But I have some thoughts. Though I’ll save them for another time. According to Harry, “we didn’t want to get into it, the point of the show was to move past Bill’s look backwards.”
- 16. The error bar is going to be pretty similar player to player, especially for full time players. I think you just need to show that high-level, league-wide, rather than player by player.
Is this true? I really don’t know how different these errors will be. I think it’s possible that error bars will be bigger for players who strike out and hit home runs a lot, as they are higher variance players than guys who have high OBP and get lots of singles and walks. But I don’t know if this difference is large enough to be really substantial. So I think I mostly agree with Tango on this point, especially for full time players. The error bars will be smaller for part time players since WAR is essentially a counting statistic and the variance will be lower for lower values of WAR. I just have no idea how much smaller they will be.
- 17. Whoever is talking at the 48 minute mark asks the pertinent question. I think it’s Dave. But yeah, we SHOULD see the extreme players have an asymmetric range… BUT, the overall average will still be correct. So, I don’t see the benefit of saying that a guy’s +20 fielding runs is between +5 and +25, while still maintaining +20 as the average for that distribution. If you show the range as 5-25, someone, I promise you, will think the “average” is now 15.
Again, if the goal is to do statistically rigorous and valid work, it’s ok to present an asymmetric interval.
- 18. Ok, well that was fun. I recommend the first 36 minutes, maybe up to the 40th minute. After that, it’s all about the uncertainty, and I don’t think you’ll learn much.
I’m not offended by being called openWAR guy, but I am offended when someone doesn’t care about uncertainty! One of the MAIN goals of statistics and statistical analysis is to quantify the uncertainty in your estimates. This is (or should be) taught in every intro statistics class. This is really important! And I’m a bit disappointed that someone with such a high profile in the SABR world (which I just view as a statistics world applied to baseball) would casually dismiss a discussion of uncertainty. Maybe I’m misinterpreting this though.
Cleveland at Cincinnati
Prediction: Bengals 23-15
Pick: Bengals -7.5
Total: Under 39
Tennessee at Indianapolis
Prediction: Titans 25-19
Pick: Titans -3.5
Total: Under 47.5
Buffalo at Kansas City
Prediction: Chiefs 28-17
Pick: Chiefs -9
Total: Under 49
Miami at New England
Prediction: Patriots 32-15
Pick: Dolphins +17
Total: Under 49
Carolina at NY Jets
Prediction: Jets 19-18
Pick: Jets +6
Total: Under 40
Chicago at Philadelphia
Prediction: Eagles 28-13
Pick: Eagles -14
Total: Under 44
New Orleans at LA Rams
Prediction: Rams 28-26
Seattle at San Francisco
Prediction: Seahawks 28-17
Pick: Seahawks -7
Total: Under 45.5
Jacksonville at Arizona
Prediction: Jaguars 21-15
Pick: Cardinals +6
Total: Under 38
Denver at Oakland
Prediction: Raiders 24-20
Pick: Raiders -3.5
Total: Over 42
Green Bay at Pittsburgh
Prediction: Steelers 27-15
Pick: Packers +14
Total: Under 43.5
Houston at Baltimore
Prediction: Ravens 23-16
Pick: Texans +7
Total: Over 38
Most of the people who know what they are doing don’t have the Philadelphia Eagles ranked number 1 (e.g. TeamRankings.com, ThePowerRank.com, FiveThirtyEight.com) even though the Eagles are 8-1 and have the best record in the NFL. So I just did a quick look at their schedule to see who they have played. It’s not great. They’ve played only two teams with winning records (Carolina and Kansas City) and in those games they are 1-1 (they beat Carolina and lost to KC). The full list of their opponents’ records looks like this:
That adds up to an opponents’ collective record of 33-50 or about 39.8% winning percentage. A team with that winning percentage would win 6.36 games in a season with 16 games.
Now let’s compare that with the Atlanta Falcons who are 5-4. They have played 6 teams with winning records and are 3-3 in those games. Their opponents’ records look like this:
This gives a collective record for their opponents of 45-38, which is about 54.2%. That winning percentage amounts to winning 8.67 games out of 16.
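The arithmetic behind those two numbers is simple enough to check with a few lines of Python. This is just a sketch of the calculation described above; the function name is made up, and ties are ignored.

```python
def expected_wins(wins, losses, season_games=16):
    """Return (winning percentage, wins over a full season at that percentage)."""
    pct = wins / (wins + losses)
    return pct, pct * season_games

# Collective opponent records quoted above.
eagles_pct, eagles_wins = expected_wins(33, 50)
falcons_pct, falcons_wins = expected_wins(45, 38)

print(f"Eagles' opponents:  {eagles_pct:.1%}, {eagles_wins:.2f} wins")
print(f"Falcons' opponents: {falcons_pct:.1%}, {falcons_wins:.2f} wins")
```

The gap between the two schedules works out to a bit more than two wins over a 16-game season.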
This is a really simplistic way of looking at strength of schedule, but with a difference this stark you don’t need any complex analysis to see it. The difference in schedules through 9 games is huge: the Eagles probably aren’t really as good as their record indicates and the Falcons are probably a bit better than theirs. It’s so easy to see that from just looking at their schedules. That’s why when I see the Eagles ranked number 1 at this point in the season in a set of rankings, it’s hard to take those rankings seriously. For instance, ESPN. Why don’t they just use the rankings from FiveThirtyEight, which are actually based on, you know, data?
- New England (7-2)
- Atlanta (5-4)
- Pittsburgh (8-2)
- Dallas (5-4)
- Kansas City (6-3)
- LA Rams (7-2)
- New Orleans (7-2)
- Seattle (6-3)
- Philadelphia (8-1)
- Jacksonville (6-3)
- Minnesota (7-2)
- Oakland (4-5)
- Carolina (7-3)
- LA Chargers (3-6)
- Tennessee (6-4)
- Buffalo (5-4)
- Detroit (5-4)
- NY Jets (4-6)
- Washington (4-5)
- Baltimore (4-5)
- Cincinnati (3-6)
- Denver (3-6)
- Chicago (3-6)
- Tampa Bay (3-6)
- Arizona (4-5)
- NY Giants (1-8)
- Green Bay (5-4)
- Miami (4-5)
- Houston (3-6)
- Indianapolis (3-7)
- Cleveland (0-9)
- San Francisco (1-9)