Kathy Explains all of Statistics in 30 Seconds and “How to Succeed in Sports Analytics” in 30 Seconds
@causalKathy explains all of statistics in 30 seconds.
I spent the weekend of October 19-21 in Pittsburgh at the 2018 CMU Sports Analytics Conference. One of the highlights of the weekend was Sam Ventura asking me to explain causal inference in 15 seconds. I couldn’t quite do it, but it morphed into trying to explain all of statistics in 30 seconds. Which I then had to repeat a few times over the weekend. Figured I’d post it so people can stop asking. I’m expanding slightly.
Kathy Explains all of Statistics in 30 Seconds
Broadly speaking, statistics can be broken up into three categories: description, prediction, and inference.
- Prediction
  - Mapping inputs to outputs
  - Predicting outcomes and distributions
- Inference/Causal Inference
  - Prediction if the world had been different
  - Counterfactual/potential outcome prediction
I’ll give an example in the sports analytics world, specifically basketball (this part is what I will say if I only have 30 seconds):
- Slicing your…
This morning Yahoo Sports published a piece titled “Op-ed: How one flawed study and irresponsible reporting launched a wave of CTE hysteria,” written by Merril Hoge, a former NFL running back and ESPN analyst, and Dr. Peter Cummings, an Assistant Professor with the Boston University School of Medicine. In it, they criticize the methods of an article that was published in the Journal of the American Medical Association (JAMA) in 2017 entitled “Clinicopathological Evaluation of Chronic Traumatic Encephalopathy in Players of American Football.” The JAMA article, as they point out, has been an important part of widespread discussions about football and brain injury.
Hoge and Cummings make three major points against the article (which they call strikes 1, 2, and 3. Get it? It’s a sports reference!):
- There was no control group
- There was selection bias
- There was a failure to control for external factors
And you know what? The authors are exactly correct.
What kind of moron would design a study without a control group? How can you design a randomized control trial (RCT) without a control group?!?! It’s right there in the name! It’s insane. Amateur hour. Next, as anyone with half a brain knows, you have to select your subjects randomly in an RCT. It’s also right in the name!! Randomized. Control. Trial. And then, get this, they didn’t even control for external factors! As Cummings and Hoge point out, “nearly half the players had a history of substance abuse, suicidal thinking or a family history of psychiatric problems.” Can you imagine not controlling for that stuff in a randomized control trial? How could anyone be so stupid?
Ok, seriously. How could someone do a randomized control trial without having a control group, randomization, and controlling for external factors?
Well…they didn’t. The JAMA study wasn’t a randomized control trial. It was a case series. The original authors made no attempt to do a randomized control trial. If you look at the original study, the authors note in the very first line of the findings section of the abstract that they are using a “convenience sample.” This means that they know they aren’t doing a randomized control trial. Which means they aren’t even attempting to show a causal link between playing football and CTE. In fact, in the Wikipedia entry for case series it explicitly mentions that “unlike studies that employ an analytic design (e.g. randomized control trials)…case series do not…look for evidence of cause and effect.” This is how the actual authors summarize their findings in the conclusions section of the paper:
In a convenience sample of deceased players of American football, a high proportion showed pathological evidence of CTE, suggesting that CTE may be related to prior participation in football.
They NEVER claim a causal link. Which makes this statement by Hoge and Cummings all the more odd:
Then we took a closer look at the study that led to the Times story — apparently something few journalists had bothered to do. When we dug into the methodology, we were floored. The study was so badly flawed that it was nearly worthless. But that’s not what had been reported in practically every major media outlet in the world. Thanks to the barrage of sensationalist coverage, the “110 out of 111 brains” story had turned into a wildfire, and we were standing around with a couple of garden hoses, telling everybody to calm down.
They criticize the authors of the Times story for not taking a closer look at the original paper. But I wonder if Hoge and Cummings actually read the original paper. Their article on Yahoo Sports criticizes three aspects of the original study that literally aren’t a part of the original study.
The cynic in me thinks maybe all of this is just a ploy to sell more books. Because as you’ll notice, Hoge and Cummings are promoting a book called “Brainwashed: The Bad Science Behind CTE and the Plot to Destroy Football”, which I’m not going to link to because, based on this Yahoo Sports article, I’m guessing it’s trash*.
Side Note 1
I will admit that Hoge and Cummings do have a point that the media coverage of this research was fairly skewed. The media plays up the potential link between football and CTE and maybe doesn’t fairly state that as of yet no causal relationship has been established between playing football and CTE. But even if the media coverage was fair, what Hoge and Cummings should be doing is calling for more high-quality research into the relationship between football and CTE instead of trashing a case series for not being a randomized control trial. Because right now the truth is that we simply don’t know if there is a causal relationship. But we also don’t know that there isn’t a causal relationship.
Side Note 2
Smoking and lung cancer is a famous example where there was clearly a correlation known for a long time, and for decades people were talking about how the relationship wasn’t shown to be causal. Eventually, scientists were able to show a causal link. But it was all the demonstrated correlation between lung cancer and smoking that led people to study the causal relationship between the two. So, this CTE study is important not because it shows a causal link—which, again, no one is trying to demonstrate here—but because it is consistent with what we would expect if the relationship between CTE and football were causal, and will lead to more work on studying that relationship. This is an important part of how science works.
And finally, remember, “correlation does not even imply correlation”
*I need to be clear that I HAVE NOT read the book. The book could be amazing. I’m ONLY commenting on the Yahoo Sports Op-Ed (added 10/24/2018 – 8:03am).
Wins are a shitty statistic in baseball. You can be a great pitcher on a terrible team and end up with a win-loss record that isn’t impressive at all. For instance, you could be Jacob deGrom and be an outstanding pitcher on an awful team (the Mets were 77-85). deGrom ended the season with a win-loss record of 10-9 and the Mets were 14-18 in games that he started. He also ended the season with an ERA of 1.70 and only gave up more than 3 earned runs in an outing once during the entire season (he gave up 4 earned runs on April 10 against Miami). From May 18th through the end of the season he had 24 quality starts in a row (6+ IP, 3 or fewer runs). 24 QS IN A ROW!
So I wanted to look at how someone so dominant could end up with only 10 wins and his team going 14-18 when he started. The first thing I looked at was the scores of these games. Maybe the Mets weren’t scoring a ton of runs. In 21 of deGrom’s 32 starts (65.625%) the Mets scored three or fewer runs, and the Mets were 4-17 in these games. (The league average for runs per game in 2018 was 4.45.) And in only 7 games (21.875%) did the Mets score 6 or more runs. They were 6-1 in these games. Below is the scatterplot for the scores of Mets games that deGrom started in the 2018 season.
Next I looked at the number of earned runs that deGrom gave up in each start and how many runs the Mets’ opponent ended up with at the end of the game. The plot below shows the number of earned runs allowed by deGrom in an outing versus the total number of runs allowed by the Mets. In 26 of deGrom’s 32 starts the Mets managed to give up at least one run that wasn’t credited to deGrom. (This could be either the bullpen giving up runs or unearned runs because of errors. Either way, not deGrom’s fault.)
Think about this. In deGrom’s 32 starts he pitched a total of 217 innings and gave up 41 earned runs. That’s about 75.35% of the innings in those games (assuming all games went a full 9 innings). Between unearned runs and runs given up by other Mets pitchers, the Mets allowed their opponents 63 more runs. In games deGrom started, he pitched 217 of the innings and the Mets needed to get through only 71 more. That’s 63 runs that weren’t deGrom’s fault in 71 innings. While this isn’t exactly a runs per game calculation, 63 runs in 71 innings is almost 8 runs per 9 innings. Did I mention the Mets were bad?
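The arithmetic in that paragraph can be checked in a few lines of R. This is only a sketch using the totals quoted above; the variable names are mine, and like the text it assumes every game went exactly nine innings.

```r
# Totals quoted in the post for deGrom's 2018 starts.
starts          <- 32
innings_pitched <- 217
earned_runs     <- 41
other_runs      <- 63   # unearned runs plus runs charged to other Mets pitchers

# Assumption (same as the text): every game lasted exactly 9 innings.
total_innings     <- starts * 9
era               <- 9 * earned_runs / innings_pitched   # about 1.70
share_pitched     <- innings_pitched / total_innings     # about 0.7535
remaining_innings <- total_innings - innings_pitched     # 71
other_runs_per_9  <- 9 * other_runs / remaining_innings  # about 7.99

round(c(era = era,
        share_pct = 100 * share_pitched,
        remaining_innings = remaining_innings,
        other_runs_per_9 = other_runs_per_9), 2)
```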
So how many games should the Mets have won with deGrom starting this year? I sort of checked the answer to this question by looking at how many wins the Mets would have had in deGrom’s starts if the bullpen were simply league average. To do this, I computed the league average runs-per-out rate and then, for each game, took n draws from a Poisson distribution with this mean, where n is the number of outs left in the game after deGrom’s outing. I then added up the runs given up by the simulated bullpen and added them to deGrom’s ER for that game. Finally, I counted how many times the Mets would have won (with ties going 50-50 to each team). The mean is 18.85 wins with a median of 19, and 95% of the simulations had a win total between 16 and 22. Basically this means that, in my crude calculations, the Mets’ bullpen cost the Mets somewhere between 2 and 8 wins in deGrom’s starts.
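For readers who want the idea without digging through the linked script, here is a minimal sketch of that kind of simulation. It is not the code from deGrom.R. The function name `simulate_wins`, its arguments, and the six example games are hypothetical, and the runs-per-out rate is approximated as 4.45 runs per game divided by 27 outs. It also assumes every game lasts exactly 27 outs and breaks each tie with a coin flip.

```r
# Sketch only: simulate Mets wins in a set of starts if the outs after the
# starter were covered by a league-average staff.
# Assumption: each game lasts exactly 27 outs for the pitching side.
simulate_wins <- function(mets_runs, starter_er, starter_outs,
                          runs_per_out, n_sims = 10000) {
  stopifnot(length(mets_runs) == length(starter_er),
            length(mets_runs) == length(starter_outs))
  outs_left <- pmax(27 - starter_outs, 0)
  replicate(n_sims, {
    # The sum of n independent Poisson(rate) draws is Poisson(n * rate),
    # so one draw per game is equivalent to one draw per remaining out.
    bullpen_runs <- rpois(length(outs_left), lambda = outs_left * runs_per_out)
    allowed <- starter_er + bullpen_runs
    wins <- sum(mets_runs > allowed)
    ties <- sum(mets_runs == allowed)
    wins + rbinom(1, size = ties, prob = 0.5)  # ties go 50-50
  })
}

# Hypothetical inputs for illustration (NOT deGrom's actual game log).
set.seed(2018)
mets_runs    <- c(2, 5, 0, 3, 1, 6)
starter_er   <- c(1, 0, 1, 2, 0, 3)
starter_outs <- c(21, 24, 18, 21, 27, 20)
runs_per_out <- 4.45 / 27   # assumed league-average rate

sims <- simulate_wins(mets_runs, starter_er, starter_outs, runs_per_out)
mean(sims)
quantile(sims, c(0.025, 0.5, 0.975))
```

Note that, like the description above, this adds simulated runs only to the starter's earned runs, so unearned runs scored while the starter was in the game are ignored.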
The histogram of the simulated values of the number of Mets wins if deGrom had a league average bullpen can be seen below. Almost never is it 14, the actual win total for the Mets in deGrom’s starts.
So should deGrom win the Cy Young Award with a 10-9 record? Well, his ERA was 1.70. The next best in the National League was Nola at 2.37. And the only other pitcher in the entire league to end the season with an ERA under 2 was Blake Snell (1.89), who is basically a lock to win the AL Cy Young after going 21-5. So would I give it to deGrom? Probably. But I wouldn’t be that upset if Nola won it in the NL.
The one thing we can all agree on is the Mets suck.
Finally, you can see my code here. It’s a mess, but there it is: https://github.com/gjm112/StatsInTheWild/blob/master/deGrom.R
Last summer I wrote about how much water Hurricane Harvey had dropped on the Houston area. And I decided to revisit this with Florence making landfall. Florence is expected to drop 10 trillion gallons of water on the Carolinas and the surrounding regions. That’s a ton of water. Here it is relative to the Great Lakes.
But here’s the crazy part. Harvey dropped 25 trillion gallons of water on Houston! Florence is going to drop more water than the Great Salt Lake. And Harvey dropped 2.5 times that!!!! Holy crap Harvey.
Update: Someone on Twitter suggested that these advancement grids should be weighted by how likely the scenario is. Here is what those looked like before the last games of Groups G and H.
Each team in the World Cup is through 2 games and every team has one game left in the group stage. Below you will find graphics for each team’s advancement scenarios based on the two remaining games in their group.
- Green indicates that a team will win their group.
- Yellow means they finish second, and red means they are eliminated.
- Light green means they are tied for first after points and goal differential and the winner is determined by further tie breakers.
- Orange indicates a tie for second after points and goal differential and the team that moves on is determined by further tie breakers.
- Gray indicates a three way tie after points and goal differential and further tie breakers are applied to decide who moves on.
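A grid like the ones described by this legend could be built roughly as follows. This is a sketch under assumptions, not the code behind the graphics: the helper names (`play`, `classify`, `advancement_grid`) are hypothetical, each remaining match is reduced to its goal margin, and only points and goal differential are applied, so anything still level is just labelled as a tie. The Group A standings used as input are my understanding of the table after two games and should be treated as illustrative.

```r
# Apply one match result; `margin` is the first team's goals minus the second's.
play <- function(state, home, away, margin) {
  state$gd[home] <- state$gd[home] + margin
  state$gd[away] <- state$gd[away] - margin
  if (margin > 0) {
    state$pts[home] <- state$pts[home] + 3
  } else if (margin < 0) {
    state$pts[away] <- state$pts[away] + 3
  } else {
    state$pts[c(home, away)] <- state$pts[c(home, away)] + 1
  }
  state
}

# Label a team's finish using points, then goal differential, only.
classify <- function(team, pts, gd) {
  ahead <- sum(pts > pts[team] | (pts == pts[team] & gd > gd[team]))
  level <- sum(pts == pts[team] & gd == gd[team]) - 1  # others level on both
  if (ahead >= 2) return("eliminated")
  if (level == 0) return(if (ahead == 0) "first" else "second")
  if (ahead == 1) return("tied for second")
  if (level == 1) return("tied for first")
  "three-way tie"
}

# Rows index the margin of match1, columns the margin of match2.
advancement_grid <- function(team, pts, gd, match1, match2, margins = -3:3) {
  labels <- as.character(margins)
  grid <- matrix(NA_character_, length(margins), length(margins),
                 dimnames = list(match1_margin = labels, match2_margin = labels))
  for (i in seq_along(margins)) {
    for (j in seq_along(margins)) {
      state <- list(pts = pts, gd = gd)
      state <- play(state, match1[1], match1[2], margins[i])
      state <- play(state, match2[1], match2[2], margins[j])
      grid[i, j] <- classify(team, state$pts, state$gd)
    }
  }
  grid
}

# Illustrative inputs: Group A after two games (assumed values).
pts <- c(Russia = 6, Uruguay = 6, Egypt = 0, "Saudi Arabia" = 0)
gd  <- c(Russia = 7, Uruguay = 2, Egypt = -3, "Saudi Arabia" = -6)
russia_grid <- advancement_grid("Russia", pts, gd,
                                match1 = c("Uruguay", "Russia"),
                                match2 = c("Saudi Arabia", "Egypt"))
russia_grid
```

Each cell of the returned matrix maps to one color in the legend; weighting the cells by how likely each scoreline is would need a separate model of match outcomes.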
This is a pretty boring group.
- Russia wins the group with a win or tie.
- Uruguay wins the group with a win.
- Egypt is out.
- Saudi Arabia is out.
Group B is much more interesting than A.
- Morocco has been eliminated.
- Iran advances with a win OR a tie and Morocco wins by 2 or more over Spain. If Morocco wins by 1 over Spain and Iran wins, it goes to goals for as a tie breaker. Iran can actually still win the group with a win and a Morocco win or tie.
- Portugal advances with a win or tie. They can also advance with a loss and a Morocco win. As long as Morocco beats Spain by more than Portugal loses to Iran.
- Spain advances in basically all scenarios EXCEPT a loss and an Iran tie OR a loss and a Morocco win by more than Iran beats Portugal.
- France is through. They can win the group with a win or a tie over Denmark.
- Denmark advances unless they lose and Australia wins.
- Australia gets in with a win and a loss by Denmark. If Australia wins by 1 and France wins by 1, Denmark and Australia tie for second and the team with more goals scored would advance. If it’s still tied the rest of the tie breakers would be applied.
- Peru is eliminated.
- Croatia is through. They win the group unless they lose and Nigeria wins AND Nigeria can make up a goal differential of 5.
- Nigeria advances with any win and can advance with a tie as long as Iceland doesn’t win by 3 or more.
- Iceland can advance with a win and an Argentina win or tie. But they still need to make up a goal differential of 2.
- Argentina advances with a win and a Croatia win or tie. They can also advance with a win and an Iceland win, but they would need to make up the 1 goal differential with Iceland.
- Brazil advances with a win or tie. They can still advance with a loss as long as Costa Rica wins and Brazil maintains its 1 goal differential advantage over Switzerland.
- Switzerland is through with a win or tie. They also advance with any Brazil win. They can also advance with a loss as long as Serbia beats Brazil and they can make up the 1 goal differential behind Brazil.
- Serbia advances with a win. Or they can advance with a tie and a Costa Rica win.
- Costa Rica is eliminated.
Group F is nuts.
- Mexico is 2-0-0 and hasn’t clinched yet. They win the group with a win or a tie and there are even some scenarios where they win the group with a one goal loss and a South Korean tie or victory.
- Germany, who was almost eliminated by Sweden, advances with a win* or a tie and a Mexico win. Germany will tie for second if they tie and Mexico ties or if they lose by 1 and Mexico wins.
- Sweden advances with a win and a German loss. Or a tie and a German loss. Or a win and a German tie. In fact they most likely advance with a win, with a few exceptions*.
- South Korea, currently sitting on two losses, can still somehow advance. They simply need to beat Germany by 2 and have Mexico beat Sweden. Simple……
*If Sweden wins by 1 and Germany wins by 1, Sweden, Germany, and Mexico would be in a three way tie for first with 6 points each and the same goal differential. That means some team with 6 points would not advance. Heartbreaking. 6 points is a lot.
- England is through and wins the group with a win over Belgium.
- Belgium is through and wins the group with a win over England.
- Tunisia is eliminated.
- Panama is eliminated.
If Belgium and England tie, all the tiebreakers are tied down to fair play points and the winner of the group will be chosen based on who has fewer yellow cards.
- Japan is through with a win or tie. They can also advance with a loss and a Senegal win.
- Senegal is through with a win or tie. They can also advance with a loss and a Colombia win but that would come down to goal differential to break their tie with Japan.
- Colombia advances with any win or with a tie and a Poland win.
- Poland is eliminated.
So I was watching game 1 of the NBA Finals, and I got to thinking about how some of these players have been around for a long time in the NBA (LeBron was drafted in 2003!!!). So I went to Basketball Reference and looked back at some old drafts. Then I got the idea to scrape drafts and current rosters to see what each current NBA team looks like in terms of the years and positions in a draft. This led to me staying up way past my bedtime screwing around with rvest and plotly.
What I started with was a scatter plot of draft position on the x-axis versus year drafted on the y-axis with each point having the color of their team. That’s easy enough to do in ggplot. But what I really wanted to do was make it interactive so that when you clicked on a point, all the other points for the team will also highlight. Now on Thursday night/very early Friday morning, I had no idea how to do this. And it drove me F-ing crazy until like 2 or 3 in the morning. You think I’m kidding, but look at my fitbit sleep Friday night. That has nothing to do with a kid; I just couldn’t figure out how to do this.
So Friday I wake up at like 6 barely functional, get baby out the door, go to my only meeting of the day, and then I happened to be having lunch with Carson Sievert who was in Chicago for an R conference. So I mention this problem to him after we finished eating tacos, and he casually pulls out his laptop and shows me:
That’s it. That’s all you need to do to make that work. The full plot code is then:
draft2 %>% SharedData$new(~Team) %>% plot_ly(x = ~Rk, y = ~Year, color = ~Team, text = ~paste(Player, Team), colors = pal)
That’s it. Check out the plot it makes here!
Next what I wanted to do was add some convex hulls around the points. Apparently this is super easy to do too using geom_polygon and the chull function. Check out the convex hulls plot here. At first it looks like a mess, but double click on the legend on the right to choose what to add to the plot. For instance, below is a screenshot of the convex hulls of the final 4 teams in the NBA playoffs. What’s so notable about these teams is that Golden State, Cleveland and Houston have very similar shapes, indicating that their teams are made up of some very high draft picks from several years ago, but notably no high draft picks from very recently. But look at Boston. Totally different shape, mostly in the upper left corner (indicating high draft picks in very recent years). Go play around with that plot. It’s really interesting.
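The hull step on its own might look something like this. It is a sketch, not the code from the linked repository: `hull_points` is a hypothetical helper, the toy data frame is made up, and the column names `Rk`, `Year`, `Team`, and `Player` are borrowed from the plot_ly call above. It uses only `chull()` from grDevices, which ships with base R.

```r
# Return, for each team, only the rows that sit on the convex hull of that
# team's (draft position, draft year) points, in hull order.
hull_points <- function(df) {
  pieces <- lapply(split(df, df$Team), function(d) {
    d[grDevices::chull(d$Rk, d$Year), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Made-up example data: team A has four corner picks and one interior pick.
toy <- data.frame(
  Player = paste0("P", 1:8),
  Team   = c(rep("A", 5), rep("B", 3)),
  Rk     = c(1, 10, 10, 1, 5, 2, 4, 3),
  Year   = c(2000, 2000, 2010, 2010, 2005, 2012, 2012, 2015)
)
hulls <- hull_points(toy)
hulls
```

If ggplot2 is installed, a data frame like `hulls` can then be drawn with something along the lines of `geom_polygon(aes(x = Rk, y = Year, fill = Team), alpha = 0.3)`, since the rows for each team are already ordered around the hull.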
What I think I’d like to do next is to see how these convex hulls change over time for an individual team. Or if someone has some free time, you can take my code and modify it to do that. My full GitHub code for scraping the data and making the plots is here.
April 3, 2018 – Yankees vs Rays – Didi Gregorius: 6.58 RAA.bat, 4-4, BB, 3 R, 8 RBI, Double, 2 HR
- Bases Empty – Double
- Runners on 1st and 3rd – Homerun
- Runner on 1st – Walk
- Runners on 1st and 3rd – Homerun
- Bases Loaded – Single
April 13, 2018 – Angels vs Royals – Abraham Almonte: -3.09 RAA.bat, 0-5, K, 2 GIDP
- Bases Empty – Strikeout
- Runners on 1st and 2nd – Grounded into Double Play
- Runner on 3rd – Groundout
- Bases loaded – Groundout
- Runner on 1st – Grounded into Double Play
April 9, 2018 – Diamondbacks at Giants – Zack Godley: 4.32 RAA.pitch, 7 IP, 4 H, 9 K, 0 ER, 23 Batters Faced
- Lineout – K – K
- Single – K – GIDP
- K – K – K
- Groundout – K – K
- Single – Forceout – Single – Pop Out – Groundout
- Groundout – Groundout – K
- Single – Forceout – GIDP
April 7, 2018 – Marlins at Phillies – Dillon Peters: -7.56 RAA.pitch, 2.2 IP, 9 H, 9 ER, 3 BB, 3 K, 2 HR, 19 Batters Faced
- Walk – Single – Single – Walk – K – HR – Flyout – K
- Single – GIDP – Groundout
- K – Single – Single – Walk – HR – Single – Pop Out – Single
April 15, 2018 – Rockies vs Nationals – Michael Taylor: 1.59 RAA.br
- Walk – Advances to 2B on a sac bunt – Advanced to third on walk – Scores on passed ball
- Double to LF – Steals 3rd – Scores on passed ball
May 6, 2018 – Rockies vs Mets – David Dahl: -1.49 RAA.br
- Single – Steals 2B (Arenado then walks, Dahl gets no credit for the steal! This needs to be fixed!)
- Double – Thrown out trying to advance to third on a ground out to the shortstop.
One of the nice things about openWAR is that you can compute it over any time period and you can look at its individual components. Here I’ve looked at Mike Trout’s batting component in openWAR (raa.bat) over the course of the 2018 season. His best game so far in terms of hitting was on 4/8/2018, when he amassed 1.95 raa.bat by going 2-3 with a HR, 2 RBI, a walk and a strikeout. His worst game so far was worth -1.62 raa.bat on 3/29/2018, when Trout went 0-6 with a strikeout. As you can see, Trout started relatively slowly over the first week of the season, but since April 8, 18 of his last 23 games have been positive raa.bat. If he keeps up that kind of production, this kid might have a future in the major leagues…..