Author Archives: statsinthewild
I attended every single JSM for 8 straight years from 2010 – 2017 (Vancouver, Miami, San Diego, Montreal, Boston, Seattle, Chicago, Baltimore). Unfortunately, I missed last year in Vancouver due to the birth of my second child, but I suppose it was worth it (Benji just turned 1!). This year I return to JSM in Denver for what would have been my 10th in a row had I not missed last year, and I am very excited.
I’ll be in Denver from Sunday afternoon through Friday evening (I am staying an extra day for the Rocky Mountain Symposium on Analytics in Sports hosted by the University of Denver). I’ve been working out my schedule for the week, and here’s what I have so far. If there is something you think I’d like that’s not listed here, please let me know.
And as always, don’t forget to check out the data art show!
4 – 5:50pm: Session Number: 79, Location: CC-502, “Functional Data Analysis: Methods and Applications— Contributed”
I’m giving a talk in this session related to the work I’ve been doing on my NSF grant about classification of partially observed shapes with applications to biological anthropology. You can view my slides here: http://rpubs.com/statsinthewild/JSM2019_slides
6:05 – 7pm: 2019 JSM Public Lecture—Invited ASA 6:05 p.m. Data Tripper: Distinguishing Authorship of Beatles Songs through Data Science – Mark Glickman
I believe this is the first time they are ever doing a public lecture like this. Anyone can attend even if you haven’t signed up for the conference. This is a great way to reach out to the public and people who don’t want to pay to attend JSM to see a really interesting talk on data science. And getting Mark Glickman to do the first one is a great selection. He’s a super interesting speaker and this topic is super interesting.
8:30 – 10:20am: Session Number: 98, Location: CC-607, “The Multiple Adaptations of Multiple Imputation— Invited”
This session has Jerry Reiter, Trivellore Raghunathan, and Donald Rubin (Rubin is my adviser’s adviser’s adviser……). I spent 4 years of my life reading approximately a billion papers by these three. If you are at all interested in synthetic data and/or multiple imputation, you should definitely check this out. I know it’s at 8:30 in the morning. But if I can make it, you can make it too. And besides. It’s Mountain Time. So it’s like 10:30 if you live on the east coast or 9:30 if you live in flyover country like me.
9 – 11am: Location: Room H- Mineral Hall A at the Hyatt Regency Denver, Data Fest Meeting
So I know that I just said how awesome that 8:30am session is going to be, but I’ll be missing it because the Data Fest meeting is from 9am-11am. This is where I’ll be during that time. (Side note: Loyola just had its fourth annual DataFest this past Spring. I can’t believe I’ve done it 4 times!)
1pm-2pm: Research meeting with some colleagues. Talking about incomplete shapes!
4 – 6pm: Session Number: 261, Location: CC-Four Seasons 2-4, ASA President’s Invited Address—Invited JSM Partner Societies, “Coming to Our Census: How Social Statistics Underpin Our Democracy (And Republic),” Teresa A. Sullivan, University of Virginia
5 – 6pm: Section of Statistics in Sports mixer, Location: Rock Bottom
I go to this every year. Stat nerds, Sports, Beer. It’s where I belong.
8:35 – 10:20am: Session Number: 292, Location: CC-709. “Providing Access to Useful Data While Preserving Confidentiality—Topic Contributed Survey Research Methods Section, Government”
My adviser, Ofer Harel, will be presenting some work that we did together a few years ago about privacy and ROC curves.
10:35 – 11:50am: Session Number: 344, Location: CC-506. “Expanding Data Utility – Issues in Disclosure and Modeling—Contributed”
I’m particularly interested in the talk at 11:05am “Using Generative Adversarial Networks to Generate Synthetic Population”. The first time I ever read about GANs my immediate thought was: Synthetic Data! I’m interested to see what they are doing here.
10:35 – 11:50am: Session Number: 323, Location: CC-708. “Causal Inference in Sports Statistics—Invited”
I’m going to try to get to this one for the last two talks starting at 11:25.
Evening: Rockies vs Dodgers
Wednesday – Friday Schedule Coming soon.
And for some reason, I thought that would look really cool rotated 90 degrees. So after a few hours of playing around with win probabilities (and smoothing), I came up with these. These are the win probabilities for every NFL team for the 2018 season. I think these look awesome, and I’ll probably bring a few of these with me to the JSM Data Art show.
I particularly like New Orleans, Jacksonville, and New England:
(Wouldn’t this look nice in your office?)
If you are interested in ordering one of these, DM me at @statsinthewild.
No need to DM me; you can order them now here:
I want to start with this plot:
This plot shows all the 1 and 2 seeds and their odds of making it to each round relative to Duke. I think the most interesting parts of this plot are where they cross. So for instance, Gonzaga is more likely to make the Final 4 than Virginia, but Virginia is more likely to make the Finals than Gonzaga. That’s because Gonzaga will likely have to play Duke to get to the Finals. Whereas Virginia will likely only (ONLY) have to play North Carolina/Kentucky to get to the Finals.
Another example is that Tennessee is more likely to make the Sweet Sixteen than Michigan State, but Michigan State is more likely to make the Elite 8 than Tennessee. It then switches back again and Tennessee is more likely to make it to the Final 4, Finals, and win it all than Michigan State. So what is going on here? It’s Duke again. In order for Michigan State to get to the Final Four, they’ll probably have to beat Duke.
Zion Williamson is really good. Thus, Duke is really good.
Now here is some more fun stuff:
Most likely double digit seeds to win first round game
(10) Seton Hall over Wofford – 41.25%
(10) Florida over Nevada – 40.23%
(10) Minnesota over Louisville – 32.5%
(11) Arizona State over Buffalo – 32.5% (given they win the play in game)
(11) St. Mary’s over Villanova – 28.35%
(11) Ohio State over Iowa State – 27.84%
(11) Belmont over Maryland – 25.52%
(12) Oregon over Wisconsin – 24.95%
(12) Murray State over Marquette – 24.1%
(12) Liberty over Mississippi State – 21.52%
(14) Georgia State over Houston – 17.6%
(14) Yale over LSU – 17.35%
(13) Northeastern over Kansas – 13.95%
(13) UC Irvine over Kansas State – 13.6%
(13) St. Louis over Virginia Tech – 13.6%
(13) Vermont over Florida State – 12.8%
Biggest toss-up first round games
(10) Iowa over (7) Cincy – 50.38%
(8) Mississippi over (9) Oklahoma – 50.65%
(9) UCF over (8) VCU – 50.69%
(8) Syracuse over (9) Baylor – 51.97%
Most likely to make it to the Sweet 16
(1) Duke – 90.67%
(1) Gonzaga – 83.71%
(1) Virginia – 83.6%
(1) North Carolina – 83.34%
(2) Kentucky – 79.82%
(2) Tennessee – 76.61%
(2) Michigan State – 74.27%
(3) Texas Tech – 69.74%
(3) Purdue – 66.27%
(2) Michigan – 64.83% (They are going to have to beat either Nevada or Florida. Both are tough opponents for a 2 seed in the second round.)
Most likely double digit seed to make Sweet Sixteen
(10) Florida – 11.53%
(10) Iowa – 11.27%
(11) Ohio St – 11.22%
(12) Oregon – 8.15%
(11) Belmont – 7.32%
(10) Seton Hall – 7.02%
Most likely 5 seed or higher to make Final Four
(5) Auburn – 8.87%
(6) Iowa State – 7.41%
(5) Wisconsin – 4.96%
(7) Nevada – 3.62%
(5) Mississippi State – 3.55%
Most likely double digit seed to make Final Four
(10) Florida – 1.04%
(10) Iowa – 1.00%
(11) Ohio St – 0.71%
(10) Seton Hall – 0.033%
Most likely to win the tournament
(1) Duke – 21.07%
(1) Virginia – 15.94%
(1) Gonzaga – 15.78%
(2) Kentucky – 8.16%
(1) North Carolina – 7.96%
(2) Tennessee – 6.74%
(2) Michigan St – 5.4%
(2) Michigan – 3.87%
(3) Purdue – 3.28%
(3) Texas Tech – 2.88%
(5) Auburn – 1.09%
(6) Iowa State – 0.91%
(4) Kansas – 0.86%
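The plot at the top of this post shows each team's chances relative to Duke. Here is a minimal sketch of that kind of calculation using the title probabilities listed above. It assumes "relative to Duke" means a simple ratio of probabilities; the plot itself may have used a different scale (a log ratio or an odds ratio, for example), and the function name `relative_to` is just for illustration.

```python
# Title probabilities (in percent), copied from the list above.
title_pct = {
    "Duke": 21.07,
    "Virginia": 15.94,
    "Gonzaga": 15.78,
    "Kentucky": 8.16,
    "North Carolina": 7.96,
    "Tennessee": 6.74,
    "Michigan St": 5.4,
    "Michigan": 3.87,
}


def relative_to(baseline, probs):
    """Divide every team's probability by the baseline team's probability.

    Assumption: a plain ratio of probabilities, not an odds ratio.
    """
    base = probs[baseline]
    return {team: p / base for team, p in probs.items()}


rel = relative_to("Duke", title_pct)

# Print teams from most to least likely, relative to Duke (Duke = 1.00).
for team, ratio in sorted(rel.items(), key=lambda kv: -kv[1]):
    print(f"{team:15s} {ratio:.2f}")
```

Running the same function on each round's probabilities would give one line per team across rounds, which is where the crossings show up.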
Well, just like last year, I’m predicting that Loyola Chicago won’t make the Final Four. But I’ll probably be wrong again somehow…….. Anyway, here are some thoughts about the NCAA tournament from a guy who hasn’t paid attention all year until last week.
Here is how I CURRENTLY rank the tournament teams. This is based on heavily weighting recent performances of teams. I’ll have more later, but for now I hope Harrel enjoys these. Now I have to do “real work”……..
- Gonzaga (1)
- Duke (1)
- Virginia (1)
- Kentucky (2)
- North Carolina (1)
- Tennessee (2)
- Michigan State (2)
- Michigan (2)
- Purdue (3)
- Texas Tech (3)
- Nevada (7)
- Kansas (4)
- Auburn (5)
- Iowa St (6)
- Florida St (4)
- Wisconsin (5)
- Virginia Tech (4)
- Houston (3)
- Villanova (6)
- LSU (3)
- Kansas St (4)
- Maryland (6)
- Mississippi St (5)
- Louisville (7)
- Buffalo (6)
- Marquette (5)
- Cincinnati (7)
- Syracuse (8)
- Iowa (10)
- Florida (10)
- Utah St (8)
- Baylor (9)
- Mississippi (8)
- Oklahoma (9)
- Ohio State (11)
- Washington (9)
- Oregon (12)
- Wofford (7)
- UCF (9)
- Minnesota (10)
- St. John’s (11)
- VCU (8)
- St. Mary’s (CA) (11)
- Arizona St (11)
- Seton Hall (10)
- Belmont (11)
- Temple (11)
- New Mexico St (12)
- Murray St (12)
- Yale (14)
- Northeastern (13)
- Liberty (12)
- Vermont (13)
- St. Louis (13)
- UC Irvine (13)
- Montana (15)
- Old Dominion (14)
- N Kentucky (14)
- Georgia St (14)
- Colgate (15)
- Bradley (15)
- Iona (16)
- N Dakota St (16)
- Gardner-Webb (16)
- Fairleigh Dickinson (16)
- Prairie View A&M (16)
- NC Central (16)
- Abilene Christian (15)
Round of 32
- Miss St
- Virginia Tech
- Michigan St
- Florida St
- Texas Tech
- Ole Miss
- Kansas State
- Utah St
- Iowa St
- Virginia Tech
- Michigan St
- Florida St
- Texas Tech
- Iowa St
- Michigan St
Pick: Patriots 27-26
Spread: Rams +2.5
Total: Under 56.5
Unrelated to this post: What time does the Super Bowl start? 5:30pm Central.
Moving on, below is a 2d histogram of frequencies of the last digits of the final score of every NFL game from 1920 through last year’s Super Bowl.
If I only use games from 2000 through the 2018 Super Bowl the 2d histogram looks like this.
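For anyone who wants to build this kind of grid, here is a rough sketch of the counting step. It assumes the two axes are the last digit of the winner's score and the last digit of the loser's score, which may not match how the plots above were laid out, and the handful of scores below are a made-up toy input, not real game data.

```python
from collections import Counter


def last_digit_counts(games):
    """Count (winner last digit, loser last digit) pairs.

    `games` is an iterable of (winner_score, loser_score) tuples.
    Returns a 10x10 list of lists: rows are the winner's last digit,
    columns are the loser's last digit.
    """
    counts = Counter((w % 10, l % 10) for w, l in games)
    return [[counts[(i, j)] for j in range(10)] for i in range(10)]


# Hypothetical toy input for illustration only (not actual results).
games = [(27, 24), (17, 14), (20, 17), (31, 24), (13, 10)]

grid = last_digit_counts(games)

# Print the grid as a plain-text table of counts.
for row in grid:
    print(" ".join(f"{c:2d}" for c in row))
```

The resulting 10x10 grid can be handed to any heatmap function to draw the 2d histogram; restricting `games` to a range of seasons gives the second version.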
Here are my picks for the NFL playoffs. Also, Ravens-Chargers shouldn’t be a first round game.
Texans over Colts, 26-22
Chargers over Ravens, 22-21
Cowboys over Seahawks, 21-20
Bears over Eagles, 27-18
Chiefs over Chargers, 28-25
Patriots over Texans, 24-21
Saints over Cowboys, 30-18
Rams over Bears, 29-22
Chiefs over Patriots, 29-27
Saints Over Rams, 29-27
Saints over Chiefs, 30-27
The Bears just clinched the NFC North for the first time in, I want to say, 100 years, by beating the Green Bay Packers last weekend at Soldier Field. Their week 15 meeting was the second time these division rivals have played this season. Their first meeting came way back in week 1, when Chicago blew a 20 point lead and looked well on their way to a 5-11 season, while Aaron Rodgers looked like Superman. But that was a long time ago and everyone seems to have caught up to the idea that the Bears are good this year and Green Bay is not. And you can see this in the spreads for the two games.
In the first meeting at Lambeau Field in week 1, the Packers were 6.5 point favorites over the Bears, who covered despite losing in crushing fashion. Fourteen weeks later, the spread for the Bears-Packers game at Soldier Field was Bears -5.5. That is a shift of 12 points.
Now some of this has to do with home field advantage. If two teams were essentially equal on a neutral field, you’d expect this difference to be about 6-ish (-3 at home and +3 away). But 12 seemed rather large to me, and I wondered if that was the largest shift in spreads in a rematch this year. While it is not, in fact, the largest, it is close. There were two matchups that had a larger shift in spreads. Stop reading and try to guess what those match-ups were.
Ok. You ready now? The largest difference this year was the Titans and Jaguars. In week 3, back in September, the home Jaguars were favored by 10 over the Titans. In week 13, the Titans at home were favored by 5.5, for a 15.5 point swing. Coming in at number 2 was Atlanta and New Orleans. In their first meeting in week 3, the Falcons were favored by 2. In week 11, the Saints were favored by 12.5. The previously mentioned Bears and Packers came in at 3rd largest with a shift of 12.0.
The only other two double digit shifts were Buffalo-NY Jets and Dallas-Philadelphia. The Bills vs Jets shift happened in only 4 weeks. In week 10, the Jets were favored by 7, then in week 14 the Bills were favored by 4.5. Rounding out the top five were the Cowboys and Eagles. In week 10, the Eagles were favored by 7 points, then by week 14 the Cowboys were favored by 3.5. Here is the list of all of the shifts of at least 5:
- Jacksonville – Tennessee: 15.5
- Atlanta – New Orleans: 14.5
- Chicago – Green Bay: 12.0
- Buffalo – New York Jets: 11.5
- Dallas – Philadelphia: 10.5
- Kansas City – LA Chargers: 7
- Dallas – Washington: 6
- Miami – New York: 6
- San Francisco – Seattle: 5.5
- Baltimore – Cincinnati: 5.5
- Cleveland – Pittsburgh: 5.5
- LA Chargers – Oakland: 5.0
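The arithmetic behind the top five shifts can be sketched as below. Each spread is written from the first-listed team's point of view (negative means that team was favored), using the numbers quoted in the text. The 6 point home field baseline is the rough "-3 at home and +3 away" figure from above, and applying it to all five pairs assumes the home team flipped between meetings in each case.

```python
# (first meeting, rematch) spreads from the first-listed team's perspective.
# Negative = that team favored. Values come from the spreads quoted above.
rematches = {
    "Jacksonville - Tennessee": (-10.0, 5.5),
    "Atlanta - New Orleans": (-2.0, 12.5),
    "Chicago - Green Bay": (6.5, -5.5),
    "Buffalo - New York Jets": (7.0, -4.5),
    "Dallas - Philadelphia": (7.0, -3.5),
}

# Rough shift expected from home field alone (-3 at home, +3 away).
HOME_FIELD_SWING = 6.0


def spread_shift(first, second):
    """Absolute change in the point spread between the two meetings."""
    return abs(second - first)


shifts = {name: spread_shift(a, b) for name, (a, b) in rematches.items()}

# Print raw shift and the part not explained by the home field baseline.
for name, shift in sorted(shifts.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} shift={shift:4.1f}  beyond home field={shift - HOME_FIELD_SWING:4.1f}")
```

Keeping both spreads on one team's side of the ledger is what makes the subtraction work; mixing "home team" and "favorite" conventions is the easy way to get the sign wrong.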
I’ll follow up on this when the season ends, and I also want to go back and look at past seasons.
The Ringer published an article today entitled “The NFL’s Analytics Revolution Has Arrived” by Kevin Clark. The first section of the article is a relatively interesting overview of the state of advanced analytics in the NFL. But then everything goes downhill. And where does it start to go downhill? Right here:
“It is amazing,” Warren Sharp said, “how many teams anonymously follow me on Twitter.” Sharp is an engineer with his own analytics site and has been playing around with football statistics for about 20 years. He is among the top minds in football not working full time for a team.
Ok. First of all, why does this read like a press release promoting Warren Sharp? Second, let’s talk for a second about who Warren Sharp is. You might remember him from this blog post (which was picked up by Slate, the Wall Street Journal, and Huffington Post) about how “The New England Patriots Prevention of Fumbles is Nearly Impossible”. It turns out that the analysis was highly flawed. A colleague and I detailed the problems with the “analysis” over at Deadspin, and Neil Paine over at FiveThirtyEight.com did a great job summarizing the whole kerfuffle.
Sharp then basically claimed that he had been redeemed by the Wells Report, but that was not true either. In fact, in 2015 immediately after the league implemented stricter ball handling procedures to prevent potentially deflating footballs, the Patriots still had the lowest fumble rate in the league. As Mike Lopez explains in Sports Illustrated:
In any case, the 2015 season makes for an excellent out-of-sample test with respect to New England’s fumble tendencies. Although the Patriots have been accused of going crazy lengths to gain a winning edge, it seems safe to assume that any suspect ball routine could not have been a part of the game-day preparation process this season. (The NFL implemented new procedures for inspecting game balls.) As a result, if one initially made the link between the Patriots low fumble rates and deflated footballs, the natural follow-up would be to assume that New England’s fumble rates would revert toward the league average in 2015.
So what happened in 2015?
• The Patriots had the fewest fumbles of any NFL offense.
• The Patriots had the best fumble rate of any NFL offense.
• The Patriots had one of their best fumble rates of the past decade.
Based on only this, it is my opinion that Warren Sharp is really not that great of a statistical analyst. And look, I make mistakes. Everyone makes mistakes. It’s basically impossible to do statistics without ever making a mistake. Humans are human after all. But what bothers me so much about Sharp is that he just seems to ignore the legitimate criticisms and doubles down.
But wait, there is more! In addition to this, Warren Sharp is a tout. While The Ringer generously promotes his site, Sharp Football Stats, they don’t seem to mention his other site, Sharp Football Analysis, where Sharp sells football picks to gamblers. (You can buy a season long membership for the low, low price of $250….) According to Sharp, his record, shown below, is a 59% winning percentage over 12 years, with a whopping 77% win percentage in Overs (which is somehow different than “Over Leans”).
When something seems too good to be true, it usually is. There is absolutely no way he’s correctly picked 59% of games against the spread over the course of 12 years. And here’s how you can tell this isn’t real: If he was picking 59% correctly over the course of 12 years, he wouldn’t be selling the picks. He wouldn’t need to because he’d be extremely wealthy and wouldn’t need your $250 membership fee. There are a few very good professional gamblers, but you’ve probably never heard of them (like Bill Benter, for example), and they certainly wouldn’t be selling their picks if their picks were any good because they could be making way more money betting on them (Benter made a BILLION dollars….with a “B”!). So his numbers are probably not the most truthful……
In fact, Game Advisers, which tracks handicappers’ plays, has Warren Sharp as 16-23-1 for a negative 23.41% ROI. Not quite the same as what Warren claims.
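To put a number on why a sustained 59% would be so remarkable, here is a back-of-the-envelope sketch. It assumes standard -110 pricing on spread bets (risk 110 to win 100), flat bet sizes, and no pushes; none of those details come from Sharp's posted record, and the function names are just for illustration.

```python
def break_even_win_rate(american_odds=-110):
    """Win rate needed to break even when laying negative American odds.

    Assumption: -110 is the typical price on an NFL spread bet.
    """
    risk = -american_odds  # amount risked to win 100
    return risk / (risk + 100)


def roi_per_bet(win_rate, american_odds=-110):
    """Expected profit per unit risked, ignoring pushes."""
    payout = 100 / -american_odds  # profit per unit risked on a win
    return win_rate * payout - (1 - win_rate)


be = break_even_win_rate()
roi = roi_per_bet(0.59)

print(f"break-even win rate at -110: {be:.1%}")
print(f"expected ROI per bet at a 59% win rate: {roi:.1%}")
```

Under these assumptions the break-even rate is about 52.4%, and a true 59% bettor would expect to earn roughly 12.6% on every dollar risked, bet after bet. Anyone who could actually do that for 12 years would have little reason to sell picks.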
Also, apparently, he pissed someone off enough for them to start http://sharpfootballanalysistruth.blogspot.com. The blog has exactly one post:
One of the links in that blog post links to an entire thread about how Warren Sharp is a scam. A poster named Dr. H refers to him as a “sleazeball hack”……..his words, not mine.
And finally, a public service announcement from one of the covers.com forums:
So anyway, my point is that Sharp is a tout who does, at best, sloppy statistical analysis. And yet these major media outlets are touting (see what I did there…?) him as this genius. He’s not.
Anyway, back to that quote from The Ringer article. That paragraph continues:
In fact, when you talk to people inside the league, some think he might be the top mind, period. Though he’s been writing on the internet for many years, he said it wasn’t until 2018 that teams started reaching out to him to discuss analytics. He says he’s heard from at least five and has done work as a consultant.
While I haven’t personally asked anyone I know who works for an NFL team, I would bet everything I own that exactly 0% of the data scientists/statisticians working for NFL teams would consider this guy to be the “top mind, period“. And if I’m wrong about that, I can just take a page out of Warren Sharp’s playbook and lie about my record……..
P.S. They also mention my old friend Bill Barnwell (who is still blocking me on Twitter) in this article. I actually enjoy reading Barnwell’s stuff, but he also wrote this article once, which was a really poorly done statistical analysis for Grantland. You can read all about the shortcomings of that analysis here and here.
In my original post I fixed a few parts of the code (the white bishop was missing…derp) and I made the border lines thicker. I’ve also found that these look way better when only using the first 40-50 moves or so. Beyond that they get really boring. So here are all 12 games and the 3 tie break games using only the first 50 moves (25 white, 25 black):