Last night, I decided to try to get some polling data, and it turns out the Huffington Post makes their polling data available in JSON format through a very easy-to-use API (GitHub code here).
This first plot uses the national polls of Trump vs Clinton. All polls that were conducted on “likely” or “registered” voters were included. Next I computed the weighted moving average of each of these polls using different moving average windows of 1, 2, 3, …, 21 days. I then plotted all of these curves on top of one another, with the width, transparency, and color of each line tied to how many days were included in its moving average: the more days included, the wider, more opaque, and redder/bluer the line. Finally, I plotted three different confidence bands using the 7-, 14-, and 21-day moving averages.
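Roughly, the moving-average step looks like this (a quick Python sketch of the idea; my actual code is in R on GitHub, and I'm assuming here that each poll is weighted by its sample size):

```python
import numpy as np

def weighted_moving_average(dates, values, weights, window_days):
    """For each date, average every poll inside the trailing window,
    weighting each poll (here, by its sample size)."""
    dates = np.asarray(dates)
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    out = []
    for d in np.unique(dates):
        in_window = (dates <= d) & (dates > d - window_days)
        out.append(np.sum(weights[in_window] * values[in_window])
                   / np.sum(weights[in_window]))
    return np.unique(dates), np.array(out)

# One curve per window width, 1 through 21 days; wider, more opaque,
# more saturated lines correspond to longer windows:
# for k in range(1, 22):
#     xs, ys = weighted_moving_average(poll_dates, clinton_pct, sample_sizes, k)
#     plt.plot(xs, ys, linewidth=k / 3, alpha=0.2 + k / 40)
```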
I then pulled out all the state polls that were available and computed the weighted average across all polls with “likely” or “registered” voters (I did not consider the timing of the polls). I then computed a mean and standard error for each of these estimates, randomly sampled from the resulting distributions for Trump and Clinton, and plotted these random samples. The wider the spread of the plotted points for a state, the fewer people have been polled there; so, for instance, Utah has had more polling than Idaho. The color is related to what percentage each candidate is receiving in the polls (redder for Trump and bluer for Clinton). I’ve also added lines with negative slope to show what share of support third-party candidates are receiving. If you follow the line y = x, the states closer to the origin are more receptive to third-party candidates. So, for instance, Utah and Idaho are giving a lot of support to third-party candidates, whereas Georgia and Florida are mainly voting for the two major candidates.
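The state-level step amounts to something like this (again a Python sketch of the idea rather than my R code; the weighting by sample size and the standard-error formula are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def state_cloud(pcts, sample_sizes, n_draws=200):
    """Weighted mean of a state's polls, plus random draws from a normal
    whose spread shrinks as more people are polled in that state."""
    w = np.asarray(sample_sizes, dtype=float)
    p = np.asarray(pcts, dtype=float)
    mean = np.sum(w * p) / np.sum(w)
    # crude standard error for a percentage, based on total people polled
    se = np.sqrt(mean * (100.0 - mean) / np.sum(w))
    return rng.normal(mean, se, size=n_draws)

# A heavily polled state (e.g. Utah) yields a tight cloud of points;
# a lightly polled one (e.g. Idaho) yields a wide cloud.
```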
“There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious. As we make clear in this section, they are just nonlinear statistical models, much like the projection pursuit regression model described above.” – Page 350, “The Elements of Statistical Learning”, Hastie, Tibshirani, Friedman.
JSM 2016 was last week in Chicago (the greatest city in the world). As always, it was awesome.
This year I was delighted to get to put together the first (hopefully annual!) JSM Data Art show. The inaugural show featured six artists: Marcus Volz, Alisa Singer, Craig Miller, Jillian Pelto, Elizabeth Pirraglia, and me (Gregory J. Matthews).
The first session I attended was also the session where I gave my talk on Sunday at 4. The title of the session was “For the love of the game” and featured speakers Dan Nettleton, Brandon LeBeau, Douglas VanDerwerken, Thomas Fullerton, and myself.
Dan Nettleton (with Dennis Lock) spoke about whether or not to go for it on 4th down in the NFL using a random forest model. They used this model to make 4th down decisions based on maximizing win probabilities. Details of their model can be found here and the full paper is here. The examples they used in their talk were from the 2016 AFC championship game where Denver beat New England. Based on their model, the Patriots could have increased their win probability by choosing to kick the field goal rather than go for it on 4th down in both instances.
(My raw notes from the talk: to go or not to go, a 4th-down analysis using random forests on N = 430,168 plays from 12 years of data. Belichick should have kicked both times the Patriots went for it on 4th down near the end of the AFC championship game against the Broncos. They also raised an interesting point: do you want to win more in the long run, or do you want to win THIS game?)
Brandon LeBeau (@blebeau11) spoke about hiring decisions in college football using an item response theory model. Based on this model and measuring success of a coach in terms of expectation of team strength, he concludes that Tim Brewster was a terrible hire for Iowa State. Also a terrible hire? Brady Hoke.
Douglas VanDerwerken spoke about soccer suspensions in the English Premier League (EPL). He found that fouls were reduced by 12% and 23% when players were facing a 1- and 2-game suspension, respectively. He also noted that they found evidence that players foul more often when refs give fewer cards, but also when refs give MORE cards. This may seem surprising at first, but I think what’s happening here is that when refs give fewer cards, players foul more because they know they can get away with it; and a ref who gives MORE cards is probably also calling more fouls. I think an interesting extension would be to treat the ref’s call like the result of a medical test and have a panel of experts decide whether something was actually a foul. Then you could better measure the actual rate of fouling and the actual rate of fouls being called; right now those two effects are confounded with each other. This might be tedious, and I’m not even sure you could get experts to agree on what is an objective foul.
Tom Fullerton spoke about college football ticket sales at UTEP. He found many interesting things, but I thought it was particularly interesting that, as a coach coaches more games, ticket sales on average go down.
I spoke about statistical disclosure with respect to Baseball Hall of Fame voting data. You can find my full slides here. My actual slides at JSM were a much more condensed version of this talk and I actually managed to finish my talk early. It is the first time I have ever finished a talk early.
On Monday morning, I showed up late to the session “Advanced Methods for Statistics in Sports“. I got there in time to see Andrew Swift talk about modeling sports outcomes using a ratio-based point differential. His slides can be found here. He also used something called the Farey sequence, which I spent some time reading about. Interesting. Scott Powers (@saberpowers) presented some work with all-stars Trevor Hastie and Robert Tibshirani. His method, nuclear penalized multinomial regression (npmr) (R package here), was applied to outcomes in baseball. They view the outcome of a plate appearance as a multinomial outcome, so multinomial regression is a natural choice in this setting. The key to their penalty is that, rather than using the Frobenius norm, they penalize the nuclear norm of the matrix of coefficients, which pushes the estimate toward low rank. The resulting estimates can then be interpreted as latent “skills”: instead of estimating 9 different outcomes, they are really estimating something like three latent tools or skills. For batters, the first skill is basically the “true” outcomes (i.e. plays where no fielder is involved): K, HR, BB, and HBP. The second skill is power, characterized by more fly balls and HR. The third batter skill is plate discipline (i.e. more walks and fewer K). For pitchers, there are also essentially three skills: contact avoidance (more K and BB), trajectory skills (more singles and ground balls), and command (more ground balls). Scott concluded that in the context of baseball this method is not significantly better than ridge regression; however, the resulting interpretation of the output of npmr is much more interesting than that of ridge regression.
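To sketch the low-rank idea (this is just the generic soft-thresholding step used for nuclear-norm penalties, not the npmr package’s actual fitting routine):

```python
import numpy as np

def soft_threshold_singular_values(B, lam):
    """Proximal step for a nuclear-norm penalty: shrink every singular
    value of the coefficient matrix B by lam, zeroing the small ones.
    This drives B toward low rank, i.e. a few latent 'skills'."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return U @ np.diag(s_shrunk) @ Vt
```

Iterating a step like this inside a gradient method is a common way to fit nuclear-norm-penalized regressions; the singular directions that survive the shrinkage play the role of the latent skills.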
(Weird story: I met Powers at SABR, didn’t realize he was the npmr author, and I had used npmr for a completely unrelated application (fossilized tooth classification). I then emailed him a question about the package, and he responded very quickly with the answer. Then, a few weeks later, when I was looking at the JSM program for talks I wanted to see, I realized that he was the author of npmr and that I had already met him. #smallWorldStats)
Next, Sameer Deshpande spoke about pitch framing using hierarchical Bayesian logistic regression (the slides are here) on PITCHf/x data from 2011-2015. In aggregate, the best catchers in terms of framing are Montero, Zunino, Lucroy, Rivera, and Rene. However, these results are related to the number of opportunities a catcher gets. When this is removed you get a statistic, which they refer to as SAFE2. The best catchers in terms of SAFE2 are Rivera, Conger, Vazquez, Montero, and Grandal. After that you have players like Zunino, Maldonado, Stewart, Martin, and Butera.
Stephanie Kovalchik (@statsonthet) was supposed to speak in this session, but her flight from Australia (!) was delayed and she wasn’t able to make her talk. However, you can find the slides from her talk here.
Monday night I was supposed to go to the UConn alumni dinner, but I never made it as I went to the Statistics in Sports section meeting before and just couldn’t get my act together.
Optimizing Fanduel in R
I want to give a little background on my experience with fantasy football first. If you’re just looking for the algorithm just skip down to the section titled The Algorithm.
I like Chicago Bears football. Even during rough seasons like this one. I don’t really care that much about other football teams. So when my friends first invited me to play fantasy football back in 2006, I bought a $5 magazine with rankings of every player. I didn’t do well that year. So the next year I decided to create a drafting algorithm. At the time I was just getting started in R, so I wrote it in Excel. I won the championship that year.
I quit playing for a few years when grad school was intense and recently un-quit. I rewrote the majority of the program in R last year and completed the transition this year. I’ve won my division both years! I’ve been tempted to make it into…
Sunday, July 31, 2016
12pm: Setup JSM 2016 Data Art show.
4-6pm: “For the Love of the Game: Applications of Statistics in Sports”—Contributed CC-W184d
Monday, August 1, 2016
“Advanced Methods for Statistics in Sports”— Contributed CC-W175a
“Contributed Poster Presentations: Section on Statistics in Sports” —Contributed CC-Hall F1 West
- Rating Offensive Production in Baseball: A Summer with the Martha’s Vineyard Sharks—Jesse McNulty, William Penn High School Sports Analytics Club; Tyler Schanzenbach, William Penn High School Sports Analytics Club
“SPEED: Statistics in Government and Engineering, Part 2B”—Contributed CC-Hall FWest
- Using Scores to Identify Small Cells in Tables for Disclosure De-identification
DataFest Meeting, Hilton
“Statistical Foundations of Data Privacy”—Invited CC-W195
Section on Statistics in Sports Meeting. The Scout.
UConn Statistics Department Alumni Dinner
So I’ve been following some PredictIt.org markets lately, most notably the Trump vs Clinton market. Clinton’s shares are currently trading in the $0.65-$0.68 range, which means traders believe there is roughly a 65%-68% chance that Clinton wins in November. PredictIt.org also has state-by-state markets for the electoral college for around 30 states. So I wondered what would happen if I took these individual state markets and used them to predict the outcome of the electoral college in November.
PredictIt.org has a nice API platform that allows for easy access to their data, and I wrote an R script to get all of the closing prices for the Trump vs Clinton state-by-state markets (code is on GitHub; if you have a better way to parse the information that comes back from the predictit.org API, I would like to know it. I hate how I did it, but it seems to work fine). I then used the individual market prices as approximate probabilities that Clinton or Trump would win each state. When a market wasn’t available, I used a probability of 0 or 1 depending on who won the state in the 2012 election. (These are states like Idaho and Connecticut; we know what’s going to happen there.)
Based on these state-by-state markets (as of July 29, 2016), I simulated the election 100,000 times and estimate that Clinton has an 83.16% chance of winning, Trump has a 16.21% chance, and there is a 0.64% chance of a tie. (I will try to update this as often as possible this fall, though with a baby on the way in September (!!!) we’ll see how often that actually happens.) 16% sure is a scary percentage, though, if you are absolutely petrified (as I am) of a Trump presidency. In Texas Hold ’em, 16% is about the probability of a pair showing up on the flop, of making a full house or better by the river after flopping two pair, or of flopping a gut shot and hitting it by the river. I’ve seen those happen way too many times to sleep comfortably at 16%.
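The simulation itself is nothing fancy; here is a stripped-down Python version of the idea (my actual script is in R on GitHub, and the probabilities and electoral votes below are made up for illustration):

```python
import numpy as np

def simulate_election(p_clinton, ev, n_sims=100_000, seed=1):
    """p_clinton: per-state probability that Clinton wins (market price).
    ev: per-state electoral votes. Returns win/tie proportions."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p_clinton, dtype=float)
    votes = np.asarray(ev)
    total = votes.sum()
    # draw every state in every simulated election at once
    clinton_wins_state = rng.random((n_sims, p.size)) < p
    clinton_ev = clinton_wins_state @ votes
    return {
        "clinton": np.mean(clinton_ev > total / 2),
        "trump": np.mean(clinton_ev < total / 2),
        "tie": np.mean(clinton_ev == total / 2),
    }

# e.g. three hypothetical states:
# simulate_election([0.9, 0.5, 0.1], [10, 6, 10])
```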
Clinton’s probability here is interesting because the state-by-state markets have her at over 80%, but the overall market (Trump vs Clinton) has Clinton at under 70%. I personally believe the 80+% number (for the purposes of keeping my sanity) and think the Clinton vs Trump market is way underpriced for Clinton. (I have invested accordingly.) This 80+% number is also in line with the Bayesian model from the Princeton Election Consortium, which currently has Clinton at 85%. (Their random drift model has Clinton at 65%.)
FiveThirtyEight currently has their polls-only model at 50.1% for Trump, and their polls-plus forecast model has Clinton at 60.1%.
I think these numbers are way too low for Clinton, and I think that, in general, FiveThirtyEight’s predictions are much too conservative in that they tend to be too close to 50%-50%. I wrote a blog post before the 2012 election about how many states we should expect Silver to get wrong; there was only about a 5% chance that Silver would get every state correct. Based on his state-by-state probabilities, there was a very large chance he would get at least 1 state wrong. But Silver actually got all the states correct. Based on his results in 2012 and his current election predictions in 2016, I think Silver is too conservative in his election forecasts, and that he’s actually even better at this than his probabilities show. (The Princeton Election Consortium had the probability of an Obama win at 99% on November 1, 2012.)
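To see why a perfect map was unlikely, the arithmetic is just a product of per-state probabilities. With hypothetical numbers (not Silver’s actual 2012 confidences), say 40 safe states called at 99% and 10 tossups at 80%:

```python
# hypothetical per-state probabilities of a correct call
p_correct = [0.99] * 40 + [0.80] * 10

p_all_correct = 1.0
for p in p_correct:
    p_all_correct *= p          # chance of getting every state right

expected_misses = sum(1 - p for p in p_correct)  # expected number of wrong states
```

Even though every individual call is likely, under these made-up confidences the chance of running the table is only about 7%, and you’d expect to miss a couple of states.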
Next I wanted to look at win probabilities given that a candidate wins a specific state. In order to do this, I took the state-by-state win probability vector and, one at a time, flipped each state to a probability of 1 for a given candidate, leaving all the other probabilities alone. I then simulated the election 50,000 times to look at Clinton and Trump win probabilities; the tables below are ordered by the single state that gives the candidate the highest probability of winning the election. For Clinton this state is Texas. With its 38 electoral votes, if Clinton can win Texas (extremely unlikely, but possible; the Houston Chronicle just endorsed her) she has about a 97% chance to win the election. The next four states for Clinton are Florida, Georgia, Tennessee, and Indiana; winning any of these states gives her at least an 89% chance to win the election.
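The conditioning itself is just: clamp one state’s probability to 1 and re-run the same simulation. A self-contained Python sketch with made-up numbers (my real version used the PredictIt prices):

```python
import numpy as np

def clinton_win_prob(p_clinton, ev, n_sims=50_000, seed=2):
    """Share of simulated elections where Clinton tops half the electoral votes."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p_clinton, dtype=float)
    votes = np.asarray(ev)
    clinton_ev = (rng.random((n_sims, p.size)) < p) @ votes
    return np.mean(clinton_ev > votes.sum() / 2)

def win_prob_given_state(p_clinton, ev, state_idx):
    """Flip one state to a sure Clinton win, leave the rest alone, re-simulate."""
    p = np.asarray(p_clinton, dtype=float).copy()
    p[state_idx] = 1.0
    return clinton_win_prob(p, ev)
```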
(Table: probability of a Clinton win, a tie, and a Trump win, given that Clinton wins each state.)
For Trump, the states that would raise his win probability the most are California and New York, both very blue states that Trump, for some reason, believes he has a shot at in November. If Trump could somehow manage to win California, his win probability jumps to 70% with a massive haul of 55 electoral votes. This is a very unlikely scenario, but nothing is impossible in this election. After New York, the next states for Trump are Illinois, Florida, Pennsylvania, New Jersey, and Michigan. If Trump can win any of these states, his chances of winning go up to at least 23%, still a big underdog.
(Table: probability of a Trump win, a tie, and a Clinton win, given that Trump wins each state.)
Finally, I looked at what would happen if Clinton and Trump won different combinations of the three swing states Ohio, Florida, and Pennsylvania.
If Clinton wins all three of these states, she has about a 97% chance to win the election. If she wins Florida and Ohio or Florida and Pennsylvania her chances are about 95% and if she wins Ohio and Pennsylvania she would be at about 93%.
If Trump wins Ohio and Pennsylvania he has about a 35% chance to win, if he wins Florida and Ohio he has about a 37% chance to win, and if he wins Florida and Pennsylvania he has about a 43% chance to win. So even if he were to win 2 out of 3 of these states, he would still be an underdog. What happens if he wins all three? If Trump wins Florida, Ohio, and Pennsylvania, he has about a 60% chance to win the election. Think about that. For Trump to win Florida, Ohio AND Pennsylvania, things would have to go absolutely perfectly for him, and there would still be a 40% chance that he loses the election. Trump has a YUGE (sorry) electoral college problem.
I want to write interesting things like Mike Lopez when I grow up.
I’ve long had my eye out for intriguing papers that cover my two favorite areas of research, causal inference and sports statistics. For unfamiliar readers, causal inference tools allow for the estimation of causal effects (i.e., does smoking cause cancer) in non-experimental settings. Given that almost all sports data is inherently observational, there would seem to be opportunities for applied causal papers to answer questions in sports (here’s one).
It was with this vigor in mind that I read a paper, “The Midweek Effect on Performance: Evidence from the Bundesliga,” recently posted and linked here. The authors, Alex Krumer & Michael Lechner – the latter of whom has done substantial causal inference work – use propensity score matching to estimate the effect of midweek matches on home scoring.
The authors conclude that:
Playing midweek leads to an effect of about half a point in total, resulting from the home…
At the beginning of my post-doc, my adviser told me that she wanted me to write an R package. I told her I didn’t know how to do that, and she told me to figure it out. It’s really not that difficult, and it’s incredibly rewarding. Even if you never put your package on CRAN, it’s a worthwhile exercise to see how R packages are made. And who knows, maybe you’ll be the next Hadley Wickham…
I have long wanted to write R packages. But for some reason, I thought that I wasn’t capable of such a feat. “R package authors are super stars, I’m just me,” I would think in a mix of despair and admiration, marveling at some new and useful R package, and wishing I could create such functional beauty.
But now the day I have long dreamed of has arrived! I have authored an R package, and it is perhaps the most satisfying feeling I have ever had using R. Trust that I have had many R-related feelings, so I do not make this statement lightly. On top of now holding myself in higher regard, I am also wondering what the heck took me so long?
I recently wrote an article about Todd Frazier’s stolen bases for BP Southside, the Baseball Prospectus White Sox site, and in doing so did a decent amount of digging into the different advanced measures of base-stealing productivity—something that would take into account all the necessary components and spit out a measure of runs saved or lost. I got frustrated by a few things, and so decided to type this up, as I think it encapsulates a lot of issues in public sports analysis. All of this is written from a baseball perspective, but it applies at least as much to hockey and probably even more to basketball.
Before I start I want to say that the various sports stats sites strike me in many ways as emblematic of the promise of the internet. Vast amounts of cross-indexed information that can be used with minimal technical abilities, and synthesizing months and years of work done often…