Author Archives: statsinthewild
I’ve converted FiveThirtyEight win probabilities into field goal yardages.
So, for example, Arizona – 54.75 means that the probability that Biden wins Arizona is about the same as the probability of an NFL kicker making a 54.75 yard field goal. Let’s look at the swing states in terms of how likely Biden is to win each state:
- Arizona – 54.8
- Colorado – 40.5
- Florida – 52.8
- Georgia – 63.6
- Iowa – 65.4
- Maine (Statewide) – 45.5
- Michigan – 43.2
- Minnesota – 47.0
- Nevada – 45.5
- New Hampshire – 48.9
- New Mexico – 35.29
- North Carolina – 57.9
- Ohio – 61.5
- Pennsylvania – 49.4
- Virginia – 33.0
- Wisconsin – 48.9
The longest field goal in NFL history was 64 yards.
Biden winning Oregon is a 35.3 yard field goal
To put this in perspective, Biden winning California and Oregon are about 12 and 35.3 yard field goals respectively. Biden winning Indiana and Wyoming are about 85.4 and 102 yard field goals, respectively.
Finally, overall Biden’s chances of winning right now are about a 48.9 yard field goal. Sometimes kickers miss 48.9 yard field goals!
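For anyone curious how a conversion like this could work, here is a minimal sketch. It assumes field goal success follows a logistic curve in kick distance and then inverts that curve to turn a win probability into a yardage. The post doesn't say what model was actually used, so the function names and the two coefficients below are made-up placeholders, not fitted values. Under a curve like this, very small win probabilities map to distances well past anything ever attempted, which is presumably how you end up with numbers like 85.4 and 102 yards.

```python
import math

# Hypothetical logistic model of NFL field goal success by distance.
# INTERCEPT and SLOPE are illustrative assumptions only: they are not
# fitted to real kick data and are not the values used in the post.
INTERCEPT = 5.8
SLOPE = -0.10  # assumed change in log-odds per additional yard


def make_probability(yards: float) -> float:
    """Probability of making a kick from `yards` under the assumed model."""
    log_odds = INTERCEPT + SLOPE * yards
    return 1.0 / (1.0 + math.exp(-log_odds))


def equivalent_yardage(win_probability: float) -> float:
    """Invert the model: the distance whose make probability equals the input."""
    if not 0.0 < win_probability < 1.0:
        raise ValueError("win_probability must be strictly between 0 and 1")
    log_odds = math.log(win_probability / (1.0 - win_probability))
    return (log_odds - INTERCEPT) / SLOPE


# Higher win probabilities correspond to shorter (easier) kicks.
for p in (0.95, 0.70, 0.50, 0.05):
    print(f"win probability {p:.2f} -> about {equivalent_yardage(p):.1f} yards")
```

With real kick data you would estimate the two coefficients (for example with a logistic regression) instead of typing them in.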
So, a while ago I tweeted out asking for podcast ideas. Didn’t really think I’d get anything too interesting back, but David Hess had a great suggestion: I go to a random Wikipedia page and then just see where it goes. I think this is a brilliant idea, and I’m stealing it for myself. I’m calling this “Rabbit Hole” and it will be broadcast live on Twitch. It’s perfect for me because it takes absolutely no planning and I can just do it whenever. I did it for the first time last night. (Full video is here). Here is a recap of last night’s episode.
We started here: Tideglusib
Tideglusib is a GSK-3 inhibitor, which apparently has something to do with bipolar disorder. This made me look up why they changed the name from “manic depression” to “bipolar disorder”.
That page led me to…..the Animaniacs.
Where I watched the 50 state capitals video and the Nations of the World video. In that second video they sing the lyric “both Yemens”. Apparently from 1967 to 1990 there were two Yemens! I had no idea!
Note: Animaniacs aired from 1993 to 1998. So the song was WRONG.
I then wondered who did the voice of Yakko. It’s a guy named Rob Paulsen. He was also the voice of the Teenage Mutant Ninja Turtle Raphael! He is also the guy from this classic “Got Milk” commercial! (He was a Hamilton fan before it was cool……)
Then this picture happened and I needed to look up Anthrocon.
Here is the Fursuit parade from Anthrocon 2019. It’s…..something.
So then I started reading about furries, as one does…..and apparently, there is some actual academic work being done studying furry culture. I found and started reading this paper from the journal Animals and Society. Which led me here when I was trying to figure out the difference between pan-sexual and omni-sexual.
In that furry study, they were doing some statistical tests to see if there were differences in proportions of gay, straight, and bisexual people who are involved in furry culture. What they tested was whether all the proportions were the same. I didn’t think that made sense; it would have been better to compare to the proportions in the general population. So I went and looked up what the proportions are in the general population. Interestingly, Americans tend to overestimate the size of the U.S. gay population. (Guess what it is and see how close you are. I wasn’t too far off, but I did overestimate it).
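To make the difference between those two null hypotheses concrete, here is a rough sketch of a Pearson chi-square goodness-of-fit statistic computed against two sets of expected proportions: "all categories equal" versus a set of reference proportions. The counts and the reference proportions are invented placeholders just to show the arithmetic (they are not from the paper or from any survey), and `chi_square_gof` is a helper name I made up.

```python
def chi_square_gof(observed, expected_props):
    """Pearson chi-square goodness-of-fit statistic and its degrees of freedom."""
    if len(observed) != len(expected_props):
        raise ValueError("observed and expected_props must have the same length")
    if abs(sum(expected_props) - 1.0) > 1e-9:
        raise ValueError("expected_props must sum to 1")
    n = sum(observed)
    statistic = sum(
        (count - n * prop) ** 2 / (n * prop)
        for count, prop in zip(observed, expected_props)
    )
    return statistic, len(observed) - 1


# Hypothetical sample counts for three categories (placeholders, not real data).
observed = [60, 25, 15]

# Null hypothesis 1: every category is equally common.
equal_props = [1 / 3, 1 / 3, 1 / 3]

# Null hypothesis 2: the sample matches some reference population.
# These proportions are placeholders; swap in real survey estimates.
reference_props = [0.8, 0.1, 0.1]

stat_equal, df = chi_square_gof(observed, equal_props)
stat_reference, _ = chi_square_gof(observed, reference_props)
print(f"vs equal proportions:     chi-square = {stat_equal:.2f} on {df} df")
print(f"vs reference proportions: chi-square = {stat_reference:.2f} on {df} df")
```

The same counts give different statistics depending on which null you pick, which is the whole point: "are the proportions all equal?" and "do they differ from the general population?" are different questions.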
This led me to thinking about how you estimate the size of hard to reach populations and I remembered seeing work done in the past about leveraging social networks to estimate population size. Here is some work by Tyler McCormick on this exact topic: Network-Based Methods for Accessing Hard-to-Reach Populations Using Standard Surveys.
Then the baby started crying. I’m definitely doing this again. I had a lot of fun doing it. I don’t even care if you watch. But I’d prefer if you did.
This is a work in progress. Who else belongs on this list?
- Dr. Phil
- Clay Travis
- Warren Sharp
- Alex Berenson
- Jerry Falwell, Jr.
- Gwyneth Paltrow
- John Edward
- Jim Carrey
- Ben Shapiro
- Sean Hannity
- Bill O’Reilly
- Jenny McCarthy
- Donald Trump
- Eric Trump
- Donald Trump, Jr.
- Ivanka Trump
- Jared Kushner
- Mike Pence
- Robert De Niro
- Robert F Kennedy, Jr.
- Practitioners of cupping
- Practitioners of acupuncture
- Flat Earthers
- Faith Healers
- Holocaust Deniers
- Men’s Rights Activists
- Proud Boys
- Richard Spencer
- Gavin McInnes
- Rush Limbaugh
- Alex Jones
- 9/11 Truthers
- Darren Rovell
- Dinesh D’Souza
- Climate Change Deniers
- James Woods
- Chuck Woolery
- Laura Ingraham
- Tomi Lahren
- Shiva Ayyadurai
- Kevin Hassett
- Joe Walsh
- Richard Epstein
- James O’Keefe
- Ryan Fournier
- Mike Cernovich
Group A, dubbed the “group of death” by basically anyone who closely follows sports analytics Catan players, saw @chuurveg win the first two games of the group making the third game nothing more than a victory lap for @chuurveg. Though he almost pulled off the sweep in round one, @sethwalder pulled out a victory in game 3 to advance to the round of 16.
@jacksoslow and @yuorme took all the wins in this group; the only other competitor to finish with a second was @scott_bush.
@dangroob tore through group C posting 3 wins in his 3 games. @chibearsstats also advanced on the back of 3 second place finishes. Fan favorite @statsinthewild disappointed throughout the first round. Needing a win in his third game to advance, all he could manage was 5 points and a tie for second. He’s going to have to answer a lot of questions about whether he even deserved to be in this tournament in the first place.
@cfbnate and @hausinthehouse advance out of this group in what was probably the most exciting group of the first round. Both of the first two games of this group saw ties for second at 9. After two games @stat_ron was sitting on 18 victory points with no wins. He then finished dead last in game three to not advance. His 23 victory points in the first round are the most of anyone who did not advance (tied with @kfloyd34 from group H).
Group E saw three different winners in three different games. @zbinner_NFLinj won the first game of the group and, like the injured NFL players referenced in his screen name, limped across the finish line to somehow win the group with only 22 victory points. @mtm1013 won game 2 and, after his 2nd in game 1, was very likely to advance to the next round barring some craziness in game 3. Game 3 saw @chiefsanalytics, whose Twitter name is not @arrowheadanalytics, go big for the win, but it wasn’t enough to advance. Kansas City will just have to be happy with their Super Bowl win this year because the city won’t be bringing home a Settlers of Catan Championship.
Wow. What a wild ride this group was! @punkrockscience wins the group after putting up a 3 spot in game 1. Even more bizarre, she won the group with the lowest victory point total of the group (21, 22, 23, 23)! She is the only competitor to win their group and NOT have the most total victory points. Additionally, @sabinanalytics waited until the very last minute to advance by winning game 3, leaving game 1 winner @nflsharptampa as one of only two people to win a game and not advance (@chiefsanalytics is the other). @necesarypaper, who ultimately finished 4th in the group, was actually in really good shape going into game 3: having already finished second twice, a win would have moved him into first place all alone. @neccessarypaper ends their tournament as the only player to score 7 or more points in all three of their games and not advance out of the first round.
With one game left to go in Group G, @jPohlkampHart has already won the group, but that coveted second place is the last spot available for the round of 16, and any of the other three in the group (@msubbiah, @dtrainor4, and @nickwan) can still claim it. Game 3 will be win and advance for the three unqualified contenders. If @JPohlkamHart can win game 3, @msubbiah will advance unless @dtrainor can finish alone in 2nd AND make up the 3 victory point deficit behind @msubbiah. @nickwan, the world’s greatest Catan streamer, needs a win and only a win to advance.
Last, and kind of least, group H featured @sethpartnow dominating the first two games, making the third game in the group an exhibition game for Seth. The second spot was up for grabs with a win and advance scenario for the other three in this group. Ultimately, @Hharris3419 took the other spot with a 10,9,9,6 win in the final game to move on.
Here is a summary of the first round (color is what place they finished in the group (gold, silver, bronze, and red)):
Round of 16
Group I features the winners of groups A and B and the runners-up in groups C and D.
Group J features the winners of groups C and D and the runners-up in groups A and B.
Group K features the winners of groups E and F and the runners-up in groups G and H.
Group L features the winners of groups G and H and the runners-up in groups E and F.
Here is the email that I just sent to Bob Pittman, CEO of iheartmedia.com, and Eric Shanks, Fox Sports CEO. I encourage you all to send emails to Bob and Eric expressing your dismay for giving Clay Travis a platform to spread this type of dangerous misinformation about coronavirus. You can find their emails here:
Bob and Eric,
I’m writing to you today in regards to Clay Travis. Recently, Clay has been writing articles (such as this: https://www.outkickthecoverage.com/coronavirus-infections-are-likely-to-peak-next-week/) related to Coronavirus that contain misinformation about the pandemic. Given his large readership, this is incredibly irresponsible of your companies to give him a platform to spread dangerous falsehoods about the disease. When he makes “hot takes” about sports, it doesn’t matter whether he is right or wrong. But this is literally a matter of life or death. As a result, I’m asking you to please stop giving Clay an outlet to spread misinformation about the worst crisis of our lifetimes. Let him write about politics or sports or whatever else he wants. But please stop giving him an outlet to spread his Coronavirus falsehoods.
For reference, I’ve written a blog post with everything that Clay got wrong in his article from March 18 where he seriously downplays the risk of this virus: https://statsinthewild.com/2020/03/30/clay-travis-said-another-stupid-thing-and-this-time-it-actually-matters/
Gregory J. Matthews, Ph.D.
Director of Data Science Program
Department of Mathematics and Statistics
Here is a tweet I saw today. It’s from April 7. It’s from Tomas Pueyo, the author of this Medium post on Coronavirus from March 10. (Tomas Pueyo is not a public health expert). When I first read this, I thought, wait, that doesn’t seem right, and I tried to reproduce his numbers. To get to 36%, Pueyo is simply dividing 1.8/5 = 0.36. But is that right?
What Pueyo is trying to do here (I think) is calculate the conditional probability that someone dies given that they are in the ICU. That is:
P(Death|ICU)
By the definition of conditional probability, this is equal to:
P(Death|ICU) = P(Death and ICU) / P(ICU)
So based on this calculation, Pueyo is implying that P(Death and ICU) = 0.018 and P(ICU) = 0.05. However, he is using an estimate of P(Death) rather than the correct probability of P(Death and ICU) in his numerator. While these numbers may be close, they are almost certainly not equal to each other:
P(Death and ICU) = P(ICU|Death) × P(Death) ≤ P(Death)
The only way that these two numbers would be equal to each other is if P(ICU|Death) = 1. This seems unlikely. There are certainly people dying from Covid-19 that never make it into the ICU.
So even if Pueyo’s numbers are completely correct, this calculation is almost certainly an overestimate based on basic intro probability rules.
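Here is a quick sketch of that argument in code. It takes the two numbers from the tweet, P(Death) = 0.018 and P(ICU) = 0.05, and shows how P(Death|ICU) changes with P(ICU|Death), the share of deaths that pass through the ICU. The values of P(ICU|Death) other than 1 are arbitrary values I picked for illustration, not estimates of anything.

```python
P_DEATH = 0.018  # P(Death), the figure used in the tweet
P_ICU = 0.05     # P(ICU), the figure used in the tweet


def p_death_given_icu(p_icu_given_death, p_death=P_DEATH, p_icu=P_ICU):
    """P(Death|ICU) = P(ICU|Death) * P(Death) / P(ICU)."""
    if not 0.0 <= p_icu_given_death <= 1.0:
        raise ValueError("p_icu_given_death must be between 0 and 1")
    p_death_and_icu = p_icu_given_death * p_death
    return p_death_and_icu / p_icu


# 1.0 reproduces the tweet's 36%; the smaller values are illustrative guesses.
for share in (1.0, 0.75, 0.5):
    print(f"P(ICU|Death) = {share:.2f} -> P(Death|ICU) = {p_death_given_icu(share):.3f}")
```

The 36% figure only comes out when every single death goes through the ICU. Any smaller share pulls the answer down.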
And I’d like to add two additional points:
- We don’t actually know what the mortality rate of Covid-19 is because we don’t have accurate counts of either the numerator (how many deaths) or the denominator (how many cases) in a mortality rate calculation. So this number has a ton of error in it.
- We already know that age is a huge predictor in mortality rates for Covid-19, so to not include that in the calculation of probability of death of a specific individual of known age makes no sense.
So basically this entire calculation is meaningless. So my advice is let’s stop playing armchair epidemiologist, especially if you can’t even get very basic probability rules correct!
On March 26 Slate posted an article by Tim Requarth entitled “Please, Let’s Stop the Epidemic of Armchair Epidemiology” (which I should mention, convinced me to stop posting my COVID-19 analysis on this very blog). One of the posts mentioned in the Slate piece was this piece by Abe Stanway on March 14 called “Real Time COVID-19 Tracking”. There has since been some back and forth between Requarth and Stanway on twitter, and I decided to take a look for myself at Stanway’s blog post. The post has been reproduced below with my comments in bold. (tl;dr This isn’t a very good article. It’s extremely ad hoc and riddled with statistical shortcomings).
Note: Two things: 1) I don’t want people to stop blogging. I’m not trying to “dunk” on anyone here. (I save that for Clay Travis). Everyone should write more, including Abe Stanway (who seems like a really, really smart guy)! But this topic is serious, and there is so much misinformation out there. Please, please, please, get your information from the experts. I know it may seem like everyone out there can do this stuff, but it’s really complicated. So please, leave it to the experts.
2) Again the title data scientist rears its head. The author of the article calls himself a data scientist, but he seems to be lacking some very basic statistical skills (based only on what I’m seeing in this article, so I could be completely wrong). I’m sure Abe Stanway is very talented (I mean, this is straight up really impressive), but I think what I would define as a data scientist and how he interprets the title are very different. But I suppose that is a discussion for a different blog post.
Real Time COVID-19 Tracking
EpiQuery is a realtime “influenza-like illness” (ILI) tracker. It’s updated on a daily basis. The system was set up in 2016 to track emergency room visits with chief complaints that mention flu, fever, and sore throat. These are not confirmed nor denied to be influenza, nor any other disease. Meanwhile, the US government has severely bungled the COVID-19 test rollout, and we’ve only tested ~15k people total so far. In the absence of widespread testing, we need to rely in EpiQuery (and ILINet, the federal CDC version covering all 50 states) to understand the likely growth of the COVID-19 outbreak.
I agree that the US government has bungled the COVID-19 test rollout.
I wouldn’t use the word NEED when saying “need to rely on EpiQuery”. At this point in the article I’m already skeptical. I think the author is speaking with too much certainty already. On March 14, even now, there is so much we don’t know about this disease. We still seem to be learning new symptoms.
I’m going to nit pick here a bit, but I wouldn’t look at the day with the single highest number of cases and call that the peak of flu season. That’s likely noise. It would be better to smooth the data using a moving average possibly and then look for the peak of the moving average.
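As a rough illustration of what I mean by smoothing first, here is a small sketch that compares the peak of the raw series to the peak of a centered moving average. The daily counts are made up to include a one-day spike (they are not the EpiQuery data), and the window length of 5 is an arbitrary choice.

```python
def centered_moving_average(values, window=5):
    """Centered moving average; the window is truncated at the ends of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def peak_index(values):
    """Index of the largest value (the first one if there are ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up daily visit counts with a one-day spike at index 3.
daily_visits = [100, 105, 110, 190, 115, 125, 140, 160, 175, 180, 170, 150, 130]

smoothed_visits = centered_moving_average(daily_visits, window=5)
print("raw peak at day index:     ", peak_index(daily_visits))
print("smoothed peak at day index:", peak_index(smoothed_visits))
```

In this toy series the raw maximum lands on the isolated spike, while the smoothed maximum lands in the middle of the sustained rise, which is the thing you would actually want to call the peak.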
The chart above shows daily ILI ER visits in NYC. The seasonal peaks are highlighted in pink. We see a very seasonal pattern in the data — every year, there’s generally one major peak in December or January, followed by a gradual decrease in ILI cases. This is the annual flu season, visualized.
This is the same data, but zoomed in on 2020. We see our normal seasonal peak on January 29th, and then we see a marked anomaly starting at around March 1st. The anomaly displays a peak of equal magnitude to the regular seasonal peak.
There does appear to be something going on here, but what I would like to see is some sort of analysis showing that this is a real signal and not just noise in the data. It’s not enough to just say it looks like something is happening.
A double peak flu season appears to be exceedingly unlikely, as it has never occurred in any historical flu season since the start of this data (at least in NYC), nor has it ever occurred with a slope of this magnitude. Therefore, I believe a large percentage of this peak indicates COVID-19 ER visits in NYC, and not nominal flu visits.
This is where this all starts to fall apart. He uses the term “exceedingly unlikely” to describe the probability that this is just a normal flu. What is that based on? 4 flu seasons in NYC. That first sentence starts out making a really strong claim and it just gets weaker and weaker……”it has never occurred in any historical flu season”…….WOW that seems convincing………..”since the start of this data”………….Oh, well that’s only 4 years……….”(at least in NYC)”…………….and it’s only 4 years of one city. So to say that this is “exceedingly unlikely” to be just a normal flu isn’t supported by anything. We have no idea what other flu seasons have looked like based on this data.
Also, are double peaked flu seasons rare? Nope. According to William Schaffner, MD, an infectious disease specialist at Vanderbilt University in Nashville:
We may well have, for the second year in a row — unprecedented — a double-barreled influenza season
The 2018-2019 season has been unusual, though, because the flu came in two waves: one that peaked at the end of December, and a second that peaked in early March. The two peaks were caused by two different strains of the flu virus, and the protection given by vaccination early in the season may have waned by the time the second strain appeared.
So literally, just last flu season we saw a double peak season. So, again, to describe the possibility that this peak is just regular flu as “exceedingly unlikely” is just not a statement supported by any data.
As a side note, I wonder what other diseases are out there that cause influenza like illness (ILI) symptoms besides actual flu and Covid-19? I have no idea.
Another note, this peak could very well be COVID-19. But there are other possibilities to explain this peak that the author has basically ignored. The most notable being that he claims it can’t possibly be regular flu when literally last year there was a double peaked flu season where the second peak was in early March. So I’m not convinced that this is actually a COVID-19 spike, but it could be.
The fact that the peak starts at around March 1st, and the fact that this was also the date first confirmed case of COVID-19 in NYC, lends further evidence to support that this spike represents COVID-19 cases.
Yeah, but it could just be regular flu. The author is speaking with too much certainty.
The data above represents daily ER visits. This means that since March 1st, there have been 8,000 cases of ILI-based ER visits in NYC. Subtracting the nominal flu season data (~3,800 cases over this period, assuming a late season R0 of .95), that means there are likely a minimum of 4,200 COVID-19 cases in NYC as of March 12th.
Ok, this is as armchair-y as it gets. Where does the 3,800 number come from? Where does the R0 = 0.95 come from? Is that a widely accepted value? No work is shown for how the author arrives at 4,200 COVID cases. The assumptions that the author is making are numerous and substantial. For starters, he is assuming that every single one of these “excess” cases is COVID-19 (again ignoring the fact that there was a 2 peaked flu season just last year). Perhaps the most head scratching part of this whole thing comes in the next graph.
This analysis should be considered a napkin sketch — a more detailed study could estimate the precise start date in NYC, knowing the R0 of COVID-19 is estimated to be 2.2¹ and working backwards to infer when when Patient 0 actually arrived based this parameter and the current curve in red.
He does claim that this should be considered a “napkin sketch”. So he is giving caveats. But then why even do this? Why spend the time on this? Is it worth it to spread “napkin sketch” information about a pandemic? I lean towards no.
He does give a citation for 2.2 R0 for COVID-19, which I appreciate.
The true number of COVID-19 cases in NYC is likely several times higher (given the fact that not all cases present to an ER, and ER cases that are not admitted are sent home without any proper quarantine protocols — aka they are sending people home in Ubers or subways), but I will refrain from speculating on an exact number until I find more data. However, assuming the exponential curve holds, the current case count as of March 14th is around 6,300. Despite the napkin math, this data indicates that NYC is currently adding around 1,000 ER admissions of COVID-19 per day and growing fast.
I think this is a loose usage of the term “exponential curve”. He doesn’t check at all if the curve is actually exponential. This has a formal mathematical meaning, and the informal use of the term “exponential” often means it’s just growing “really fast”.
Also, “adding 1000 ER admission per day” wouldn’t be exponential growth. That’s linear growth.
Finally, where does 6,300 come from? Is this an extrapolation? I don’t follow where that comes from?
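On the linear versus exponential point, here is a tiny sketch of the difference. A series that adds a fixed number each day has constant day-over-day differences; a series that grows exponentially has constant day-over-day ratios. Both series are synthetic, and the 30% daily growth rate is an arbitrary number I picked, not an estimate for NYC.

```python
def daily_increments(counts):
    """Day-over-day differences: constant for linear growth."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


def daily_ratios(counts):
    """Day-over-day ratios: constant for exponential growth."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


# Synthetic series for illustration only.
linear = [1000 * day for day in range(1, 8)]            # adds 1,000 every day
exponential = [1000 * 1.3 ** day for day in range(7)]   # assumed 30% growth per day

print("linear increments:     ", daily_increments(linear))
print("linear ratios:         ", [round(r, 2) for r in daily_ratios(linear)])
print("exponential increments:", [round(d) for d in daily_increments(exponential)])
print("exponential ratios:    ", [round(r, 2) for r in daily_ratios(exponential)])
```

A simple check on real data is to look at the counts on a log scale: exponential growth shows up as roughly a straight line, and anything else doesn't.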
BONUS (or, this is where it gets weird):
Below is a breakdown of the cases by neighborhood.
The epicenter appears to be somewhere in Queens.
This is a neighborhood called Corona. You just can’t make this shit up. Edit: yes, I’m fully aware Corona is *always* the highest density of ILI symptoms. This is likely due to the concentration of hospitals in the area. Regardless, this is a joke, and if you take it seriously, you should get out of the house more (once your isolation period is over, of course!)
If this is a joke, it’s hard to tell. And it’s an odd place to put a joke. The entire article is serious in tone, and then the author just throws this in there at the end and claims that this is a joke? I don’t know man. In a different context, maybe it would be more obvious that this is a joke. But it’s not obvious to me that it’s a joke.
Also, looking at raw cases means pretty much nothing. Gotta account for baseline population levels.
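A minimal sketch of what accounting for population looks like: divide the case count by the population and scale to a rate per 100,000 people. The neighborhood names, counts, and populations below are all hypothetical placeholders, not NYC data.

```python
def rate_per_100k(cases, population):
    """Cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 100_000


# Hypothetical (cases, population) pairs for illustration only.
neighborhoods = {
    "Neighborhood A": (300, 150_000),
    "Neighborhood B": (120, 40_000),
}

for name, (cases, population) in neighborhoods.items():
    print(f"{name}: {cases} cases, {rate_per_100k(cases, population):.0f} per 100k")
```

In this made-up example, the neighborhood with more raw cases has the lower rate, so a map of raw counts and a map of rates would point to different places.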
All data are available for analysis here. Additional data nationwide can be found on ILINet FluView. Thank you to Ben Hunt for discovering this trove of data, and to Dr. Alfred Illoreta of Mount Sinai and Dr. Ydo Wexler of Amperon for reviewing drafts of this post.
I recently wrote a blog post explaining just how full of shit Clay Travis is about coronavirus. On March 18, he wrote an article (which I won’t link to here because I’m not giving him clicks) where he said:
Loss of life will be in the thousands, at most, and not the tens of thousands or the hundreds of thousands or the millions as the most terrifying of these forecasts have suggested.
As of right now, according to Johns Hopkins, there have been 9,619 deaths related to coronavirus, with FOUR straight days of 1000+ deaths. So we should hit 10,000 today, which would make Clay Travis, who, let me remind you, is wildly unqualified to make projections like this, unequivocally wrong. As a result he apologized to his readers for misleading them and said he would do better in the future.
JUST KIDDING! Of course, he didn’t do that. He tweeted this today:
Clay. Clay. Let me be very clear about this. I’ll speak slowly because you don’t seem very bright. The millions of deaths models were based on the assumption that we did NOTHING to stop the spread of the disease. But we did. We’ve been socially distancing, against your advice, mind you, for a few weeks here in Chicago. And a bunch of other places have been doing it too. Most of America in fact.
And remember, the current projection for number of deaths is between 100K and 240K…..IF we do everything nearly perfectly.
So. Clay. I know you truly believe that you are right. But you aren’t. You’re intentionally being a moron. Please stop.
He followed that tweet up with this gem:
Clay Travis should be the first one to get one, because the reported deaths are still growing exponentially (It’s possible that the growth rate of reported deaths is slowing down ever so slightly, which is good but definitely not a trend yet). And that’s REPORTED deaths, which are almost certainly an undercount of the total deaths from Covid-19. Just take a look at the reported deaths graph below on the log scale. We are going to go crashing through that limit, literally TODAY, where Dr. Travis, M.D. said we’d never get to:
The moral of the story, as always, is that Clay Travis is full of shit. He doesn’t know what he is talking about in regards to coronavirus. He also doesn’t know what he’s talking about when it comes to football. But in that case it doesn’t matter. It really matters in this case. So stay safe and stay at home. And stop listening to Clay Travis.