## Euro 2020: Just some random thoughts.

Below are Luke Benz’s Euro Cup 2021 predictions from June 10. You should all be following Luke.

Some thoughts on June 28:

- The Czech Republic only had a 15% chance of reaching the quarterfinals! And the Netherlands had the second highest probability of making the quarterfinals.
- Belgium will play Italy in the quarterfinals. That could have been a final. Meanwhile, as Luke pointed out to me, either Denmark or Czech Republic will end up in the semi-finals.
- England will not win this tournament.
- Totally unrelated to this, it’s always weird when I remember that GREECE won Euro 2004. GREECE!

## A Bayesian marked spatial point processes model for basketball shot chart

This paper from the December 2020 issue of JQAS is wonderful: A Bayesian marked spatial point processes model for basketball shot chart.

Simply put, they build a model of where players take their shots and then, given a location, how often they make shots from that location.

I’m particularly interested in this point from the paper:

> The preferred models for all four players, which are intensity independent model for Curry and intensity dependent model for other three players, can reduce the MSE by 2.7, 1.3, 2.0, and 7.0%.

I think the correct way to interpret this is that three of the players analyzed have different chances of making a shot based on where the shot is taken. But for Curry, the probability he makes a shot is INDEPENDENT of where he is taking the shot. Basically he’s just good everywhere. (If this is NOT the correct interpretation, let me know!)

I’d love to see the analysis expanded to all players in the league and see who else would end up with an intensity independent model.

Cheers.

## I made a useful shiny app (and you can too!)

I made a shiny app for organizing open mics in Chicago. (Yes, I’ve started doing stand-up. No, I’m not good……yet.)

The code for making this app can be found on my github.

And the wonderful Shiny cheatsheet can be found here.

The real key for me to get this to work was the addition of the global.R file. I didn’t realize you could add this along with the ui.R and server.R files. I HIGHLY recommend the global.R file in your shiny apps. I’m going to use this as an example of a shiny app in my Data Science 101 course that I am developing and will teach for the first time Spring 2022.

Cheers.

## Aging Curves in Baseball

So I’m working on a research project on aging curves in baseball with two of my students here at Loyola. While there has been quite a bit of work done on aging curves in many sports, including baseball, the question we are interested in is this: What would the aging curve look like if players played every season from age 22 through 40? What we actually observe are only the players who “survive”. What WOULD have happened if a player who was forced out of the league at the age of 30 had played until they were 40? We view this as a missing data problem and are currently using multiple imputation with a hierarchical structure to impute the missing seasons, and then estimating the aging curves from the imputed data. I’d like to do the aging curve estimation using functional data analysis, but……we’ll see.
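To make the “survivor” problem concrete, here is a toy simulation. It is NOT our actual model (no imputation, no hierarchy, no real data). Every number in it (the quadratic decline, the peak at 27, the cutoff, the noise levels) is a made-up assumption, chosen only to show how the observed curve drifts away from the everyone-plays-to-40 curve.

```python
import random
import statistics

AGES = range(22, 41)


def simulate_careers(n_players=5000, cutoff=-1.5, seed=0):
    """Simulate seasons for ages 22-40 under hypothetical, made-up parameters.

    Returns (full, observed): dicts mapping age -> list of performances.
    `full` keeps every season for every player (the counterfactual);
    `observed` keeps only seasons played before a player is forced out.
    """
    rng = random.Random(seed)
    full = {age: [] for age in AGES}
    observed = {age: [] for age in AGES}
    for _ in range(n_players):
        talent = rng.gauss(0, 1)  # assumed player-level talent offset
        active = True
        for age in AGES:
            # Assumed true curve: quadratic decline around a peak at 27, plus noise.
            perf = talent - 0.01 * (age - 27) ** 2 + rng.gauss(0, 0.5)
            full[age].append(perf)
            if active:
                observed[age].append(perf)
                if perf < cutoff:
                    active = False  # a bad season ends the career
    return full, observed


def mean_curve(by_age):
    """Average performance at each age that has at least one season."""
    return {age: statistics.mean(v) for age, v in by_age.items() if v}


if __name__ == "__main__":
    full, observed = simulate_careers()
    true_curve, seen_curve = mean_curve(full), mean_curve(observed)
    for age in (22, 30, 35, 40):
        print(age, round(true_curve[age], 2), round(seen_curve[age], 2), len(observed[age]))
```

In this toy setup the observed average at older ages sits above the counterfactual average, because only the better players are still around to be measured. That gap is exactly what the missing-data framing is trying to correct.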

Anyway, I’ve started doing some lit review for this and I figured I’d post some of the interesting articles that I’ve found related to the topic:

Albert (1992) looks at estimating models for home run rates, and as part of this Albert incorporates an aging curve into his model. A quadratic form is assumed for the aging curve.

Berry et al. (1999) incorporate an aging curve into their analysis, but instead of a quadratic form they use a nonparametric model. They looked at hockey, golf, and baseball. (Albert (1999), in a comment, argues against the aging model presented in Berry et al. (1999).)

Fair (2008) looks at aging curves in baseball and follows from previous work that looked at aging curves in running, swimming, and chess. (Fair (2007)).

Wakim and Jin (2014) take a functional data analysis approach to the problem and look at MLB and the NBA. This is probably the most sophisticated statistical analysis that I have seen so far in regards to aging curves.

Dendir (2016) in the Journal of Sports Analytics looks at when soccer players peak and, based on their analysis, found that players in top leagues peak somewhere between 25 and 27.

Vaci et al. (2019) looked at aging curves in NBA players.

This is clearly not an exhaustive list of papers related to aging curves in sports, but these are some of the interesting papers that I’ve come across so far.

## Chutes and Ladders

So my 4-year-old daughter was sick yesterday and I spent part of the afternoon playing chutes and ladders with her. (She won every game because there is apparently a law in my house that the sick child always wins the game.)

So I got to thinking about how many turns the typical game of chutes and ladders takes. And also, what’s the fewest number of spins you need to finish the game?

So like every good father, I wrote a simulation!

So first, I wanted to know the distribution of the number of turns it takes a single player to complete the game. I simulated the game 10000 times and found a median of 30 spins with an average of 35.84 spins. In the 10000 simulations I performed the largest number of spins was 243 (I would have quit at about 50 spins) and the lowest number was 7 spins, which happened 21 times in the 10000 simulations.

You can see a histogram of the distribution of the number of turns it would take a single player to complete the game.

And for fun, here is one way to win the game in 7 moves. The spins in the game below are 1, 6, 6, 1, 1, 6, 6. (Some other ways to do it include: (4, 6, 2, 6, 6, 6, >3) and (4, 6, 3, 5, 6, 5, 6), with this second one actually including the player hitting a slide!)

But most people don’t play chutes and ladders by themselves. So how long will the game take before anyone you are playing with wins the game? If you have two people, the average number of turns is 24.1 with a median of 21 turns. Three players will last an average of 19.9 turns with a median of 18, and four players will average 17.66 turns with a median of 54.
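If you want to play along at home, here is a minimal sketch of this kind of simulation. It is not my actual code. The board (the `LADDERS` and `CHUTES` dictionaries) is assumed to be the standard 100-square layout, the spinner is assumed to go from 1 to 6, and overshooting square 100 is assumed to count as a win. Different rules or a different board will give different numbers.

```python
import random
import statistics

# Assumption: standard 100-square board; not copied from the original code.
LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 47: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}
JUMPS = {**LADDERS, **CHUTES}
FINAL_SQUARE = 100


def move(position, spin):
    """Advance by one spin, then follow any ladder or chute on the landing square."""
    position += spin
    if position >= FINAL_SQUARE:  # assumption: overshooting 100 still wins
        return FINAL_SQUARE
    return JUMPS.get(position, position)


def play_sequence(spins):
    """Replay a fixed list of spins from the start and return the final square."""
    position = 0
    for spin in spins:
        position = move(position, spin)
    return position


def spins_to_finish(rng):
    """Number of spins a single player needs to reach square 100."""
    position, spins = 0, 0
    while position < FINAL_SQUARE:
        position = move(position, rng.randint(1, 6))
        spins += 1
    return spins


def turns_until_someone_wins(n_players, rng):
    """Players never interact, so the game ends when the fastest player finishes."""
    return min(spins_to_finish(rng) for _ in range(n_players))


def simulate(n_players, n_games=10000, seed=1):
    """Return (mean, median, minimum) of game length over n_games simulated games."""
    rng = random.Random(seed)
    results = [turns_until_someone_wins(n_players, rng) for _ in range(n_games)]
    return statistics.mean(results), statistics.median(results), min(results)


if __name__ == "__main__":
    print(play_sequence([1, 6, 6, 1, 1, 6, 6]))  # the 7-spin game described above
    for players in (1, 2, 3, 4):
        print(players, simulate(players))
```

The multiplayer shortcut works because nobody’s position affects anyone else’s: the length of a game is just the minimum of the individual finishing times.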

So, that’s it. Really important summer stuff that I’ve been doing.

If you are interested, here is another article about chutes and ladders, and my code is here on github.

Cheers.

## Let’s clear up what efficacy means when we talk about vaccine efficacy

## What does 95% efficacy even mean?

The Pfizer Covid-19 vaccine has an efficacy rate of 95%. The Moderna Covid-19 vaccine has an efficacy of 94.1%. The Johnson and Johnson Covid-19 vaccine has efficacy of 66.3%.

But what does this MEAN?

In my casual observation, it seems to me that there are a lot of people who see these numbers and think, quite reasonably, that 95% effective means that 5% of the people who get the vaccine will get Covid-19. Or, if you were to get the Johnson and Johnson vaccine, there is still a 33.7% chance that you’ll get Covid-19. So, they then make the argument that if there is still about a 1 in 3 chance that you’ll get Covid even AFTER the vaccine, why even bother getting the vaccine?

Well, that’s not a correct interpretation of efficacy rate.

I will illustrate this with some simple examples.

**Example 1**

Let’s say that we find 10,000 people and we inject them with a placebo. And we find another 10,000 people and we inject them with a vaccine. We follow all 20,000 for 90 days to see if they develop the disease of interest (in this case Covid-19).

Let’s say that 5,000 people who received the placebo get the disease while only 250 of the vaccinated group get the disease. In this case we have the following quantities:

Incidence rate UNvaccinated: 5,000 / 10,000 = 0.5 (or 50%)

Incidence rate vaccinated: 250 / 10,000 = 0.025 (or 2.5%)

(Note: Incidence rates are also known as “attack rates”. I didn’t know that until this morning. I’ve always just called these incidence rates).

Now using these incidence rates, we can calculate something called relative risk (RR):

RR = Incidence rate vaccinated / Incidence rate UNvaccinated = 0.025 / 0.5 = 0.05

The efficacy is then defined as 1 – RR = 1 – 0.05 = 0.95 (or 95%).

So in this scenario the vaccine was “95% effective” while 2.5% of the vaccinated group developed the disease.

*(Note: You can also calculate efficacy this way and get the exact same answer:*

*Efficacy = (Incidence rate UNvaccinated – Incidence rate vaccinated) / Incidence rate UNvaccinated = (0.5 – 0.025) / (0.5) = 0.95*

*It’s exactly the same result.)*

**Example 2**

Let’s look at a second example with the same initial set up: we find 10,000 people and we inject them with a placebo. And we find another 10,000 people and we inject them with a vaccine. We follow all 20,000 for 90 days to see if they develop the disease of interest (in this case Covid-19).

Let’s say that 100 people who received the placebo get the disease while only 5 of the vaccinated group get the disease. In this case we have the following quantities:

Incidence rate UNvaccinated: 100 / 10,000 = 0.01 (or 1%)

Incidence rate vaccinated: 5 / 10,000 = 0.0005 (or 0.05%)

Now using these incidence rates, we can calculate something called relative risk (RR):

RR = Incidence rate vaccinated / Incidence rate UNvaccinated = 0.0005 / 0.01 = 0.05

The efficacy is then defined as 1 – RR = 1 – 0.05 = 0.95 (or 95%).

So in this scenario the vaccine was ALSO “95% effective” while only 0.05% of the vaccinated group developed the disease.
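If you would rather check the arithmetic in code, here is a small sketch that reruns both examples. The function names are just ones I made up for illustration. The numbers are the hypothetical counts from the two examples above, not real trial data.

```python
def incidence_rate(cases, group_size):
    """Fraction of a group that developed the disease (the "attack rate")."""
    return cases / group_size


def vaccine_efficacy(cases_vaccinated, n_vaccinated, cases_placebo, n_placebo):
    """Efficacy = 1 - relative risk, where RR = vaccinated rate / placebo rate."""
    relative_risk = (
        incidence_rate(cases_vaccinated, n_vaccinated)
        / incidence_rate(cases_placebo, n_placebo)
    )
    return 1 - relative_risk


# (cases among vaccinated, cases among placebo), 10,000 people per group
examples = {"Example 1": (250, 5000), "Example 2": (5, 100)}

for name, (vaccinated_cases, placebo_cases) in examples.items():
    rate = incidence_rate(vaccinated_cases, 10_000)
    efficacy = vaccine_efficacy(vaccinated_cases, 10_000, placebo_cases, 10_000)
    print(f"{name}: {rate:.2%} of the vaccinated group got sick, efficacy = {efficacy:.0%}")
```

Very different incidence rates, same efficacy, because efficacy depends only on the ratio between the two groups.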

**Takeaways**

- In the first example given here, 2.5% of the vaccinated group developed the disease, and in the second example, 0.05% of the vaccinated group developed the disease, but in BOTH EXAMPLES the efficacy was 95%.
- Vaccine efficacy is a RELATIVE reduction in risk when compared to a placebo group.
- There are many different incidence rates that will result in a 95% efficacy.
- This is why a vaccine that has efficacy of 50% is really an incredible vaccine. It doesn’t mean that 50% of the people who get the vaccine will get the disease; it means that the relative risk has been reduced by 50%! Which is a ton!
- Someone should get on national television and explain this to the American people.

Further reading:

- https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(21)00075-X/fulltext
- https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section6.html
- https://www.medicalnewstoday.com/articles/what-is-vaccine-efficacy

Cheers.

## Madness of March: Fun facts!

Some fun facts about the Sweet 16 in the 2021 NCAA tournament:

- The average seed of a team in the Sweet 16 this year is 5.875.
- The only seeds not represented in the Sweet Sixteen are 9, 10, 13, 14, and 16.
- This means there are 11 unique seeds represented in the Sweet Sixteen. There have only been 11 unique seeds in the Sweet Sixteen twice before: 1986 and 1990. In 1990, the Sweet Sixteen was missing the 9 seed and 13, 14, 15, and 16. In 1986, the 9, 10, 13, 15 and 16 seeds were missing.
- The seeds 1-8 are all represented in this year’s Sweet Sixteen. That has only happened five times before: in 1986, 1990, 2000, 2004, and 2008. In 2004, the seeds 1 through 10 were all represented.

- There are 9 teams in the Sweet Sixteen with a seed of 5 or higher (i.e. teams that “shouldn’t have made it”). This has happened only 4 times before: in 1990, 1996, 2000, and 2018.
- Before this year, there had NEVER been a Sweet Sixteen (going back to 1985) with 4 teams seeded 11 or higher. This year we have two 11’s, a 12, and a 15. Three teams with a seed of 11 or higher have made the Sweet Sixteen in four tournaments (1985, 1986, 2011, 2013).
- In 1999, there were 5 teams with a double digit seed in the Sweet Sixteen, the most ever. This year there are 4.
- The sum of the seeds in the Round of 32 was 210 this year. That is tied for the second highest ever (in 2016 the sum was 215 (TEN double digit seeds won their first round game!) and in 2012 the sum was 210).

- The sum of the seeds in the Sweet 16 is 94, the highest ever. The second highest was 89 in 1986. The lowest sum of the seeds in a Sweet Sixteen was 49 in 2009. (Note that the lowest possible is 40.)

- Here is a density of the seeds in the Sweet 16. The black line is 2021. The flatter this estimate, the more “madness”. A high peak on the left with a heavy right skew would indicate a very “chalky” tournament.

- Finally, here are the empirical CDFs of past tournaments with 2021 in red. The more “madness”, the lower this curve will be. (I think the area under this curve would be an interesting way to measure the “madness” of a tournament.)
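Here is a rough sketch of how these numbers can be computed. The seed list below is an assumption: I reconstructed it so that it matches the facts listed above rather than pulling it from the data. The function names are made up for illustration too. The “area under the ECDF” is taken to be the area under the step function over seeds 1 through 16, which is only one of several reasonable ways to define it.

```python
# Assumption: 2021 Sweet 16 seeds, reconstructed to match the facts listed above.
SWEET_16_SEEDS_2021 = [1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 11, 11, 12, 15]


def seed_summary(seeds):
    """Basic 'madness' facts for one Sweet 16."""
    return {
        "sum": sum(seeds),
        "average": sum(seeds) / len(seeds),
        "unique_seeds": len(set(seeds)),
        "missing_seeds": sorted(set(range(1, 17)) - set(seeds)),
        "seeded_5_or_higher": sum(s >= 5 for s in seeds),
        "double_digit": sum(s >= 10 for s in seeds),
    }


def ecdf_area(seeds, max_seed=16):
    """Area under the step ECDF: the sum over k = 1..max_seed of P(seed <= k)."""
    n = len(seeds)
    return sum(sum(s <= k for s in seeds) / n for k in range(1, max_seed + 1))


if __name__ == "__main__":
    chalk = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4  # all favorites advance
    print(seed_summary(SWEET_16_SEEDS_2021))
    print(ecdf_area(SWEET_16_SEEDS_2021), ecdf_area(chalk))
```

One thing worth knowing: with this definition the area works out to 17 minus the average seed, so ranking tournaments by ECDF area gives the same ordering as ranking them by the sum of the seeds. It is a different picture of the same “madness”, not a new measure.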

Buy my NFTs. Or buy an actual print if you are still into owning physical things.

Cheers.

## Nolan Arenado

The Rockies just traded Nolan Arenado to the Cardinals for pitchers Austin Gomber, Tony Locey, and Jake Sommers and infielders Mateo Gil and Elehuris Montero. And the Rockies will CONTINUE TO PAY Arenado even though he’s playing in St. Louis.

So this trade is…..to put it bluntly….fucking stupid. Nolan Arenado is one of the best players in baseball. I know this. You know this. But I didn’t realize, and maybe you didn’t realize, how fucking good he is. So I got out the Lahman database (shoutout to Sean Lahman) and checked for myself how good he is.

Turns out Nolan Arenado is *even better* than I thought he was.

Let’s play a little game: In the last 5 full seasons (2015-2019) what player had the most hits?

The answer is Charlie Blackmon with 940 hits. The next three players with the most hits are Jose Altuve (938), Mookie Betts (910), and…..Nolan Arenado (906). No other player has more than 900 hits in the last 5 full seasons. Altuve and Betts are the 2017 and 2018 AL MVPs, respectively. And, more importantly, Charlie Blackmon has a dope ass beard. (Compared to my beard which you can think of as some sort of reference level).

Ok. Next question. Who has the most HR in the last 5 complete seasons?

If you guessed Nolan Arenado, you’re an idiot. It’s Nelson Cruz with 204, you absolute complete incompetent. Actually, Arenado isn’t a bad guess. BECAUSE HE WAS SECOND! He had 199 HR over this period (plus 8 more in the Covid-shortened 2020 season).

Let’s do it again with runs. Arenado is 4th with 519 runs. The three players ahead of him: Betts, Blackmon, Trout.

How about doubles? Arenado is 4th with 190, behind Betts, Bogaerts, and Castellanos.

Ok. Last one: Who has the most RBI in the last 5 complete seasons? Number 10 is Albert Pujols with 472. From 9 through 2 it goes: Khris Davis (474), Bryce Harper (486), Jose Abreu (504), Paul Goldschmidt (505), J.D. Martinez (509), Anthony Rizzo, Nelson Cruz (522), Edwin Encarnacion (538). Guess who is number 1, motherfucker: Nolan Goddam Arenado. With 621 RBI! That’s 83 (EIGHTY THREE!!!!) more than the next closest. That’s 16.6 more RBI on average per year than the guy who is number 2!

Arenado’s RBI numbers are bonkers. From 2015 through 2019 he had 130, 133, 130, 110, 118 (with 26 in 2020).
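If you don’t trust my arithmetic, here is a quick check using only the numbers quoted above. The variable names are just for illustration, and this is not the Lahman query itself.

```python
# Season-by-season RBI totals quoted above (2020 excluded).
arenado_rbi = {2015: 130, 2016: 133, 2017: 130, 2018: 110, 2019: 118}
runner_up_rbi = 538  # Edwin Encarnacion, per the list above

total = sum(arenado_rbi.values())
gap = total - runner_up_rbi
gap_per_season = gap / len(arenado_rbi)

print(total, gap, round(gap_per_season, 1))
```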

Holy. Shit. The Rockies aren’t even pretending to try to win. This trade sucks. Sucks. Sucks.

You can find my code, as always, here.

Cheers.