The Harvard Sports Analysis Collective
By Ben Blatt
In 1964, Mosteller and Wallace published Inference and Disputed Authorship: The Federalist. The study used statistical analysis to try to determine whether James Madison, Alexander Hamilton, or John Jay wrote the disputed essays in The Federalist Papers. They approached this historical mystery using differences in word frequencies and Bayesian statistics. While controversial, similar methods have been used to investigate other authorship debates, such as those over Shakespeare’s sonnets and plays. The same can be done for sports articles. While it would certainly be easier to look at the author’s name right underneath the title than to perform a statistical analysis, I thought it would be fun anyway.
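As a toy illustration of the word-frequency idea (this is not Mosteller and Wallace's actual model, and the rates below are illustrative placeholders, not values from the 1964 study): Hamilton was famously fonder of "upon" than Madison, so a Poisson likelihood over a marker word's count can be turned into a posterior over candidate authors.

```python
import math

# Illustrative placeholder rates: uses of "upon" per 1000 words.
# (Not the actual figures from Mosteller and Wallace.)
RATE_PER_1000 = {"Hamilton": 3.0, "Madison": 0.2}

def log_likelihood(author, count, n_words):
    """Poisson log-likelihood of seeing `count` occurrences in `n_words` words."""
    lam = RATE_PER_1000[author] * n_words / 1000
    return count * math.log(lam) - lam - math.lgamma(count + 1)

def posterior(count, n_words):
    """Posterior over authors, assuming a 50/50 prior."""
    logs = {a: log_likelihood(a, count, n_words) for a in RATE_PER_1000}
    m = max(logs.values())
    weights = {a: math.exp(v - m) for a, v in logs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# A 2000-word essay that never uses "upon" looks far more like Madison.
print(posterior(count=0, n_words=2000))
```

The real study pooled evidence across many such function words, but the mechanics are the same: each word's observed rate shifts the odds toward one author or the other.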
Math teachers have been using the “German Tank Problem” for a while to teach estimators. It goes something like this.
The Allies capture five German tanks. Suppose that the serial numbers on the tanks are 15, 23, 59, 83, and 109. Provide an estimate of the number of tanks that were produced.
You can see that your estimate would be a great deal lower than it would be if the serial numbers were 1015, 2394, and 9438.
The story and accompanying math problem were written about in the Guardian. My favorite line, which turned out to be true, was this:
The statisticians believed that the Germans, being Germans, had logically numbered their tanks in the order in which they were produced.
It turns out that the statisticians were spot on:
By using this formula, statisticians reportedly estimated that the Germans produced 246 tanks per month between June 1940 and September 1942…
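The standard frequentist answer (a sketch; the article doesn't say exactly which variant the wartime statisticians used) is the minimum-variance unbiased estimator N_hat = m * (1 + 1/k) - 1, where m is the largest serial number observed and k is the sample size:

```python
# Minimum-variance unbiased estimator for the German Tank Problem:
# N_hat = m * (1 + 1/k) - 1, with m the largest observed serial
# number and k the number of tanks captured.
def estimate_total(serials):
    k = len(serials)
    m = max(serials)
    return m * (1 + 1 / k) - 1

print(estimate_total([15, 23, 59, 83, 109]))  # small serials -> small estimate
print(estimate_total([1015, 2394, 9438]))     # large serials -> much larger estimate
```

Intuitively, the maximum underestimates the total (the largest serial you happened to see is rarely the last tank built), and the 1/k term pushes it back up by the average gap between observations.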
I’d like to see density estimates added to the histograms, but this is interesting.
There are currently 174 books on my Amazon wishlist that I could order directly from Amazon. (My wishlist has a total of 195 books, but 21 are only available from other sellers.) Total price is approximately $3,549 (I rounded all prices to whole dollars), for a mean of approximately $20 per book.
But the median price of a book on my wishlist is (again to the nearest whole dollar) $16; the difference between the median and the mean is a hint that the distribution is skewed. And there are actually two peaks — one centered on $10 and one centered on $16-17. The distribution looks like this:
I’ve cut off the histogram at $100, which omits Mitchell’s Machine Learning at a list price of $168.16. Here’s a zoomed-in version omitting the 23 most expensive (all those over $30):
The two peaks are easy to explain: paperbacks and hardcovers, respectively. The…
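The mean-median gap as a hint of skew is easy to see with toy numbers. These are made up to mimic the shape described above (a paperback peak, a hardcover peak, and a long right tail), not the actual wishlist:

```python
from statistics import mean, median

# Made-up prices: a peak near $10 (paperbacks), a peak near $16-17
# (hardcovers), and a few expensive technical books in the tail.
prices = [9, 9, 10, 10, 10, 11, 16, 16, 16, 17, 17, 17, 25, 30, 45, 168]

print(mean(prices))    # pulled upward by the tail
print(median(prices))  # barely notices the tail
```

A handful of expensive outliers drags the mean well above the median, which is exactly the pattern the wishlist shows.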
What a great quote: “As an employer I want the best prepared and qualified employees. I could care less if the source of their education was accredited by a bunch of old men and women who think they know what is best for the world. I want people who can do the job. I want the best and brightest. Not a piece of paper.” -Mark Cuban
This is what I see when I think about higher education in this country today:
Remember the housing meltdown? Tough to forget, isn’t it? The formula for the housing boom and bust was simple: a lot of easy money lent to buyers who couldn’t afford what they were borrowing. That money was then spent on homes with the expectation that the price of the home would go up and it could easily be flipped or refinanced at a profit. Who cared if you couldn’t afford the loan? As long as prices kept going up, everyone was happy. And prices kept going up. And as long as prices kept going up, real estate agents kept selling homes and finding money for buyers.
Until the easy money stopped. When easy money stopped, buyers couldn’t sell. They couldn’t refinance. First sales slowed, then prices started falling…
“The serious point of the talk, though, is that everyone should learn some computer science, preferably in the context of intellectually interesting real-world applications.”
Robert Sedgewick has the slides for a talk, Algorithms for the Masses on his web site.
My favorite slide is the one titled “O-notation considered harmful” — Sedgewick observes that it’s more useful to say that the running time of an algorithm is ~aN^c (and back this up with actual evidence from running the algorithm) than to have a theorem that it’s O(N^c) (based on a model of computation that may or may not hold in practice).
The serious point of the talk, though, is that everyone should learn some computer science, preferably in the context of intellectually interesting real-world applications. This is what Sedgewick is doing in his Princeton course and in his book with Kevin Wayne, Algorithms, 4th edition, which I confess I have not read. There’s a Coursera course, in six-week parts, starting in August and November respectively. For a lot of…
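The tilde-notation point can be seen empirically with a doubling experiment: time an algorithm at N and 2N, and the base-2 log of the time ratio estimates the exponent c. A minimal sketch (my illustration, with a deliberately quadratic stand-in workload; Sedgewick's own materials use Java):

```python
import math
import time

def quadratic_work(n):
    # Stand-in workload with ~ a*N^2 running time: all pairwise sums.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i + j
    return total

def estimate_exponent(f, n):
    """Doubling test: time f at n and 2n; log2 of the ratio estimates c."""
    t0 = time.perf_counter()
    f(n)
    t1 = time.perf_counter()
    f(2 * n)
    t2 = time.perf_counter()
    return math.log2((t2 - t1) / (t1 - t0))

print(estimate_exponent(quadratic_work, 1000))  # should come out near 2
```

Repeating the experiment at several sizes and checking that the ratio stabilizes is the "actual evidence" Sedgewick is asking for, as opposed to a worst-case bound that may never bind.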
What You're Doing Is Rather Desperate
At any R Q&A site, you’ll frequently see an exchange like this one:
Q: How can I use a loop to […insert task here…] ?
A: Don’t. Use one of the apply functions.
So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions.
If you fire up your R console, type “??apply” and scroll down to the functions in the base package, you’ll see something like this:
Let’s examine each of those.
1. apply
Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.”
OK – we know about vectors/arrays and functions, but what are these “margins”? Simple: either the rows (1), the columns (2) or both (1:2). By “both”, we mean “apply the…
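The original post is about R, but for Python readers, numpy's `apply_along_axis` is a rough analogue of `apply` with a margin (my comparison, not the post's — note that R's margin numbering and numpy's axis numbering line up here, but R fills matrices column-major while this example is row-major):

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)  # a 2x3 matrix: [[1, 2, 3], [4, 5, 6]]

# R: apply(m, 1, sum) -- apply the function over rows (margin 1)
row_sums = np.apply_along_axis(np.sum, 1, m)
# R: apply(m, 2, sum) -- apply the function over columns (margin 2)
col_sums = np.apply_along_axis(np.sum, 0, m)

print(row_sums)  # [ 6 15]
print(col_sums)  # [5 7 9]
```

For simple reductions like sums you'd write `m.sum(axis=...)` directly; `apply_along_axis` earns its keep, like R's `apply`, when the function is arbitrary.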
People who want to learn the very basics of R may find these videos made by some Berkeley grad students useful.
Here’s a fun fact. NHL first-round winners went 45-54 in regular-season shootouts. First-round losers went 63-43. Here are the matchups (higher regular-season point total first, shootout record in parentheses, series winner in bold):
- **Rangers** (4-5) vs. Ottawa (6-4)
- Bruins (9-3) vs. **Capitals** (4-4)
- **Devils** (12-4) vs. Panthers (6-11)
- Penguins (9-3) vs. **Flyers** (4-7)
- Canucks (8-7) vs. **Kings** (6-9)
- **Blues** (4-10) vs. Sharks (9-5)
- Blackhawks (7-7) vs. **Coyotes** (6-10)
- **Predators** (5-5) vs. Red Wings (9-3)
So, the team with the lower shootout win percentage won seven of the eight series (Devils-Panthers was the only exception). The team with the higher point total won only four of eight (Rangers, Devils, Blues, and Predators), in part because good shootout records inflated some teams’ point totals. Why do we still have shootouts again?
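The seven-of-eight claim is easy to double-check. A quick sketch (records transcribed from the list above; series winners per the results):

```python
# (higher-point team, its shootout W-L, opponent, opponent's W-L, series winner)
series = [
    ("Rangers",    (4, 5),  "Ottawa",    (6, 4),  "Rangers"),
    ("Bruins",     (9, 3),  "Capitals",  (4, 4),  "Capitals"),
    ("Devils",     (12, 4), "Panthers",  (6, 11), "Devils"),
    ("Penguins",   (9, 3),  "Flyers",    (4, 7),  "Flyers"),
    ("Canucks",    (8, 7),  "Kings",     (6, 9),  "Kings"),
    ("Blues",      (4, 10), "Sharks",    (9, 5),  "Blues"),
    ("Blackhawks", (7, 7),  "Coyotes",   (6, 10), "Coyotes"),
    ("Predators",  (5, 5),  "Red Wings", (9, 3),  "Predators"),
]

def pct(record):
    w, losses = record
    return w / (w + losses)

# Count the series won by the team with the worse shootout percentage.
lower_pct_wins = sum(
    1 for a, ra, b, rb, winner in series
    if (pct(ra) < pct(rb) and winner == a) or (pct(rb) < pct(ra) and winner == b)
)
print(lower_pct_wins, "of", len(series))  # 7 of 8
```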
A recent post on the Junkcharts blog looked at US weather data and the importance of explaining scales (which in this case went up to 118). Ultimately, it turns out that 118 is the rank of the data compared to the previous 117 years of data (in ascending order, so that 118 is the highest). At the end of the post, the author writes:
I always like to explore doing away with the unofficial rule that says spatial data must be plotted on maps. Conceptually I’d like to see the following heatmap, where a concentration of red cells at the top of the chart would indicate extraordinarily hot temperatures across the states. I couldn’t make this chart because the NOAA website has this insane interface where I can only grab the rank for one state for one year one at a time. But you get the gist of the concept.
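The heatmap described above is just a states-by-years matrix of ranks. A sketch with made-up data (random numbers standing in for NOAA's per-state temperature series; only the structure matters):

```python
import random

random.seed(42)

# Made-up stand-in for NOAA's data: one temperature series per state,
# 118 years each. The heatmap cell for (state, year) would be the rank
# of that year's value within the state's own 118-year record.
states = ["AL", "AZ", "AR"]  # a few states for illustration
years = list(range(1895, 2013))  # 118 years
temps = {s: [random.gauss(60, 2) for _ in years] for s in states}

def rank_within_state(state, year):
    """1 = coolest year on record for that state, 118 = hottest."""
    series = temps[state]
    value = series[years.index(year)]
    return sum(1 for v in series if v <= value)

ranks = {s: [rank_within_state(s, y) for y in years] for s in states}
# Ties aside, each state's column of ranks is a permutation of 1..118,
# so a band of high ranks across states in recent rows = widespread heat.
print(sorted(ranks["AL"]) == list(range(1, 119)))
```

With the real data, plotting `ranks` as a grid (years down, states across) gives exactly the chart Junkcharts wanted: a concentration of high-rank cells in recent years would show extraordinary heat across states at a glance.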
In this spirit…

