Category Archives: Baseball

MLB Rankings – 5/7/2012

StatsInTheWild MLB rankings as of May 7, 2012 at 8am.

Team Rank Change Record ESPN TeamRankings.com
Baltimore 1 ↑5 19-9 6 1
Texas 2 ↓1 18-10 2 2
St. Louis 3 ↓1 17-11 5 6
Atlanta 4 ↓1 18-11 7 5
Tampa Bay 5 ↑3 19-10 1 3
Washington 6 ↓1 18-10 4 4
Toronto 7 ↑2 16-13 8 7
LA Dodgers 8 ↓4 18-10 3 8
NY Yankees 9 ↓2 15-13 9 9
Miami 10 ↑10 14-14 22 17
Houston 11 ↑12 13-15 23 15
Cincinnati 12 ↑3 14-13 12 11
Philadelphia 13 ↑3 14-15 14 20
Arizona 14 ↓1 14-15 11 18
Cleveland 15 ↑2 15-11 13 10
San Francisco 16 ↓6 14-14 16 21
NY Mets 17 ↓5 15-13 15 12
Chicago WSox 18 ↓4 13-15 19 14
Detroit 19 ↑2 14-13 10 13
Oakland 20 ↑2 15-14 21 16
Boston 21 ↓10 11-16 24 22
Seattle 22 ↓4 13-17 26 19
LA Angels 23 ↑3 12-17 18 27
Colorado 24 ↓5 12-15 17 25
Pittsburgh 25 ↓1 12-16 25 23
Milwaukee 26 ↓1 12-16 20 26
Chicago Cubs 27 11-17 27 24
San Diego 28 9-20 29 29
Kansas City 29 9-18 28 28
Minnesota 30 7-20 30 30
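
As a rough check on how closely these rankings track the other two systems, one can compute the Spearman rank correlation between the SITW column and each of the others. A minimal sketch, using only the ranks in the table above (the classic formula applies here because each column is a permutation of 1 to 30 with no ties):

```python
# Spearman rank correlation between the SITW ranks (row order, 1-30)
# and the ESPN / TeamRankings.com columns from the 5/7/2012 table.
# With no ties, rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)).

sitw = list(range(1, 31))  # SITW rank is just the row order
espn = [6, 2, 5, 7, 1, 4, 8, 3, 9, 22, 23, 12, 14, 11, 13, 16, 15, 19,
        10, 21, 24, 26, 18, 17, 25, 20, 27, 29, 28, 30]
tr   = [1, 2, 6, 5, 3, 4, 7, 8, 9, 17, 15, 11, 20, 18, 10, 21, 12, 14,
        13, 16, 22, 19, 27, 25, 23, 26, 24, 29, 28, 30]

def spearman(a, b):
    """Spearman rho for two untied rank vectors of equal length."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))

print("SITW vs ESPN:", round(spearman(sitw, espn), 3))
print("SITW vs TR:  ", round(spearman(sitw, tr), 3))
```

All three systems agree quite strongly; the disagreement is concentrated in the middle of the table (Miami, Houston, Boston, Detroit).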

Past Rankings:

4/30/2012

4/23/2012

4/16/2012

4/13/2012

Cheers.

Every U.S. Sports Championship, In Convenient Infographic Form


Cheers.

MLB Rankings – 4/30/2012

StatsInTheWild MLB rankings as of April 30, 2012 at 8am.

Team Rank Change Record ESPN TeamRankings.com
Texas 1 16-6 1 1
St. Louis 2 ↑2 14-8 2 5
Atlanta 3 14-8 7 3
LA Dodgers 4 ↑1 16-6 4 2
Washington 5 ↓3 14-8 5 6
Baltimore 6 ↑7 14-8 12 9
NY Yankees 7 ↓1 12-9 6 7
Tampa Bay 8 ↑3 14-8 3 4
Toronto 9 ↓2 12-10 9 14
San Francisco 10 ↑5 12-10 13 15
Boston 11 ↑16 10-11 10 10
NY Mets 12 13-9 16 11
Arizona 13 ↑4 11-11 15 19
Chicago WSox 14 ↓6 11-11 19 12
Cincinnati 15 ↑9 11-11 17 16
Philadelphia 16 ↑2 10-12 11 22
Cleveland 17 ↓3 11-9 14 23
Seattle 18 ↑4 11-12 21 13
Colorado 19 ↓3 10-11 18 18
Miami 20 ↓10 8-13 23 25
Detroit 21 ↓12 11-11 8 8
Oakland 22 ↓2 11-12 22 21
Houston 23 ↓4 8-14 26 24
Pittsburgh 24 ↑1 9-12 25 17
Milwaukee 25 ↓2 10-12 20 20
LA Angels 26 ↓5 7-15 24 28
Chicago Cubs 27 ↑2 8-14 28 27
San Diego 28 ↓2 7-16 29 26
Kansas City 29 ↓1 6-15 27 30
Minnesota 30 6-15 30 29

Past Rankings:

4/30/2012

4/23/2012

4/16/2012

4/13/2012

Cheers.

Imperfect Game: The next start after a perfect game

Last week Phillip Humber became the 21st pitcher to throw a perfect game in Major League Baseball. As I write this, he has just been knocked out of his next start after allowing 9 earned runs in five innings. This got me wondering how other pitchers have performed in the start following their perfect game. Obviously, things have to get worse in the next start, but they seem to get much worse. Using baseball-reference.com, I looked up game logs for the next start after each perfect game.

Here are the results (in a Google Doc).

For the 17 games I could find box scores for, including Humber’s next start, these pitchers have a combined record of 5-8 (originally said 4-9) with 4 no decisions, an ERA of 5.67 and a WHIP of 1.466. That’s certainly not perfect.  Terrible might be a better word for it.  I guess it’s hard to even be average once you’ve experienced perfection.
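
For reference, the two aggregate stats quoted above follow the standard definitions: ERA is 9 times earned runs divided by innings pitched, and WHIP is walks plus hits divided by innings pitched. A quick sketch, using Humber's 9 earned runs over five innings from above as a check (the WHIP inputs in the last line are placeholders, since the combined walk and hit totals aren't given in the post):

```python
# Standard pitching-rate formulas behind the aggregate numbers above:
#   ERA  = 9 * earned_runs / innings_pitched
#   WHIP = (walks + hits) / innings_pitched

def era(earned_runs, innings):
    return 9 * earned_runs / innings

def whip(walks, hits, innings):
    return (walks + hits) / innings

print(era(9, 5))      # Humber's next start: 16.2
print(whip(2, 7, 5))  # placeholder inputs, for illustration only
```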

Cheers.


Jigger Statz

StatsInTheWild is proud to sponsor the Jigger Statz page on baseball-reference.com.

Cheers.

MLB Rankings – 4/23/2012

StatsInTheWild MLB rankings as of April 22, 2012 at 9pm.

Team Rank Change Record ESPN TeamRankings.com
Texas 1 ↑1 13-3 1 1
Washington 2 ↑1 12-4 6 6
Atlanta 3 ↑8 10-6 13 3
St. Louis 4 11-5 3 8
LA Dodgers 5 ↓4 12-4 5 5
NY Yankees 6 ↑4 9-6 4 9
Toronto 7 ↓1 9-6 11 13
Chicago WSox 8 ↑1 9-6 9 4
Detroit 9 ↓4 10-6 2 2
Miami 10 ↑9 7-8 21 20
Tampa Bay 11 ↑13 9-7 8 11
NY Mets 12 ↓3 8-6 10 7
Baltimore 13 ↓1 9-7 18 12
Cleveland 14 ↑3 8-6 23 10
San Francisco 15 ↑1 7-7 14 16
Colorado 16 ↑3 8-7 19 24
Arizona 17 ↓10 8-8 7 21
Philadelphia 18 ↓3 7-9 12 17
Houston 19 ↓1 6-10 26 22
Oakland 20 ↑2 8-9 24 26
LA Angels 21 6-10 17 27
Seattle 22 7-10 22 15
Milwaukee 23 ↑4 7-9 16 14
Cincinnati 24 ↓1 7-9 20 18
Pittsburgh 25 6-9 28 19
San Diego 26 ↑3 5-12 30 29
Boston 27 ↓14 4-10 15 23
Kansas City 28 ↓2 3-12 25 30
Chicago Cubs 29 ↓1 4-12 27 28
Minnesota 30 5-11 29 25


Cheers.

MLB Rankings – 4/16/2012

StatsInTheWild MLB rankings as of April 16, 2012 at 8am.

Team Rank Change Record ESPN TeamRankings.com
LA Dodgers 1 ↑1 9-1 5 4
Texas 2 ↑1 8-2 1 2
Washington 3 ↑1 7-3 6 5
St. Louis 4 ↑1 7-3 3 6
Detroit 5 ↓4 6-3 2 3
Toronto 6 ↑1 5-4 11 11
Arizona 7 ↓1 6-3 7 9
NY Mets 8 ↑3 6-3 10 7
Chicago WSox 9 ↑1 5-3 9 1
NY Yankees 10 ↓2 5-4 4 13
Atlanta 11 ↑9 5-4 13 12
Baltimore 12 ↑1 5-4 18 16
Boston 13 ↑18 4-5 15 8
Seattle 14 ↑2 6-5 22 14
Philadelphia 15 ↓1 4-5 12 19
San Francisco 16 ↓2 4-5 14 18
Cleveland 17 ↑12 4-4 23 15
Houston 18 ↓3 4-5 26 24
Miami 19 ↑2 4-6 21 23
Colorado 20 ↓1 4-5 19 21
LA Angels 21 ↑3 3-6 17 29
Oakland 22 4-6 24 27
Cincinnati 23 ↑3 4-6 20 17
Tampa Bay 24 ↓15 4-5 8 10
Pittsburgh 25 ↓2 3-6 28 22
Kansas City 26 ↓14 3-6 25 28
Milwaukee 27 ↓10 4-6 16 20
Chicago Cubs 28 ↓3 3-7 27 26
San Diego 29 ↓2 2-8 30 25
Minnesota 30 ↓2 2-7 29 30


Cheers.

MLB Rankings – April 13, 2012 (Friday 13th edition)

StatsInTheWild MLB rankings as of Friday, April 13, 2012 at noon.

Some interesting things:

  • SITW, ESPN, and TR all have the Tigers ranked number 1 and the Yankees ranked number 8.
  • My top 10 has all of the same teams as TeamRankings.com except for one: Toronto.
  • My top 10 has all of the same teams as ESPN except for two: the LA Dodgers and the Chicago White Sox.
Team Rank Change Record ESPN TeamRankings.com
Detroit 1 5-1 1 1
LA Dodgers 2 6-1 11 3
Texas 3 5-2 4 5
Washington 4 5-2 9 7
St. Louis 5 5-2 6 4
Arizona 6 5-1 3 6
Toronto 7 4-2 7 13
NY Yankees 8 3-3 8 8
Tampa Bay 9 4-2 2 2
Chicago WSox 10 3-2 22 9
NY Mets 11 4-2 17 11
Kansas City 12 3-3 21 18
Baltimore 13 3-3 20 21
Philadelphia 14 3-3 10 14
Houston 15 3-3 30 26
Seattle 16 4-4 19 12
Milwaukee 17 4-3 12 10
San Francisco 18 2-4 16 20
Colorado 19 2-4 18 29
Atlanta 20 2-4 23 24
Miami 21 2-5 14 25
Oakland 22 3-4 27 22
Pittsburgh 23 2-4 24 19
LA Angels 24 2-4 5 28
Chicago Cubs 25 2-5 28 23
Cincinnati 26 3-4 13 16
San Diego 27 2-5 29 15
Minnesota 28 2-4 26 30
Cleveland 29 1-4 25 27
Boston 30 1-5 15 17


Cheers.

JSM (in the wild)

I’ve been in Vancouver the past week for the Joint Statistical Meetings (JSM). Here is a collection of my thoughts and comments from the few days I was at the conference.

On Monday I went to the session on Survey Research Methods and saw Meena Khare of the National Center for Health Statistics (NCHS) and Laura Zayatz of the United States Census Bureau give talks. They both spoke about the measures their institutions take before releasing data to the public. The NCHS looks for uncommon combinations of variables that could possibly be used to re-identify individuals in the data. Both organizations first remove obvious identifiers and then go on to make the released data more private. For instance, if the number of observations with a given combination of variables in the released data set is n and the number of observations with that combination in the full population is N, they would consider any combination of variables where n/N < .33 at risk for disclosure. The Census Bureau has used something called data swapping to protect public release data sets in the last two Censuses (2000 and 2010). Along with data swapping, the Census will also be using partially synthetic data to protect individual privacy in the 2010 public release data.

Several things strike me about this.
-The methods these organizations use to protect confidentiality will certainly increase privacy compared to a release of raw data; however, there doesn’t seem to be any way to know that what is being done provides “enough” privacy.
-It’s clear that many different government organizations face issues requiring disclosure limitation techniques; however, each organization seems to be creating its own rules, with limited discussion between organizations about a standard policy for data sharing.
-There doesn’t actually seem to be any definition of what counts as a disclosure. For instance, if government data is released and I discover through some technique that someone definitely has HIV, then clearly a disclosure has taken place. However, if I use the same data to discover that someone definitely does not have HIV, a disclosure has still taken place, but the consequence is much less damaging. Furthermore, consider a situation where, prior to the data release, I know a particular individual has a 50 percent chance of having HIV. After the data release, I can infer that there is a 99 percent chance that they have HIV. Clearly, I would consider this a disclosure. But what if the probability shifts only from 50 percent (pre-release) to 75 percent (post-release), or from 50 percent to 55 percent? At what point is “too much” information being released? This issue seems to receive less attention than it warrants.
-Finally, I believe the ultimate solution to the disclosure problem is a careful combination of policy and disclosure limitation techniques. Policy issues include defining how much privacy must be maintained by a given technique, as well as legal consequences for knowingly disclosing private information. The field of statistics, in turn, has an obligation to provide continually improving disclosure limitation techniques, along with metrics for measuring the privacy a given technique provides.
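
The n/N rule described above can be sketched in a few lines. This is just an illustrative reading of the rule as related in the talk, not NCHS's actual implementation, and the records below are toy data:

```python
from collections import Counter

# For each combination of quasi-identifying variables, n is its count in
# the released sample and N its count in the population; combinations with
# n/N < 0.33 are flagged as at risk of disclosure (rule as described above).

RISK_THRESHOLD = 0.33

def at_risk_combinations(sample, population, keys):
    """Return variable combinations flagged as at risk under n/N < threshold."""
    n = Counter(tuple(rec[k] for k in keys) for rec in sample)
    N = Counter(tuple(rec[k] for k in keys) for rec in population)
    return [combo for combo, c in n.items() if c / N[combo] < RISK_THRESHOLD]

# Toy data: 12 people in the population, 3 of them in the released sample.
population = [{"age": 30, "zip": "01002"}] * 10 + [{"age": 45, "zip": "01002"}] * 2
sample = [{"age": 30, "zip": "01002"}] * 2 + [{"age": 45, "zip": "01002"}]

print(at_risk_combinations(sample, population, ("age", "zip")))
```

Here the (30, "01002") combination is flagged (2/10 = 0.2 is below the threshold) while (45, "01002") is not (1/2 = 0.5).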

Later on Monday, I saw the tail end of the talk by David Purdy titled “Statisticians: 3, Computer Scientists: 35”. The abstract for the talk was:
“John Tukey and Leo Breiman warned us that a day would come when statistics would need to focus more on computing, or risk losing good students to computer science. The Netflix Prize provides many examples of how our field needs to do more.

In the top 2 teams, participants with a computer science background vastly outnumbered those with a statistics background. There are a number of lessons that the field of statistics can learn from the fact that undergraduates in CS were well equipped to compete, while statisticians at all levels were not well prepared to implement advanced algorithms.

In this talk, I will address methodological issues arising with such a large, sparse dataset, how it demands serious computational talents, and where there is ample room for statistics to make contributions.”

I only saw the end of the talk, but I feel like I got the point. He noted that statistics programs need to expose students to more aspects of computing. One quote from his talk that particularly shocked me was from a prominent statistician referring to the Netflix prize data set (I’m paraphrasing): “I can’t do anything with the data, there is just too much of it.” (If anyone knows the actual quote, I would love to have it.) Too little data may often be a problem, but too much data should be a blessing rather than a curse.

When he talks about computing, he is referring to implementing complex algorithms to analyze the data; in my experience, however, people struggle with simply managing data of this size. This should be a simple problem to deal with, but I have both had it happen to myself and seen it happen to others. When I was in grad school working toward my master’s degree, straight out of undergrad, we were given a problem in a consulting class with a “large” amount of data (several thousand observations, which seemed large to someone with no experience managing data). We (my group) knew exactly what we wanted to do with the data, but we were unable to organize the data in a way that made it useful for analysis. So we did nothing. The moral of the story is that, while we were taught the techniques for analyzing the data well, we were never taught any useful data manipulation techniques, rendering our statistical educations useless. It was not until I got my first job that I learned, out of necessity, data management techniques including SAS data steps, SAS macros, and SQL.
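
The kind of data-management step I mean here, collapsing raw records into a summary you can actually analyze, can be as simple as a SQL GROUP BY. A toy sketch via Python's built-in sqlite3 (the table and game records are made up for illustration):

```python
import sqlite3

# Load per-game records into an in-memory table, then aggregate them into
# per-team summaries before analysis -- the basic reshaping step that the
# consulting-class story above is about. Data here is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE games (team TEXT, runs INTEGER)")
con.executemany("INSERT INTO games VALUES (?, ?)",
                [("BAL", 5), ("BAL", 2), ("TEX", 7), ("TEX", 3), ("TEX", 4)])

# One GROUP BY turns raw games into an analyzable team-level table.
rows = con.execute("""
    SELECT team, COUNT(*) AS games, AVG(runs) AS avg_runs
    FROM games GROUP BY team ORDER BY team
""").fetchall()
print(rows)
```

The same operation is a one-liner in SAS PROC SQL or a data step with BY-group processing; the point is that the reshaping, not the statistics, is where untrained students get stuck.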

When I returned to school to pursue a Ph.D., I saw many students with no work experience struggling through the same data management problems I had: the same old “I know exactly what I want to do, but I can’t organize the data.” Oftentimes in grad classes, books or teachers will describe a data set as “large” when it has several hundred or several thousand observations. This seems like inadequate preparation for working in industry: my first jobs routinely dealt with data sets of millions of observations, and, later, a summer consulting project involved billions of observations.

Currently, there are no required computing or data management classes in my statistics Ph.D. program. I think every statistics program should require a class covering data management issues and at least a solid introduction to programming.

After David Purdy’s talk, Chris Volinsky and he took questions. One interesting question that came up was about a second Netflix prize. Chris noted, however, that this had to be cancelled because of privacy concerns. I’ve written before (or at least posted on Twitter) about some researchers who claim to have de-anonymized the data from the Netflix prize, and, as a result, a lawsuit has been filed (Netflix’s Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder). Whether you agree with canceling the prize over privacy concerns or not, it is clear that disclosure limitation is currently a big issue that cannot be ignored.

On Tuesday, I went to one of the sports research sessions and saw two talks before leaving to see a talk about partially synthetic data for longitudinal data. The first speaker, Shane Jensen, spoke about evaluating fielders’ abilities in baseball using a method he proposed called Spatial Aggregate Fielding Evaluation (SAFE). The previous link explains how the evaluation works and gives measures of performance for each player. Probably the most shocking result of his work is that, averaged over 2002-2008, SAFE rated Derek Jeter as the worst shortstop among those who met the minimum number of balls in play (BIP). Alternatively, SAFE rates Alex Rodriguez as the second-best shortstop over the same period, even though he now plays third base to allow Jeter to play shortstop.

The next speaker was Ben Baumer, a statistical analyst for the Mets (and a native of the 413 area code). He spoke about his paper, “Using Simulation to Estimate the Impact of Baserunning Ability in Baseball”. One of the interesting things I took away from his talk is his claim that a player’s speed in breaking up a double play is one of the important aspects of baserunning, yet it is often largely or completely ignored when evaluating a player’s baserunning ability.

Before I end, I’d like to say thanks to all of the speakers I saw this past week. Finally, I’ll leave you with a view of Vancouver from the convention center.

Cheers.