Category Archives: Sports

2011 NFL Season Preview

Republican presidential debate, Obama addressing the nation, AND the start of the NFL season. It’s almost too much to handle.

Before we get to any NFL predictions, I’ll make a presidential prediction.  Mitt Romney is going to win the Republican nomination.  I don’t care what Perry’s poll numbers are right now.  I don’t think Republicans will vote for a guy whose first Google auto-complete term is “gay”.  (Maybe I underestimate Republicans’ tolerance, but then again, maybe I don’t.)

Anyway, I’ve been toying with the idea of simulating the NFL season for a little while now.  (I did a bit of this last year, later in the season.)  This year I’ve done it before any games have been played, so we’ll see how this model works out.  It’s a pretty simple model and uses only data from the 2010-2011 regular season and playoffs.  Using that data, I fit a logistic regression model for the probability that one team beats another team.  Then I simulated the upcoming season 5000 times.
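To make the idea concrete, here is a minimal sketch of that kind of pipeline in R. It is not the actual SITW model: the four teams, their strengths, the "last season" results, and the round-robin schedule are all made up, and the team-indicator (Bradley-Terry style) design matrix is only one plausible way to set up a logistic regression for one team beating another.

```r
set.seed(2011)

# Hypothetical toy league: 4 teams with made-up "true" strengths.
teams <- c("A", "B", "C", "D")
strength <- c(A = 1.0, B = 0.5, C = 0.0, D = -0.5)

# Fake "last season": every ordered (home, away) pair plays 10 times.
games <- expand.grid(home = teams, away = teams, stringsAsFactors = FALSE)
games <- games[games$home != games$away, ]
games <- games[rep(seq_len(nrow(games)), each = 10), ]
p_true <- plogis(strength[games$home] - strength[games$away])
games$home_win <- rbinom(nrow(games), 1, p_true)

# Design matrix: +1 for the home team, -1 for the away team.
X <- matrix(0, nrow(games), length(teams), dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(games)), match(games$home, teams))] <- 1
X[cbind(seq_len(nrow(games)), match(games$away, teams))] <- -1

# Logistic regression without an intercept; team D is the reference (strength 0).
fit <- glm(games$home_win ~ X[, -4] - 1, family = binomial)
beta <- c(unname(coef(fit)), 0)
names(beta) <- teams

# Hypothetical upcoming schedule: each pair meets twice (home and away).
schedule <- expand.grid(home = teams, away = teams, stringsAsFactors = FALSE)
schedule <- schedule[schedule$home != schedule$away, ]
p_hat <- plogis(beta[schedule$home] - beta[schedule$away])

# Simulate the season 5000 times and count wins per team in each run.
n_sims <- 5000
wins <- matrix(0, n_sims, length(teams), dimnames = list(NULL, teams))
for (s in seq_len(n_sims)) {
  home_won <- rbinom(nrow(schedule), 1, p_hat) == 1
  winners <- ifelse(home_won, schedule$home, schedule$away)
  wins[s, ] <- tabulate(match(winners, teams), nbins = length(teams))
}

colMeans(wins)  # mean simulated wins per team
```

Each row of `wins` is one simulated season, so playoff or title probabilities would come from applying the seeding rules to each row and averaging.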

Let’s begin with some pictures.  The first has nothing to do with the simulations, but it’s interesting.  It also gives me a chance to quote myself.  So, here is a plot of some Chernoff faces based on the final 2010 NFL regular season team statistics.  (I posted about this before here)

My comments from before:

The face represents the offense and the defense is represented by hair. The size of the nose indicates sacks, the ears indicate turnovers (ear width is interceptions; ear height is forced fumbles).  The eyes indicate penalties and, finally, the size of the mouth indicates wins with a smiling face if the team made the playoffs (a really nice touch, if you ask me.)  The face at the bottom right indicates the league leader.

Some observations on the NFL faces:  The two Super Bowl teams last year (Pittsburgh and Green Bay) are both located at the bottom of the graph and their faces look very, very similar.  San Diego looks similar to both Green Bay and Pittsburgh (similar face, nose, eyes, and hair), but the big differences are the ears and, of course, the San Diego face is frowning.  Another thing that pops out at me is how similar Houston and New England look to each other.  They have very similar face shape, eyes, and hair.  The big differences are the nose and ears (sacks and turnovers).

Here is a graph with 32 side by side boxplots representing each of the NFL teams.  Each boxplot displays the distribution of the predicted number of wins for each team.  The teams are in order of the SITW power ranking (which means it’s mostly made up).  I have also included a red W for how many wins the team had last year, a green dollar sign for the over-under betting line, and a blue P indicating whether or not the team made the playoffs last year.
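A plot like this can be sketched in a few lines of base R. Everything below is illustrative only: the three teams, their per-game win probabilities, last year's wins, and the betting lines are invented, and drawing each team's wins from an independent binomial is a simplification rather than the simulation described above.

```r
set.seed(1)

teams <- c("Team A", "Team B", "Team C")   # hypothetical teams
win_prob <- c(0.65, 0.50, 0.35)            # made-up per-game win probabilities

# 5000 simulated 16-game seasons per team (one column per team).
sim_wins <- sapply(win_prob, function(p) rbinom(5000, 16, p))
colnames(sim_wins) <- teams

last_year <- c(11, 8, 5)       # made-up wins from the previous season
over_under <- c(10.5, 8, 6)    # made-up over-under betting lines

# Side-by-side boxplots of simulated wins, with markers overlaid.
stats <- boxplot(sim_wins, ylab = "Simulated wins", ylim = c(0, 16))
points(seq_along(teams), last_year, pch = "W", col = "red")
points(seq_along(teams), over_under, pch = "$", col = "darkgreen")
```

Reordering the columns of `sim_wins` before calling `boxplot` is how the teams would be arranged by a power ranking.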

Now it’s time for my Super Bowl favorites table.  The first column lists the team, the second column lists my predicted odds to win the 2012 Super Bowl, and column three displays my predicted probability of each team making the playoffs.  One interesting thing to note in the first couple lines of this table is that Pittsburgh is more likely to win the Super Bowl than Baltimore, but Baltimore is more likely to make it to the playoffs than Pittsburgh.  This is a result of the NFL scheduling system.  Pittsburgh and Baltimore share the exact same schedule except for two games.  Those two differing games for Pittsburgh are New England and Kansas City, whereas those two games for Baltimore are the New York Jets and the San Diego Chargers.  So what is happening is that because Pittsburgh has the chance to play New England in the regular season, in the simulations, when they do make it to the playoffs, they are most often making the playoffs as a 1 seed.  Baltimore is making the playoffs more often, but they are rarely (relative to Pittsburgh) simulated to be a 1 seed.  Remember, this table is ordered by odds that a team wins the Super Bowl; it’s not ordered best to worst team.

Team  S.B. XLVI Odds  Prob(Make Playoffs)
New England  6.8  .7018
Pittsburgh  9.4  .6402
Atlanta  10  .7164
Baltimore  11  .7156
New York Jets  14  .5352
Chicago  16  .5842
Green Bay  16  .5402
Tampa Bay  21  .4636
Philadelphia  21  .449
New York Giants  24  .4294
New Orleans  28  .393
Indianapolis  37  .4328
Miami  46  .3174
Seattle  60  .3512
San Diego  60  .3254
Kansas City  63  .3596
Minnesota  64  .2902
Detroit  69  .3082
Jacksonville  75  .2952
Dallas  76  .2678
Oakland  78  .3274
Washington  88  .2438
Cincinnati  99  .2514
Cleveland  103  .2538
St. Louis  105  .2916
Tennessee  108  .2872
Buffalo  131  .208
San Francisco  146  .2826
Arizona  160  .2732
Houston  160  .2048
Denver  216  .1442
Carolina  555  .1156

Below are some over-under bets that I like.  It seems like a lot of times bettors forget that they are betting on the NFL.  Every single team can beat every other team.  (See Miami beating New England as 13.5 point underdogs a few years ago.)  Bettors overvalue good teams and undervalue bad teams.  My two favorite bets here are Green Bay and Cincinnati.  Green Bay had a great run through the playoffs last year, but they still only won 10 regular season games in 2010.  Add that to the fact that Aaron Rodgers is a concussion waiting to happen and winning 12 games seems like a difficult task.  Cincinnati has been blessed by the scheduling gods.  Not only did they finish fourth in their division last year, earning them games against Denver and Buffalo, they have also drawn the NFC West, giving them games against Arizona, Seattle, San Francisco, and St. Louis.  Then add to that two games against Cleveland and it’s not too hard to see 6+ wins in their future.  And think about this: they start their season against Cleveland, Denver, San Francisco, and Buffalo.  Is it that far-fetched that they start 4-0?  (Yes, it is that far-fetched.  Just saying is all….)

Team  Bet  Odds
Green Bay  Under 11.5  -145
San Diego  Under 10  +115
Minnesota  Under 10  +110
Houston  Under 9  +145
Philadelphia  Under 10.5  +120
Dallas  Under 9  -120
Cincinnati  Over 5.5  +135
Carolina  Over 4.5  even
Seattle  Over 6  +125
Buffalo  Over 5.5  -135
Oakland  Over 6.5  +110
New England  Under 11.5  -110

Other bets that intrigue me.

Team  Bet  Odds
Seattle  Win Division  +900
Oakland  Win Division  +700
Chicago  Win Division  +600
Washington  Win Division  +2000
Minnesota  Win Division  +1200
Kansas City  Win Division  +500
Baltimore  AFC Champs  +900
Atlanta  NFC Champs  +600
Tampa Bay  NFC Champs  +1500
Seattle  NFC Champs  +4500
Chicago  NFC Champs  +2000
Washington  NFC Champs  +4500
Atlanta  Super Bowl Champs  +1200
Seattle  Super Bowl Champs  +8000
Tampa Bay  Super Bowl Champs  +3000
Baltimore  Super Bowl Champs  +2000
Chicago  Super Bowl Champs  +4000
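One way to judge bets like these is to compare a simulated probability against the break-even probability implied by the line. The sketch below assumes the odds above are standard American (moneyline) odds, with "even" treated as +100; the helper name `implied_prob` is made up for illustration, and the conversion ignores the bookmaker's margin.

```r
# Convert American (moneyline) odds to the break-even win probability.
# Positive odds: profit on a 100 stake; negative odds: stake needed to win 100.
implied_prob <- function(odds) {
  ifelse(odds > 0, 100 / (odds + 100), -odds / (-odds + 100))
}

bets <- data.frame(
  bet  = c("Seattle win division", "Baltimore AFC champs", "Green Bay under 11.5"),
  odds = c(900, 900, -145)
)
bets$break_even <- implied_prob(bets$odds)
bets
```

A +900 line breaks even at 10 percent, so such a bet only has positive expected value if the model puts the event above that.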

And finally, let’s make some predictions that will ultimately prove to be way off.  But it is fun to try, so here is what the playoffs will look like.

The AFC.

Team  Seed  Mean wins
New England  1  10.116
Pittsburgh  2  9.88
Indianapolis  3  8.1634
Kansas City  4  7.7488
Baltimore  5  9.648
New York Jets  6  9.3434

I know, I know.  It’s boring and it’s exactly the same as last year’s AFC playoff teams, down to the seeds.  But wait until you see my NFC picks!

NFC

Team  Seed  Mean wins
Atlanta  1  9.4892
Chicago  2  8.9044
Philadelphia  3  8.5046
Seattle  4  7.6006
Green Bay  5  8.8728
Tampa Bay  6  8.7434

Ok.  Those weren’t that exciting either.  At least I made a stand with Tampa Bay, right?

And now for my Super Bowl prediction.  Based solely on the numbers, I am taking New England over Atlanta.  That’s wicked boring though.  So my gut is taking Tampa Bay over Baltimore 21-20.  And I’m still picking Mitt Romney.

Cheers.

Chernoff Faces from aplpack

I’ve been playing around with the faces function from the R package aplpack.  I haven’t used it in a while, and there are some features that are either new or that I never noticed before.  Color has been added to the faces, and you can now plot the faces at positions of your choosing on a scatterplot.  There is also the superfluously fantastic option of displaying the faces as Santa Claus.

Here are some of my examples:

Golf: Statistics from several of my friends collected via oobgolf.com.  (I’m SITW on the lower right.) The face is handicap, the mouth is scoring average, the eyes are average putts, the hair is the percentage of fairways hit, nose is greens in regulation (GIR), and ears are the total number of rounds you play. The faces are plotted with fairway percentage on the x-axis and GIR on the y-axis.

Santa_Golf: Same golf data with Santa option.

NFL2010: Final NFL regular season team statistics.  The face represents the offense and the defense is represented by hair. The size of the nose indicates sacks, the ears indicate turnovers (ear width is interceptions; ear height is forced fumbles).  The eyes indicate penalties and, finally, the size of the mouth indicates wins with a smiling face if the team made the playoffs (a really nice touch, if you ask me.)  The face at the bottom right indicates the league leader.

Some observations on the NFL faces:  The two Super Bowl teams last year (Pittsburgh and Green Bay) are both located at the bottom of the graph and their faces look very, very similar.  San Diego looks similar to both Green Bay and Pittsburgh (similar face, nose, eyes, and hair), but the big differences are the ears and, of course, the San Diego face is frowning.  Another thing that pops out at me is how similar Houston and New England look to each other.  They have very similar face shape, eyes, and hair.  The big differences are the nose and ears (sacks and turnovers).

 

Cheers.

##NFL CODE
library(aplpack)

##Reading the data (forward slashes, since "\S" is not a valid escape in an R string)
x<-read.csv("/StatsInTheWild/NFL2010.csv",header=TRUE)
##Row 33 is a placeholder that will hold the league leader
x[33,]<-x[32,]
x$abbr<-sort(c("NE","NYJ","Mia","Buf","Pit","Bal","Cle","Cin","Ind","Jac","Hou",
"Ten","KC","SD","Oak","Den","Phi","NYG","Dal","Was","Chi","GB","Det","Min","Atl"
,"NO","TB","Car","Sea","StL","SF","Ari","ZZ"))
x$abbr[27:28]<-c("SF","Sea")
x$abbr[33]<-"League Leader"
x$lab<-paste(x$abbr,x$W,sep=": ")
x$TOP<-as.numeric(substring(x$TOP.x,1,2))

##Playoff Teams: creating a playoff indicator
rows<-c(2,3,6,12,14,16,19,20,22,24,25,28)
x$playoffs<-rep(0,33)
x$playoffs[rows]<-1

##Finding the league leader in all variables
##(max for most columns, min for the defensive and penalty columns)
num<-sapply(x,is.numeric)
x[33,num]<-sapply(x[,num],max)
def<-c(6,22:23,26:29)
x[33,def]<-sapply(x[,def],min)
##Rebuilding the labels now that row 33 holds the league-leading win total
x$lab<-paste(x$abbr,x$W,sep=": ")

##Defining the names
names(x)[c(2,3)]<-c("Wins","Losses")
names(x)[c(13,14,15,16)]<-c("Off PPG","Off YPG","Off Pass","Off Rush")
names(x)[c(22,23)]<-c("Penalties","Pen Yards")
names(x)[c(26:29)]<-c("Def PPG","Def YPG","Def Pass","Def Rush")
names(x)[c(5:6)]<-c("Points For","Points Against")

pdf("/StatsInTheWild/NFL2010.pdf",width=15,height=10)

##Columns used for plotting
x<- x[order(x[,4]),]
plot.cols<-c(5,6)

##Offense = face, Defense = hair, penalty= eyes, Wins and playoffs = mouth, turnovers = ears
##Columns used for faces: which columns am i going to use for the data
col<-c(15,16,14,2,2,41,22,23,28,29,27,36,36,30,32)

##creating the faces without plotting them.
a<-faces(x[,col],labels=x$lab,face.type=1,plot=FALSE)

##creating text for the legend
g<-paste(a[[2]][,1],a[[2]][,2],sep=": ")

##building the plot
plot(x[,plot.cols],bty="n",xlim=c(200,600),main="2010 NFL Season")
text(rep(540,15),seq(475,325,length.out=15),g)

##plotting the faces
plot.faces(a,x[,plot.cols[1]],x[,plot.cols[2]],width=30,height=30)
dev.off()

Watson (in the wild)

Watson, an IBM computer, recently competed on Jeopardy, demoralizing his opponents Ken Jennings and Brad Rutter. (Jennings describes what it was like to lose to Watson here.) Really it was an uneven playing field from the start. For example, “Watson’s brain showcases several IBM technologies. The hardware is jammed into 10 refrigerator-sized racks filled with Power7 server blades. To be exact, there are 90 Power750 servers filled with four processors each — and each processor has 8 cores, for a total of 2,880 cores altogether.” (InformaWorld) That’s a lot of cores for our tiny human brains to compete against. Therefore, I am calling for an all-computer Jeopardy tournament. Similar to the Netflix Prize, different organizations could compete in a tournament pitting their algorithms against other organizations’ algorithms, with the winner collecting a large cash prize and valuable free publicity and marketing. Just imagine IBM’s Watson competing against Google’s BrinPage and Microsoft’s Gates. That would be pretty intense.

Cheers.

NFL Week 15 (in the wild)

14 weeks of the NFL season are gone. 3 weeks remain.

There have been a few big changes in the projected playoff seeds. In the AFC, the Jets fall to a projected 6 seed and the Ravens take over the 5 seed. Also, Jacksonville and Kansas City change places with Jacksonville projected to be the three seed with Kansas City the 4 seed.
In the NFC, Green Bay is out and the Giants are in as the six seed.

Here are the updated SITW projected playoff seeds:
View the full rankings here.

Week 15 Playoff projections (through week 14):
AFC Projected seeds (Expected Wins) [Probability they make playoffs] {Ranking}:
1. New England (13.6973) [1] {1}
2. Pittsburgh (12.5784) [.9999] {3}
3. Jacksonville (9.9711) [.8196] {13}
4. Kansas City (9.7525) [.6738] {16}
5. Baltimore (11.3688) [.9992] {4}
6. New York Jets (10.837) [.9088] {5}

NFC Projected seeds:
1. Atlanta (13.6973) [1] {2}
2. Philadelphia (11.2133) [.9151] {8}
3. Chicago (10.6212) [.9047] {10}
4. St. Louis (7.3868) [.5744] {22}
5. New Orleans (11.2239) [.9038] {6}
6. New York Giants (10.6128) [.6983] {12}

Cheers.

NFL Week 14 (in the wild)

The interesting thing about these playoff projections is that Philadelphia is projected to be the 2 seed in the NFC over Chicago even though Chicago has 1 more win than them. Take a look at the rest of the Eagles’ schedule and then look at the Bears’ remaining schedule and it becomes very clear why these are the projections.

You can view the full rankings here.

Week 14 Playoff projections:
AFC Projected seeds (Expected Wins) [Probability they make playoffs] {Ranking}:
1. New England (13.6874) [1] {2}
2. Pittsburgh (12.4096) [1] {3}
3. Kansas City (10.425) [.843] {15}
4. Jacksonville (9.518) [.7124] {14}
5. New York Jets (11.8408) [.9998] {4}
6. Baltimore (11.2332) [.9982] {5}

NFC Projected seeds:
1. Atlanta (13.6874) [1] {1}
2. Philadelphia (10.9962) [.8438] {9}
3. Chicago (10.587) [.8484] {10}
4. St. Louis (7.4758) [.477] {23}
5. New Orleans (11.091) [.8424] {6}
6. Green Bay (10.3086) [.7648] {7}

My Rankings: (Wins)
Teams Wins
1 Atlanta (10)
2 NewEngland (10)
3 Pittsburgh (9)
4 NYJets (9)
5 Baltimore (8)
6 NewOrleans (9)
7 GreenBay (8)
8 TampaBay (7)
9 Philadelphia (8)
10 Chicago (9)
11 Miami (6)
12 Cleveland (5)
13 NYGiants (8)
14 Jacksonville (7)
15 KansasCity (8)
16 Tennessee (5)
17 Indianapolis (6)
18 Minnesota (5)
19 SanDiego (6)
20 Oakland (6)
21 Washington (5)
22 Seattle (6)
23 StLouis (6)
24 Houston (5)
25 Dallas (4)
26 SanFrancisco (4)
27 Denver (3)
28 Arizona (3)
29 Cincinnati (2)
30 Buffalo (2)
31 Detroit (2)
32 Carolina (1)

Cheers.

NFL Week 13 (in the wild)

First off, what a joke the NFC West is.
Although, I am excited about rooting for a 6 win team to make the playoffs and get a first round home playoff game. Nice work NFL.

A thought about being the 2 seed: I think in a year like this there may be some advantage to being the 2 seed rather than the 1 seed.

Let’s look at the AFC. Let’s assume that my playoff projections hold true. The seeds of the two wild card round winners will be (3,4), (3,5), (4,6), or (5,6). After the first round of the playoffs, the one seed plays the lowest remaining seed, which has to be the 4, 5, or 6 seed. So they are going to have to play Indianapolis, the New York Jets, or Pittsburgh. I would argue that all three of these teams are better than the Kansas City Chiefs. The two seed has to play either the 3, 4, or 5 seed. They are guaranteed to not have to play the Steelers, and they avoid the Jets in all scenarios except the one where both the Jets and Steelers win. This looks to me like an easier path to the AFC championship game.
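The four scenarios are easy to enumerate. This is a small sketch assuming the usual six-team bracket (3 hosts 6 and 4 hosts 5 in the wild card round) and the projected AFC seeds listed below; the column names are made up for illustration.

```r
# Wild card round: 3 hosts 6 and 4 hosts 5. Enumerate all four outcomes.
scenarios <- expand.grid(w36 = c(3, 6), w45 = c(4, 5))

# The 1 seed plays the lowest remaining seed (the larger seed number);
# the 2 seed plays the other wild card winner.
scenarios$one_seed_plays <- pmax(scenarios$w36, scenarios$w45)
scenarios$two_seed_plays <- pmin(scenarios$w36, scenarios$w45)

# Projected AFC seeds 3 through 6 from the table below.
seed_team <- c("3" = "Kansas City", "4" = "Indianapolis",
               "5" = "New York Jets", "6" = "Pittsburgh")
scenarios$one_seed_opponent <- unname(seed_team[as.character(scenarios$one_seed_plays)])
scenarios$two_seed_opponent <- unname(seed_team[as.character(scenarios$two_seed_plays)])
scenarios
```

In this enumeration the 2 seed draws the Jets only in the row where both the 5 and 6 seeds win, and never draws Pittsburgh.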

Week 12 Playoff projections:
AFC Projected seeds (Expected Wins) [Probability they win the Super Bowl] {Ranking}:
1. New England (12.92) [.199] {2}
2. Baltimore (11.90) [.12] {5}
3. Kansas City (10.05) [.0008] {15}
4. Indianapolis (8.88) [.0004] {14}
5. New York Jets (12.45) [.1292] {3}
6. Pittsburgh (11.77) [.1162] {4}

NFC Projected seeds:
1. Atlanta (13.45) [.349] {1}
2. Philadelphia (10.86) [.0208] {10}
3. Chicago (10.49) [.0114] {11}
4. Seattle (7.29) [0] {22}
5. New Orleans (10.86) [.0254] {6}
6. Green Bay (10.15) [.0138] {8}

My Rankings: (Wins)
1 Atlanta (9)
2 NewEngland (9)
3 NYJets (9)
4 Pittsburgh (8)
5 Baltimore (8)
6 NewOrleans (8)
7 TampaBay (7)
8 GreenBay (7)
9 Miami (6)
10 Philadelphia (7)
11 Chicago (8)
12 Tennessee (5)
13 NYGiants (7)
14 Indianapolis (6)
15 KansasCity (7)
16 Cleveland (4)
17 SanDiego (6)
18 Jacksonville (6)
19 Washington (5)
20 Minnesota (4)
21 Oakland (5)
22 Seattle (5)
23 Houston (5)
24 StLouis (5)
25 SanFrancisco (4)
26 Denver (3)
27 Dallas (3)
28 Arizona (3)
29 Buffalo (2)
30 Cincinnati (2)
31 Detroit (2)
32 Carolina (1)

Cheers.

NFL Week 12 (in the wild)

Playoff Picture:

AFC
1. NewEngland
2. Pittsburgh
3. Indianapolis
4. Kansas City
5. NYJets
6. Baltimore

NFC
1. Atlanta
2. Philadelphia
3. GreenBay
4. Seattle
5. NewOrleans
6. TampaBay

Estimated Probabilities of making the playoffs/conference champion/super bowl champion:

AFC.Playoff.Teams
   Baltimore    Cleveland       Denver      Houston Indianapolis
      0.9706       0.0002       0.0046       0.0024       0.6040
Jacksonville   KansasCity        Miami   NewEngland       NYJets
      0.3584       0.6874       0.0282       0.9950       0.9840
     Oakland   Pittsburgh     SanDiego    Tennessee
      0.0830       0.9568       0.2334       0.0920

NFC.Playoff.Teams
     Arizona      Atlanta      Chicago     GreenBay   NewOrleans
      0.1446       0.9996       0.5944       0.8642       0.7922
    NYGiants Philadelphia SanFrancisco      Seattle      StLouis
      0.1460       0.9740       0.0332       0.6892       0.1338
    TampaBay   Washington
      0.5752       0.0536

AFC.Champion
   Baltimore Indianapolis Jacksonville   KansasCity        Miami
      0.1668       0.0182       0.0046       0.0054       0.0004
  NewEngland       NYJets   Pittsburgh     SanDiego    Tennessee
      0.3490       0.2510       0.2018       0.0012       0.0016

NFC.Champion
     Arizona      Atlanta      Chicago     GreenBay   NewOrleans
      0.0002       0.5206       0.0246       0.0926       0.0704
    NYGiants Philadelphia SanFrancisco      Seattle      StLouis
      0.0042       0.2354       0.0002       0.0062       0.0004
    TampaBay   Washington
      0.0434       0.0018

SB.Champion
     Atlanta    Baltimore      Chicago     GreenBay Indianapolis
      0.2730       0.0854       0.0046       0.0308       0.0048
Jacksonville   KansasCity        Miami   NewEngland   NewOrleans
      0.0010       0.0006       0.0004       0.2220       0.0214
    NYGiants       NYJets Philadelphia   Pittsburgh     SanDiego
      0.0012       0.1452       0.0832       0.1116       0.0002
     Seattle     TampaBay    Tennessee   Washington
      0.0014       0.0124       0.0006       0.0002

Cheers.

Early Sports Statistics (in the wild)

This was forwarded to me by S.J. It’s quite a long journey from the stuff in this video to some of the present day statistics like spatial aggregate fielding evaluation (S.A.F.E).

“[Recorded: circa 1959] – “The Electronic Coach” is a short film made by IBM describing the use of computers in the management of a university basketball team. The film features computer science legend Don Knuth, then a senior at Case Institute of Technology. For all four of his undergraduate years at Case (1956-60), Knuth was manager of the basketball team and sought ways to improve his team’s play by analyzing a series of special statistics he captured during games. The scoring method was unusual in the weightings it gave to activities not necessarily associated with traditional coaching but Knuth’s insights into basketball, combined with his computerization of the reams of data he collected, helped Case’s coaching staff make their basketball team a winner. The computer used is an IBM 650. ”

Cheers.

JSM (in the wild)

I’ve been in Vancouver the past week for the Joint Statistical Meetings (JSM). Here is a collection of my thoughts and comments from the few days I was at the conference.

On Monday I went to the section on Survey Research Methods and saw Meena Khare of the National Center for Health Statistics (NCHS) and Laura Zayatz of the U.S. Census Bureau give talks. They both spoke about the measures their institutions take before releasing data to the public. The NCHS looks for uncommon combinations of variables that could possibly be used to re-identify individuals in the data. Both organizations first remove obvious identifiers and then take further steps to make the released data more private. For instance, if the number of observations with a particular combination of variables in the released data set is n and the number of observations with that combination (presumably in the population) is N, they would consider any combination of variables where n/N<.33 at risk for disclosure. At the Census Bureau, they have used something called data swapping to protect public release data sets in the last two Censuses (2000 and 2010). Along with data swapping, the Census Bureau will also be using partially synthetic data to protect individual privacy in public release data in 2010.
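Here is a minimal sketch of the n/N check as I recorded it, on made-up data. It assumes n counts records in the released file and N counts records with the same combination in the population, which is my reading of the talk rather than a documented NCHS procedure; the variables `age_group` and `zip3` and the helper `key` are hypothetical.

```r
# Hypothetical population (made-up data) and a released sample drawn from it.
population <- data.frame(
  age_group = c("20s", "20s", "20s", "20s", "30s", "30s", "30s",
                "40s", "40s", "40s", "40s", "40s"),
  zip3      = c("010", "010", "010", "010", "010", "010", "010",
                "011", "011", "011", "011", "011"),
  stringsAsFactors = FALSE
)
released <- population[c(1, 2, 5, 8), ]

# One key per combination of the identifying variables.
key <- function(d) paste(d$age_group, d$zip3, sep = "|")

n <- table(key(released))                 # counts in the released data
N <- table(key(population))[names(n)]     # counts in the population

risk <- data.frame(
  combination = names(n),
  n = as.vector(n),
  N = as.vector(N)
)
risk$ratio <- risk$n / risk$N
# Threshold and direction exactly as reported in the talk notes above.
risk$flagged <- risk$ratio < 0.33
risk
```

Only the rule itself comes from the talk; the counting shown is just one way to get n and N per combination.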

Several things strike me about this.
-The methods that these organizations use to protect confidentiality are certainly going to increase privacy compared to a release of raw data; however, there doesn’t seem to be any way to know that what is being done is providing “enough” privacy.
-It’s clear that many different government organizations have issues which require some use of disclosure limiting techniques; however, it seems that each organization is creating its own rules and there is limited discussion going on between organizations to conceive of a standard policy for data sharing.
-There doesn’t actually seem to be any definition of what is considered a disclosure. For instance, if government data is released and I discover through some technique that someone definitely has HIV, then clearly a disclosure has taken place. However, if I use the same data to discover that someone definitely does not have HIV, a disclosure has still taken place, but the consequence is much less damaging. Furthermore, consider a situation where, prior to the data release, I know a particular individual has a 50 percent chance of having HIV. After the data release, I can infer that there is a 99 percent chance that they have HIV. Clearly, I would consider this a disclosure. But what if the probabilities shift from only 50 percent (pre-data release) to 75 percent (post-data release), or from 50 percent to 55 percent? At what point is “too much” information being released? It seems as if this issue receives less attention than is warranted.
-Finally, I believe that the ultimate solution to the disclosure problem is a careful combination of policy and disclosure limiting techniques. Policy issues include defining how much privacy must be maintained by a given technique, as well as legal consequences for knowingly disclosing private information. Statistics has an obligation to provide steadily improving statistical disclosure techniques along with metrics for measuring the privacy of a given technique.

Later on Monday, I saw the tail end of the talk by David Purdy titled “Statisticians: 3, Computer Scientists: 35”. The abstract for the talk was:
“John Tukey and Leo Breiman warned us that a day would come when statistics would need to focus more on computing, or risk losing good students to computer science. The Netflix Prize provides many examples of how our field needs to do more.

In the top 2 teams, participants with a computer science background vastly outnumbered those with a statistics background. There are a number of lessons that the field of statistics can learn from the fact that undergraduates in CS were well equipped to compete, while statisticians at all levels were not well prepared to implement advanced algorithms.

In this talk, I will address methodological issues arising with such a large, sparse dataset, how it demands serious computational talents, and where there is ample room for statistics to make contributions.”

I only saw the end of the talk, but I feel like I got the point. He notes how programs in statistics need to expose students to more aspects of computing. One quote from his talk that particularly shocked me was from a prominent statistician referring to the Netflix prize data set: (I’m paraphrasing) “I can’t do anything with the data, there is just too much of it.” (If anyone knows the actual quote, I would love to have it). Too little data may often be a problem, but too much data should be a blessing, rather than a curse.

When he talks about computing, he is referring to implementing complex algorithms to analyze the data; in my experience, however, people struggle with simply managing data of this size. This should be a simple problem to deal with, but I have both had it happen to me and seen it happen to others. When I was in grad school working towards my master’s degree right out of undergrad, we were given a problem in a consulting class with a “large” amount of data (several thousand observations; well, “large” to someone with no experience managing data). We (my group) knew exactly what we wanted to do with the data, but we were unable to manage the data in a way that would make it useful for analysis. So we did nothing. The moral of the story here is that, while we were taught well the techniques which were useful for analyzing the data, we were never taught and had never learned any useful data manipulation techniques, rendering our statistical educations useless. It was not until I got my first job that I learned, out of necessity, data management techniques including SAS data steps, SAS macros, and SQL.

When I returned to school to pursue a Ph.D., I saw many students with no work experience struggling through all of the same problems that I had with managing data. The same old “I know exactly what I want to do, but I can’t organize the data.” Oftentimes in grad classes, books or teachers will describe a data set as “large” when it has several hundred or several thousand observations. This seems like inadequate preparation for working in industry, as my first jobs often dealt with data sets with millions of observations and, later, a summer consulting project involved billions of observations.

Currently, there are no required computing or data management classes in my program for earning a Ph.D. in statistics. I think there should be a required class in every statistics program covering data management issues and, at least, a solid introduction to programming.

After David Purdy’s talk, he and Chris Volinsky took questions. One interesting question that came up was about a second Netflix Prize. However, Chris noted that this had to be cancelled because of privacy concerns. I’ve written before (or at least posted on Twitter) about some researchers who claim to have de-anonymized the data from the Netflix Prize and, as a result, a lawsuit has been filed. (Netflix’s Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder) Whether you agree with canceling the prize over privacy concerns or not, it is clear that disclosure limitation is currently a big issue that certainly cannot be ignored.

On Tuesday, I went to one of the sports research sections and saw two talks before I left to go see a talk about partially synthetic data in longitudinal data. The first speaker, Shane Jensen, spoke about evaluating fielders’ abilities in baseball using a method he proposed called Spatial Aggregate Fielding Evaluation (SAFE). The previous link explains how their evaluation of players works and gives measures of performance for each player. Probably the most shocking result of his work is that, averaged over 2002-2008, SAFE evaluated Derek Jeter as the worst shortstop who met the minimum number of balls in play (BIP). In contrast, SAFE rates Alex Rodriguez as the second best shortstop over this same period, even though he now plays third to allow Jeter to play SS.

The next speaker was Ben Baumer, statistical analyst for the Mets (and native of the 413 area code). He spoke about his paper, “Using Simulation to Estimate the Impact of Baserunning Ability in Baseball”. One of the interesting things I took away from his talk is his claim that a player’s speed when breaking up a double play is one of the important aspects of base running, but it is often largely or completely ignored when evaluating a player’s base running ability.

Before I end, I’d like to say thanks to all the speakers that I saw speak this past week and, finally, I’ll leave you with a view of Vancouver from the convention center.

Cheers.

NCAA Basketball Sweet Sixteen edition (in the wild)

Well there are 16 teams left and my bracket is in shambles.

Let’s review the predictions from last week.
My failures:
Two of my final four teams are gone (Villanova and Kansas) including my champion (Kansas) so I won’t win any office pools this year.

My triumphs:
-A lot of my predictions from my tournament preview came through including both of my predicted lower seed first round locks (Northern Iowa and Missouri).

-I had Northern Iowa in my top 25. Certainly not higher than Kansas, but definitely a top 25 team.

-My model ranked Cornell 18th and I ignored it figuring it was a fluky part of the model. (Similar to how I have Oral Roberts ranked 8th and Sam Houston State ranked 2nd using the raw data.) I guess Cornell really is that good. This win over Wisconsin should shoot them up in my rankings when they come out tomorrow.

Some observations:
-Washington is really bad. They beat a mediocre Marquette team and then a New Mexico team that lost to San Diego State in its conference tournament. West Virginia is going to massacre Washington. (Note: Marquette lost 51-50 to DePaul this season. DePaul was 1-17 in the Big East. Yikes.)

-The Kentucky-Cornell game is going to be really interesting. Kentucky is loaded with super talented freshmen who are thinking about the NBA and millions of dollars, and Cornell is stacked with seniors who are thinking about grad school next fall.

My picks:
-Kentucky over Cornell: Cornell keeps it close in the first half, but Kentucky’s superior talent takes over. They win by 10.
-West Virginia over Washington: West Virginia is going to kill them. I see them winning by 15-20 points.
-Duke over Purdue: This is an interesting match-up. Purdue had thoughts of a number 1 seed going into their conference tournament and they ended up with a 4. I’ll be interested to see how they play in this one.
-Baylor over Saint Mary’s: Baylor has beaten a 14 and an 11 seed. Saint Mary’s is a 10. I’m not sure what that means, but the total of the seeds of Baylor’s first three opponents is 35. That has to be the highest total for a team’s first three opponents, right? Anyway, this one will be close. Baylor by 3.
-Northern Iowa over Michigan State: I had Northern Iowa in my top 25 at the beginning of the tournament and they are still there. I had Michigan State out of my top 25. Northern Iowa by 7. (Michigan State has made the Sweet Sixteen three years in a row.)
-Tennessee over Ohio State: How many games does Ohio State have to win in a row to convince me they are for real? 1 more. But I doubt it. Tennessee by 10.
-Syracuse over Butler: Syracuse by 5…..unless Shelvin Mack hits a million threes again. Then who knows.
-Kansas State over Xavier: Kansas State looks good. Really good. (Note: Xavier has made the Sweet Sixteen three years in a row.)

Final Four picks:
Kentucky, Duke, Syracuse, and……(wait for it)…………Northern Iowa. Northern Iowa beat Kansas, surely they can beat Michigan State, and Ohio State or Tennessee, right?

Finals:
Syracuse vs Kentucky

Champion:
Kentucky 72-68

Can’t wait until next Monday when I get to see just how wrong I am again.

Cheers.