Category Archives: Uncategorized
March Madness in the Wild
I was sitting in my office last semester and this really tall guy walked past my door. I kind of did a double take, he was so tall. A few seconds later he came back and knocked on my door. He asked if my office mate was there so he could hand in some homework. I told him that he wasn’t around, but I could take the homework and give it to him when I saw him. He thanked me and left. It was Hasheem Thabeet. So in honor of Mr. Thabeet handing in his STAT 1100 homework to me, I have started a Yahoo! tourney pick’em group.
Join the Stats in the Wild Yahoo! Tournament Challenge. Join group #166142. Good luck.
Every year in March I start to care about college basketball. I even build some half-assed models to do my bracket. I finished in the 99th percentile of all Yahoo! brackets last year by managing to pick the champion, the two finalists, and all four Final Four teams. (My goal for a long time was to pick all four of the Final Four. I’ve been stuck on three out of four since high school.) Anyway, in order to make my predictions, I build several models and then I combine the results.
Here are some thoughts on the tournament based on those models:
Most overrated teams in the tournament:
1. LSU: The SEC is a great football conference and a mediocre basketball conference (at least this year), and LSU is leading the charge of mediocre teams in this conference. LSU only wins their first round game because they play Butler. Speaking of Butler…..
2. Butler: They lost to Wisconsin-Green Bay, Wisconsin-Milwaukee, and Loyola Chicago. Then they lost to Cleveland State in their conference tournament. Although, in their defense, they did beat IU-South Bend 87-33. They played two top-25 teams all year and went 1-1: Ohio State (who is no longer in the top 25) and Xavier (another overrated team). Speaking of Xavier…..
3. Xavier: Look. I’m a Umass fan (from the days of Roe, Camby, Dana Dingle, Donta Bright, Derek Kellogg), but the A-10 stinks. Xavier deserves to be in the tourney, but as a 4 seed? I doubt it. This is the team that closed its regular season 4-4 with losses to Duquesne, Dayton, Charlotte, and Richmond, then lost to Temple in their conference tournament. I’m not exactly inspired by the Musketeers. Florida State beats them in the second round by 10.
4. Dayton: An at large bid? So Dayton is better than St. Mary’s, Penn State, Florida, Auburn, Creighton, and Miami? That’s what the selection committee is saying.
5. Boston College: USC should dispose of them in the first round. USC by 12.
Most Underrated teams in the tournament
West Virginia: I’ve got West Virginia as a Sweet Sixteen team. They are going to dismantle Dayton in the first round, then beat Kansas by 2. Then they have a good chance against Michigan State for a shot at the Elite 8.
Wisconsin: I think they got screwed with a 12 seed, and then by drawing a very good FSU 5 seed. If they can get past this game, they should cruise into the Sweet Sixteen past Xavier.
First round upsets: USC over BC.
Most likely 13 seed to win first round: Cleveland State
Most likely 12 seed to win first round: Arizona
Most likely 11 seed to win first round: Utah State
Most likely 10 seed to win first round: USC
Most likely 9 seed to win first round: Butler (Only because LSU stinks too.)
Biggest 3 seed blowout: Missouri
4 seed: Washington
5 seed: Illinois
6 seed: West Virginia
7 seed: Cal
8 seed: Ohio State
First number 1 seed gone: UConn
Good first round bets:
Washington -220
FSU -145
Utah -110
Illinois -200
Arizona St -200
Michigan +190 (This price is fantastic)
Texas A&M +115
Oklahoma St +115 (Seriously? They’re an underdog here?)
Ohio St -160
And finally, the official Final 4 picks of Stats in the Wild (Drumroll Please):
UNC, Pitt, Memphis, and Louisville, with UNC beating Memphis in the finals 76-75.
Also, I am picking San Diego State to win the NIT with a 72-67 win over Miami in the finals.
And for my boys in the NIT…..what? They’re not even in the NIT. Bring back Calipari!
Go.
Go U.
Go U – Mass.
Go Umass!
Cheers.
ENAR in the wild
I’m sitting at Bradley International Airport waiting for a flight to Raleigh/Durham. Then I’ll hop on a connector to Memphis, followed by another connector to San Antonio, Texas. What’s in San Antonio, you might ask? The Alamo? Tim Duncan? Well yes, but why would I post that on a stats blog?
The reason I am headed to the Lone Star state is the meeting of the Eastern North American Region (ENAR) (pronounced EE-NAR) of the International Biometric Society (IBS). (IBS is an unfortunate acronym in itself, but I should note that the Western North American Region (WNAR) is pronounced WEE-NAR. That is unfortunate. I am also extremely childish.) I’ll be there from tonight until Monday afternoon, so I can attend a few talks in the morning before I leave.
On Sunday night from 8-11pm, I’m giving a poster presentation about synthetic data related to the paper I just published. It’s like a student poster bonanza. And they serve alcohol. So it should be interesting.
I’d love to post the abstracts to some of the talks I am interested in attending, but Bradley’s internet connection is not letting me view the abstract page.
If Raleigh/Durham or Memphis has a better connection, I’ll post them when I get there.
Cheers.
Probability in the Wild
This isn’t really “statistics” or “in the wild” since it’s a problem from an advanced probability class I am taking, but it’s a neat problem so you’ll just have to live with it.
Here is the game:
I get a die and the opponent gets a die. (These dice are not necessarily numbered 1, 2, 3, 4, 5, 6. The sides could be numbered 1, 1, 1, 3, 3, 3, for example, but each die has six sides.) We both roll. If your number is higher than my number, you win a dollar. If my number is higher than your number, I win a dollar.
Can you design three six-sided dice in such a way that, if the opponent chooses his die first, I can always choose a die from the remaining two that makes my expected winnings positive?
STOP HERE IF YOU WANT TO TRY THIS YOURSELF!

You probably didn’t even try. So here’s an answer.
Here are three possible dice:
A: 0 0 6 6 6 6
B: 4 4 4 4 4 4
C: 2 2 2 2 8 8
Die A beats die B with probability 2/3.
Die B beats die C with probability 2/3.
Die C beats die A with probability 5/9.
(Pr(C > A) = Pr(C > A | A=0)·Pr(A=0) + Pr(C > A | A=6)·Pr(A=6) = 1·(1/3) + (1/3)·(2/3) = 5/9.)
So if I let my opponent choose first, I can always make a choice that will make my expectation positive.
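If you want to double-check the cycle, the three matchups are easy to enumerate by brute force. Here’s a quick sketch (my code, not part of the original problem):

```python
from fractions import Fraction
from itertools import product

# The three dice from the post; each of the six sides is equally likely.
dice = {
    "A": [0, 0, 6, 6, 6, 6],
    "B": [4, 4, 4, 4, 4, 4],
    "C": [2, 2, 2, 2, 8, 8],
}

def p_beats(x, y):
    """Exact probability that a roll of die x beats a roll of die y,
    by enumerating all 36 equally likely side pairings."""
    wins = sum(1 for a, b in product(dice[x], dice[y]) if a > b)
    return Fraction(wins, 36)

print(p_beats("A", "B"))  # 2/3
print(p_beats("B", "C"))  # 2/3
print(p_beats("C", "A"))  # 5/9
```

Whichever die the opponent picks, one of the remaining two beats it with probability greater than 1/2, so my expected winnings are always positive.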
The moral:
Make these dice, head to a bar, and fleece some drunks (just don’t tell them I sent you). I guess that’s the “in the wild” part.
Cheers.
Shark, the Global Financial Crisis, and Correlation (in the wild)
I was reading the fantastic blog The Bernoulli Trial today (the Stock Market Seismometer is very interesting), and I stumbled onto one of his links, which brought me to The Mr Science Show. All of the posts are very interesting, but one seemed particularly relevant to this page.
One of the topics I have talked about in the past is correlations between two variables when the correlation is just a coincidence (like me being in grad school and the Red Sox winning the World Series). Here is a post on Mr. Science Show which relates the global financial crisis to the number of fatal shark attacks in the world. In fact, this correlation has won the prestigious Mr. Science Show Correlation of the Week. Congratulations.
Cheers.
Shirts in the Wild
Having trouble dressing yourself and love statistics? Solve your problems with T-Shirts with witty statistics slogans from SassyStatistics.com (The link was sent in to statsinthewild by avid reader Kevin S. Bagge.)
My favorite would be: “One way or ANOVA, I’m gonna getcha.”
(Note: I am in no way financially involved with SassyStatistics.com)
(Note: If Sassy statistics wants to pay me, then I will gladly be involved with SassyStatistics.com)
Cheers.
Data Privacy in the Wild
I’m currently working on a systematic literature review of statistical issues of privacy. One of the papers I was reading today was from the Census. This paper discusses how the Census maintains the privacy of individuals when it releases microdata to the public.
(I’ve been reading some papers which propose methods that are much more sophisticated and (hopefully) provide more privacy than the rather simple methods the Census mentions in this paper. I have no opinion on whether or not these methods are “sophisticated enough” to maintain privacy; I just think it’s interesting to see what the Census is actually doing.)
The paper mentions three techniques that the Census currently uses to maintain privacy.
1.) release of data for only a sample of the population
2.) limitation of detail
3.) top/bottom-coding
Here is an explanation from the census as to why they use these techniques:
“The Census Bureau currently uses several standard techniques to mask microdata sets. The first is a release of data for only a sample of the population. Intruders (i.e., those who query the file for the sole purpose of identifying particular individuals with unique traits) realize that there is only a small probability that the file actually contains the records for which they are looking. The Bureau currently releases three public use samples of the decennial census respondents. One is a 1 percent sample of the entire population, the second a 5 percent sample, and the third a sample of elderly residents. Each is a systematic sample chosen with a random start. None of these files overlap, so there is no danger of matching to each other. Most demographic surveys are 1-in-1000 and 1-in-1500 “random” samples. Generally the public use file for each survey contains records for each respondent. The second technique involves the limitation of detail. The Census Bureau releases no geographic identifiers which would restrict the record to a sub-population of less than 100,000. It also “recodes” some continuous values into intervals and combines sparse categories. Intruders must have extremely fine detail for other highly sensitive fields in order to positively identify targets. The third technique protects the detail in sensitive responses in continuous fields. It is referred to as top/bottom-coding. This method collapses extreme values of each sensitive field into a single value. For example, the record of an individual with an extremely high income would not contain his exact income but rather a code showing that the income was over $100,000. Similarly the low-income records would contain a code signifying the income was less than $0. In this example $0 is a bottom-code and $100,000 a top-code for the sensitive or high visibility field of income.” – “Controlled Data-Swapping Techniques for Masking Public Use Microdata Sets,” Richard A. Moore, Jr.
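As a toy illustration (my own sketch; the thresholds and code strings are just the ones from the quoted income example), top/bottom-coding amounts to:

```python
def top_bottom_code(income, bottom=0, top=100_000):
    """Collapse extreme values of a sensitive field into a single code,
    as in the Census example: incomes over the top-code or under the
    bottom-code are replaced by the code itself, not the exact value."""
    if income > top:
        return f">{top}"
    if income < bottom:
        return f"<{bottom}"
    return income

print([top_bottom_code(x) for x in [-5_000, 42_000, 250_000]])
# ['<0', 42000, '>100000']
```

Values in the middle of the distribution pass through untouched; only the identifying extremes are coarsened.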
The rest of the paper goes on to discuss a more sophisticated method of maintaining privacy called data swapping (first proposed by Dalenius and Reiss (1980)). If you’re not bored to tears already, then it’s probably worth reading about. I think it’s interesting.
Cheers.
R in the wild
Here is an interesting blog post about how Google and Facebook are using R (free statistical software) at their companies.
Cheers.
The New Netflix Prize (in the wild)
I know it’s been a while since I’ve posted. I’ve been using Twitter (follow StatsInTheWild) a lot more for the smaller posts.
Anyway, I’ve been preparing and studying for my general exam (you might call it a comprehensive exam), and so I’ve been reading a lot about disclosure limitation, my dissertation topic. In putting together a presentation explaining why disclosure control is necessary, I’ve listed two examples of really bad disclosures. The first was presented by Latanya Sweeney in her paper on k-anonymity. She took supposedly anonymized data released by the Group Insurance Commission (GIC) and, using publicly available voting records, identified former Massachusetts Governor William Weld.
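The linkage attack behind the Weld example can be sketched with completely made-up data (every record, name, and field value below is invented for illustration): join a “de-identified” table to a public list on shared quasi-identifiers.

```python
# Toy linkage attack: a medical table released without names is joined
# to a public voter list on the quasi-identifiers (ZIP, birth date, sex).
medical = [  # "anonymized" release
    {"zip": "02138", "dob": "1950-01-01", "sex": "M", "diagnosis": "X"},
    {"zip": "02139", "dob": "1972-01-15", "sex": "F", "diagnosis": "Y"},
]
voters = [  # public records, with names
    {"zip": "02138", "dob": "1950-01-01", "sex": "M", "name": "Voter 1"},
    {"zip": "02140", "dob": "1980-03-02", "sex": "F", "name": "Voter 2"},
]

QUASI = ("zip", "dob", "sex")

def key(record):
    """Quasi-identifier tuple used to link the two tables."""
    return tuple(record[k] for k in QUASI)

names_by_key = {key(v): v["name"] for v in voters}

# Any medical record whose quasi-identifiers match a voter is re-identified.
for rec in medical:
    name = names_by_key.get(key(rec))
    if name is not None:
        print(f"{name} -> diagnosis {rec['diagnosis']}")
```

Neither table contains anything sensitive on its own; the breach comes from the join, which is exactly why “just drop the names” is not enough.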
This is a huge problem in many areas of research. So many people rely on publicly released data for research, but organizations may be apprehensive to release their data due to privacy concerns. My second example involves the data from the Netflix prize. Narayanan and Shmatikov (2008), in their paper “How To Break Anonymity of the Netflix Prize Dataset”, use Netflix Prize data along with Internet Movie Database data.
They say in their abstract: “We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”
So what is the problem with disclosing what movies someone is watching? It’s illegal. Check out the Video Privacy Protection Act of 1988 (18 U.S.C. 2710 (2002)). So someone decided to sue Netflix. The Privacy Law Blog has a great article summarizing the situation: Netflix Sued for “Largest Voluntary Privacy Breach To Date”.
From that article: “Plaintiffs argue this disclosure constitutes a severe invasion of their privacy by Netflix, which violates, among other things, the Video Privacy Protection Act of 1988 (18 U.S.C. 2710 (2002)). Additionally, the lead plaintiff in this case, Jane Doe, claims that Netflix’s disclosure of her movie rental history and ratings has and/or will ‘identify or permit inference of her sexual orientation… [which…] would negatively affect her ability to pursue her livelihood and support her family, and would hinder her and her children’s ability to live peaceful lives within Plaintiff Doe’s community.’”
If you’re a lawyer, I’d love a quick comment on this case. So anyway, that’s some background, and it brings me to my main point. I was going to suggest that Netflix run another Netflix prize. After checking, they have already decided to do that.
So instead, I’ll just suggest what the second Netflix contest should be. This contest would be a prize for figuring out a way of releasing Netflix data in such a way that valid inference can still be made while, at the same time, maintaining privacy. The format would be as follows: Netflix brings in experts in statistical disclosure limitation (for example, Jerry Reiter (his papers on the subject)) to create private versions of the Netflix data for release to the public (for example with Synthetic Data).
Say there were 10 experts. Netflix would put $100,000 in escrow for each expert. Anyone who can demonstrate a privacy breach in any of the private data sets within 12 months gets the $100,000 of the expert who created that data set. If twelve months elapse, the expert keeps the $100,000 (let’s say they have to donate a portion to the charity of their choice). The other part of the contest would be similar to the first Netflix prize, but the private data would be used in the modelling efforts. One suggestion I read for the second Netflix prize was predicting churn.
Whoever comes up with the best (by some appropriate criteria) model for predicting churn using the private data wins $1,000,000.
Further, whichever expert’s private data was used to create the best model wins another $100,000. So the experts have a financial incentive to create private data that is as useful as possible. Users have two incentives: 1) demonstrating a privacy breach or 2) improving churn models. They can work on either or both. If a privacy breach is ever demonstrated for a particular data set, all models built on that data set are disqualified. One potential problem with this proposal is defining what exactly constitutes a privacy breach; it’s up to Netflix to decide these details.
Framing the contest this way will accomplish several goals. First, showing that they are concerned with privacy may earn them points with customers who are worried about such things. Second, if they want to keep doing Netflix prizes, and I suspect they do, they are going to run into this privacy problem over and over again. By dealing with it now, they will be able to continue their Netflix prizes by releasing useful data to the public for research in a private way. Further, if they demonstrate a way to release data with high utility and high privacy, other potential data-releasing organizations could use the protocols that Netflix pioneered with their second Netflix prize.
Also, Netflix if you are reading this and you want to hire me, I could be lured away from grad school for the right price.
Cheers.
As a Netflix user, I recommend Big Fan, did not like The 39 Steps, and the next movie in my queue is Funny People.
College in the Wild
Here is an interesting article titled “The Great College Hoax” which was emailed to me by an avid reader.
Here is an excerpt from the article:
“College graduates will earn $1 million more than those with only a high school diploma, brags Mercy College radio ads running in the New York area. The $1 million shibboleth is a favorite of college barkers.
Like many good cons, this one contains a kernel of truth. Census figures show that college grads earn an average of $57,500 a year, which is 82% more than the $31,600 high school alumni make. Multiply the $25,900 difference by the 40 years the average person works and, sure enough, it comes to a tad over $1 million.
But anybody who has gotten a passing grade in statistics knows what’s wrong with this line of argument. A correlation between B.A.s and incomes is not proof of cause and effect. It may reflect nothing more than the fact that the economy rewards smart people and smart people are likely to go to college. To cite the extreme and obvious example: Bill Gates is rich because he knows how to run a business, not because he matriculated at Harvard. Finishing his degree wouldn’t have increased his income.” – The Great College Hoax, Kathy Kristof 02.02.09, 12:00 AM ET
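The arithmetic in the quoted passage checks out, for what it’s worth (a quick sketch; the figures are taken directly from the article):

```python
# Figures from the quoted article (Census averages cited by Kristof).
college_avg = 57_500  # average annual earnings, college graduates
hs_avg = 31_600       # average annual earnings, high school only
years = 40            # working years assumed by the article

gap = college_avg - hs_avg
print(gap)                                      # 25900
print(round((college_avg / hs_avg - 1) * 100))  # 82 (percent more)
print(gap * years)                              # 1036000 -- "a tad over $1 million"
```

So the $1 million figure is faithful arithmetic on the Census averages; the dispute is over what the averages mean, not the multiplication.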
This is a good point, but I wouldn’t say that the statement “College graduates will earn $1 million more than those with only a high school diploma” only contains “a kernel of truth”. I would say that it is a completely accurate statement. College graduates do earn more than those with a high school diploma. She herself cites the Census as proof of this statement.
She also points out that just because Bachelor’s degrees are correlated with higher incomes doesn’t necessarily mean that Bachelor’s degrees cause higher income. That’s true. But isn’t it at least possible that Bachelor’s degrees do cause higher income?
And, finally, isn’t her argument about understanding the difference between correlation versus causation a subtle argument for more people to take statistics classes? Maybe in college? Because education is important? It’s likely that she, herself, learned of this concept in a statistics class at her alma mater, the University of Southern California.
The bottom line here is that statistics are important for navigating the modern world that we live in, and a great place to learn about statistical concepts (as well as many other worthy academic pursuits) is in college. In learning about these, however, no student should have to go so far into debt that they are financially ruined for the next ten, twenty, or thirty years, or possibly the rest of their lives.
Cheers.
Good Jobs in the Wild
I was reading thebernoullitrial blog and came across this link to an article in the Wall Street Journal about the best and worst jobs in the United States.
Number 3!
Cheers.