The New Netflix Prize (in the wild)
I know it’s been a while since I’ve posted. I’ve been using Twitter (follow StatsInTheWild) a lot more for the smaller posts.
Anyway, I’ve been preparing and studying for my general exam (you might call it a comprehensive exam), and so I’ve been reading a lot about disclosure limitation, my dissertation topic. In putting together a presentation explaining why disclosure control is necessary, I’ve listed two examples of really bad disclosures. The first was presented by Latanya Sweeney in her paper on k-anonymity. She took supposedly anonymized data released by the Group Insurance Commission (GIC) and, using publicly available voting records, identified former Massachusetts Governor William Weld.
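The core of Sweeney’s attack is just a database join on quasi-identifiers (ZIP code, birth date, sex) shared between the “anonymized” release and a public record with names attached. Here’s a toy sketch of the idea; all of the names, dates, and records below are invented for illustration, not drawn from the actual GIC or voter data.

```python
# Toy Sweeney-style linkage attack: join "anonymized" records with a
# public voter list on shared quasi-identifiers. All data is made up.

medical = [  # "anonymized": names removed, quasi-identifiers kept
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "X"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "Y"},
]

voters = [  # public voter roll: names AND the same quasi-identifiers
    {"name": "Alice Adams", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "Bob Brown", "zip": "02139", "dob": "1962-01-15", "sex": "F"},
]

def link(medical, voters):
    keys = ("zip", "dob", "sex")
    matches = []
    for m in medical:
        hits = [v for v in voters if all(v[k] == m[k] for k in keys)]
        if len(hits) == 1:  # a unique hit re-identifies the record
            matches.append((hits[0]["name"], m["diagnosis"]))
    return matches

print(link(medical, voters))  # each unique match attaches a name to a diagnosis
```

Sweeney’s point was that the combination of ZIP, birth date, and sex is unique for a large fraction of the U.S. population, so unique hits like these are common, not rare.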
This is a huge problem in many areas of research. So many people rely on publicly released data for research, but organizations may be apprehensive about releasing their data due to privacy concerns. My second example involves the data from the Netflix Prize. Narayanan and Shmatikov (2008), in their paper “How To Break Anonymity of the Netflix Prize Dataset”, use Netflix Prize data along with Internet Movie Database data.
They say in their abstract: “We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”
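To get a feel for why “only a little bit” of background knowledge is enough, here is a toy version of the idea (not their actual algorithm): score every record in the released data by how well it matches the handful of (movie, rating) pairs the adversary knows, and declare a match only when the best score clearly stands out from the runner-up. The records, tolerance, and threshold below are all invented for illustration.

```python
# Toy Narayanan–Shmatikov-style matching: sparse records mean a few
# known ratings often pin down one record. All data here is made up.

records = {
    "user_1": {"MovieA": 5, "MovieB": 1, "MovieC": 4},
    "user_2": {"MovieA": 5, "MovieD": 2},
    "user_3": {"MovieB": 1, "MovieC": 4, "MovieE": 3},
}

known = {"MovieA": 5, "MovieB": 1}  # adversary's background knowledge

def score(record, known, tolerance=1):
    # count known ratings the record matches to within a tolerance,
    # so the attack tolerates small perturbations in the data
    return sum(1 for movie, rating in known.items()
               if movie in record and abs(record[movie] - rating) <= tolerance)

scores = sorted(((score(rec, known), uid) for uid, rec in records.items()),
                reverse=True)
best, runner_up = scores[0], scores[1]
if best[0] - runner_up[0] >= 1:  # the best match must stand out from the rest
    print("likely match:", best[1])
```

Because ratings data is so sparse and high-dimensional, two or three known movies are often enough for one record to stand out like this, even with noisy background knowledge.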
So what is the problem with disclosing what movies someone is watching? It’s illegal. Check out the Video Privacy Protection Act of 1988 (18 U.S.C. 2710 (2002)). So someone decided to sue Netflix. This article from the Privacy Law Blog has a great summary of the situation: Netflix Sued for “Largest Voluntary Privacy Breach To Date”.
From that article: “Plaintiffs argue this disclosure constitutes a severe invasion of their privacy by Netflix, which violates, among other things, the Video Privacy Protection Act of 1988 (18 U.S.C. 2710 (2002)). Additionally, the lead plaintiff in this case, Jane Doe, claims that Netflix’s disclosure of her movie rental history and ratings has and/or will ‘identify or permit inference of her sexual orientation… [which… ] would negatively affect her ability to pursue her livelihood and support her family, and would hinder her and her children’ ability to live peaceful lives within Plaintiff Doe’s community.'”
If you’re a lawyer, I’d love a quick comment on this case. So anyway, that’s some background, and it brings me to my main point. I was going to suggest that Netflix run another Netflix Prize, but after checking, I found they have already decided to do that.
So instead, I’ll just suggest what the second Netflix contest should be. This contest would offer a prize for figuring out a way of releasing Netflix data such that valid inference can still be made while, at the same time, maintaining privacy. The format would be as follows: Netflix brings in experts in statistical disclosure limitation (for example, Jerry Reiter; see his papers on the subject) to create private versions of the Netflix data for release to the public (for example, with synthetic data).
Say there were 10 experts. Netflix would put $100,000 in escrow for each expert. Anyone who can demonstrate a privacy breach in any of the private data sets within 12 months gets the $100,000 of the expert who created that data set. If twelve months elapse without a breach, the expert keeps the $100,000 (let’s say they have to donate a portion to the charity of their choice). The other part of the contest would be similar to the first Netflix Prize, but the private data would be used in the modeling efforts. One suggestion I read for the second Netflix Prize was predicting churn.
Whoever comes up with the best (by some appropriate criterion) model for predicting churn using the private data wins $1,000,000. Further, whichever expert’s private data was used to create the best model wins another $100,000. So the experts have a financial incentive to create private data that is as useful as possible. Users have two
incentives: 1) demonstrating a privacy breach or 2) improving churn models. They can work on either or both. If a privacy breach is ever demonstrated for a particular data set, all models built on that data set are disqualified. One potential problem with this proposal is defining what exactly counts as a privacy breach; it’s up to Netflix to decide these details.

Framing the contest this way will accomplish several goals. First, showing that they are concerned with privacy may earn them points with customers who worry about such things. Second, if they want to keep running Netflix prizes, and I suspect they do, they are going to run into this privacy problem over and over again. By dealing with it now, they will be able to continue their Netflix prizes by releasing useful data to the public for research in a private way. Further, if they demonstrate a way to release data with high utility and high privacy, other organizations that want to release data could use the protocols that Netflix pioneered with their second Netflix Prize.
Also, Netflix, if you are reading this and you want to hire me, I could be lured away from grad school for the right price.