Genomic Privacy #jsm2014
This is the first of the blog posts in my “backblog” pertaining to #jsm2014.
My dissertation work was in statistical disclosure control and my post-doc work was in genetics. Almost immediately after starting my post-doc, I realized that privacy issues and genetic data seem to go hand in hand. (I recently submitted an R21 to the NIH about using synthetic data to protect privacy in genome-wide association data. It was not funded. I will re-submit.) Anyway, the point is that when I saw this session entitled “Genomic Privacy: Risk and Protection Methods”, I absolutely had to go to it. The talks were all fantastic.
Here is what they say about CDP in their abstract:
We introduce Concentrated Differential Privacy (CDP), a relaxation of Differential Privacy geared towards improved accuracy and utility.
Like Differential Privacy, Concentrated Differential Privacy is robust to linkage attacks and to adaptive adversarial composition. In fact, it composes as well as Differential Privacy, while permitting (significantly) smaller noise in many settings.
This seems like a good step forward for differential privacy and attempts to address some of the very real issues with the method.
Guy’s introductory slides were a fantastic explanation of the problem at hand and he made a lot of really interesting points. One of the reports he mentioned in his intro slides was the big data review by the federal government. I haven’t finished reading all of this yet, but what I have read is really interesting. Check out the report here: Big Data Review (This report deserves its own blog post.)
Here is a list of some other points that I wrote down as quickly as I could during his talk:
- Anonymization (removing identifying attributes) does not seem robust in the context of big data. Defeated by the presence of big data. (Netflix Prize as an example of failed anonymization.)
- Concern 1.) Linkage attacks via partial information (ALL adversaries have partial information about us)
- Concern 2.) Composition: Each query is private in isolation, but not necessarily in multiple analyses
- Concern 3.) “Just release statistics”: Attacks include differencing and big bang attacks. (e.g. Query 1: How many sickle cell individuals are in the DB? Query 2: How many sickle cell individuals not named Guy Rothblum are in the DB? The difference of these queries yields the status of Guy Rothblum.)
- Intuition of differential privacy: “Bad things can happen, but not because you participate in the database.”
- Advantages of differential privacy: 1.) quantifiable 2.) handles linkage attacks/auxiliary data
- Concentrated differential privacy improves the noise addition by an order of magnitude. Better accuracy, mildly relaxed privacy.
- “A social choice must be made” This is a great point. Once we can quantify privacy, which we don’t all agree on yet, we need to have a discussion about how much privacy we want.
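The differencing attack in Concern 3, and the kind of noise addition that differential privacy uses to blunt it, can be sketched in a few lines. This is only a toy illustration and not anything shown in the talk: the database, the names other than the one in the example above, and the helper functions are all made up, and the noise shown is the standard Laplace mechanism for plain ε-differential privacy, not the concentrated variant.

```python
import random

# Toy database, invented for illustration: name -> has the trait?
db = {"Alice": False, "Bob": True, "Carol": False, "Guy": True}

def count_trait(records, exclude=None):
    """Exact count of people with the trait, optionally leaving one name out."""
    return sum(has for name, has in records.items() if name != exclude)

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(records, epsilon, rng, exclude=None):
    """Laplace mechanism: a counting query has sensitivity 1, so the scale is 1/epsilon."""
    return count_trait(records, exclude) + laplace_noise(1 / epsilon, rng)

# Differencing attack on exact answers: two harmless-looking counts leak one person.
q1 = count_trait(db)                 # how many people have the trait?
q2 = count_trait(db, exclude="Guy")  # how many people not named Guy have it?
leaked = q1 - q2                     # 1 means Guy has the trait, 0 means he does not

# With noisy answers the difference no longer pins down Guy's status.
rng = random.Random()
blurred = noisy_count(db, 0.5, rng) - noisy_count(db, 0.5, rng, exclude="Guy")
```

Smaller values of epsilon mean more noise and more privacy, which is exactly where the “social choice” comes in. Note that each noisy query spends privacy budget, so repeating the two queries many times and averaging would slowly recover the answer; that is the composition concern above.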
The second speaker was Bradley Malin from Vanderbilt, who spoke about “Anonymization of Phenotypic data to support genotype association studies”.
Some bullet points from his talk:
- Two quotes I took away from his first few slides were “Hurdle not Fort Knox” and “Possible doesn’t imply probable”. In terms of the first quote, my boss always describes this in terms of breaking into a house: Just because it’s illegal to break into someone else’s home, doesn’t mean we don’t lock our doors. But you don’t necessarily need bars on the windows either. We can’t simply rely on the law to deter adversarial data users, but we also don’t need to go overboard.
- “Often we use very strong adversary models. But almost perfect results can be achieved…. in the real world. We must be ‘reasonable and practical’” I am totally guilty of this. A few of my articles on the topic make very strong assumptions about what the adversary knows (often “worst case scenario”). My more recent papers are relaxing these worst case scenario assumptions.
- Examples of things that could potentially identify an individual: demographics, diagnosis codes, laboratory, DNA, location visits, movie reviews.
- Malin introduces a procedure named UGACLIP for anonymization in a GWAS setting.
- Malin talked about how stakeholders should somehow participate in the decision as to how much privacy they think they should have, but many people have no idea how to interpret the numbers. (i.e. The average person on the street doesn’t have any idea how secure 5-anonymization is.)
- When sharing data with NIH the generally accepted value of k (as in k-anonymization) is 5.
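For readers who have not seen k-anonymity before, here is a minimal sketch of what the number k measures: the size of the smallest group of records that share the same values on the quasi-identifying attributes. The records and column names below are invented for illustration, and this only checks k; it is not UGACLIP or any other anonymization procedure.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)

# Hypothetical released records: five people in one group, two in another.
rows = (
    [{"age_band": "30-39", "zip3": "372", "diagnosis": "A"} for _ in range(5)]
    + [{"age_band": "40-49", "zip3": "372", "diagnosis": "B"} for _ in range(2)]
)

k = k_anonymity(rows, ["age_band", "zip3"])
meets_k5 = k >= 5  # False here: the 40-49 group has only two records
```

A data set is k-anonymous when every such group has at least k members, so this toy release would need further generalization or suppression before it reached k = 5.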
Speaker 3 was Fei Yu of Carnegie Mellon who presented joint work with Stephen Fienberg. Fei spoke about “Scalable Privacy preserving data sharing methodologies for GWAS.” (Full article on arXiv).
- One of the big privacy concerns of GWAS is that even aggregate statistics from a GWAS (MAF, χ² statistics, regression coefficients) do not provide perfect privacy. Homer et al. (2008) showed that an intruder may be able to infer that someone has participated in a GWAS. This caused the NIH to review its data sharing policy of GWAS data.
- Fei presented work on how to share the top M SNPs (in terms of their significance) and achieve ε-Differential Privacy.
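To give a flavor of how a top-M selection can be made ε-differentially private, here is a generic sketch that picks M items one at a time with the exponential mechanism, spending ε/M of the budget on each pick. I am not claiming this is the algorithm from the talk or the paper. The function name, the toy scores, and the `sensitivity` argument (how much one person's data can change any single score) are assumptions for illustration.

```python
import math
import random

def dp_top_m(scores, m, epsilon, sensitivity, rng=None):
    """Select m indices by 'peeling': m rounds of the exponential mechanism,
    each using epsilon / m, so the selection as a whole is epsilon-DP by
    sequential composition (given a correct sensitivity bound)."""
    rng = rng or random.Random()
    remaining = list(range(len(scores)))
    eps_each = epsilon / m
    chosen = []
    for _ in range(m):
        # Subtract the current maximum before exponentiating to avoid overflow.
        top = max(scores[i] for i in remaining)
        weights = [
            math.exp(eps_each * (scores[i] - top) / (2 * sensitivity))
            for i in remaining
        ]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# Made-up association scores for five SNPs.
toy_scores = [1.0, 5.0, 3.0, 9.0, 2.0]
selected = dp_top_m(toy_scores, m=2, epsilon=1.0, sensitivity=1.0, rng=random.Random(1))
```

This releases only which SNPs were selected. Publishing their statistics as well would need additional noise and additional privacy budget.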
Finally, Hae Kyung Im of the University of Chicago spoke about “On Sharing Quantitative Trait GWAS Results in an Era of Multiple-Omics Data and the Limits of Genomic Privacy”. Since I am now a resident of the Midwest, I meant to try to meet Hae Kyung Im, but at the end of the session I bumped into someone else and ended up missing her. (This happens all the time at JSM. You try to do one thing, but then you bump into someone you haven’t seen in 5 years.)
- “For full advantage, broad sharing of data of results is needed; Must be careful about privacy.” I totally agree with this. There is so much potential benefit to sharing this type of data that we can’t just lock it up in a database and throw away the key.
- “Summary studies are considered safe, BUT with GWAS studies we may have millions of SNPs”. With big data our previous ideas about what is safe to release need to be re-evaluated. She again cited Homer et al. (2008), noting based on that article: Even if the DNA sample was a mixture of 1000 or more individuals, they were able to determine with high accuracy whether a given individual was in the sample or not. This is “Great for forensics, but has consequences for GWAS”.
- The question she posed in terms of GWAS is basically “Can we publish regression coefficients?” I was going to try to summarize her results, but I can’t interpret what I wrote about her results, so I’ll just wait until I get her slides. (I tweeted (Twitter is awesome) at her, and hopefully she will be kind enough to share them.)
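The intuition behind the Homer et al. (2008) result can be sketched with a small simulation: a person who is in the study pool sits slightly closer to the pool's allele frequencies than to the reference population's at each SNP, and summing that tiny difference over thousands of SNPs makes it visible. Everything below is simulated and simplified. In particular, I use the true population frequencies as the reference and just compare raw sums, where the paper uses a reference sample and a formal test.

```python
import random

def simulate_genotype(freqs, rng):
    """One person's genotype per SNP, scaled to {0, 0.5, 1} from two allele draws."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]

def membership_statistic(person, pool_freqs, reference_freqs):
    """Sum over SNPs of |Y - Pop| - |Y - M|: large positive values mean the
    person is closer to the pool (M) than to the reference population (Pop)."""
    return sum(
        abs(y - r) - abs(y - m)
        for y, m, r in zip(person, pool_freqs, reference_freqs)
    )

rng = random.Random(2014)
n_snps, n_people = 10000, 50  # toy sizes chosen so the script runs quickly

reference_freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
study = [simulate_genotype(reference_freqs, rng) for _ in range(n_people)]
# Published "aggregate" data: per-SNP allele frequency in the study pool.
pool_freqs = [sum(column) / n_people for column in zip(*study)]
outsider = simulate_genotype(reference_freqs, rng)

member_stat = membership_statistic(study[0], pool_freqs, reference_freqs)
outsider_stat = membership_statistic(outsider, pool_freqs, reference_freqs)
```

In this toy setup the statistic for someone in the pool should come out clearly larger than for someone who is not, even though only per-SNP averages were released. That is the sense in which the number of SNPs works against the usual belief that summary statistics are safe.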
A discussion then ensued, which led Stephen Fienberg to make (roughly) the following statement:
I think that there is a misconception. In very high dimensional problems as in GWAS with the auxiliary data every individual is unique. You must find something to share. We cannot go on saying we can’t share anything. If I were on your IRB there is no issue with the faculty doing this. The only issue is what do they publish on what they have done. And I don’t think there is an IRB that understands how to do that in the entire country.
One final thought: I was struck by how few people were in this session. This is already a big deal. Right now with just a single hair, someone can reproduce your face up to a family resemblance. Who knows what we’ll be able to do in 5, 10 or 50 years? Privacy is a big deal.