Data Privacy in the Wild
I’m currently working on a systematic literature review of statistical issues of privacy. One of the papers I was reading today was from the Census. This paper discusses how the Census maintains the privacy of individuals when it releases microdata to the public.
(I’ve been reading some papers which propose methods that are much more sophisticated and (hopefully) provide more privacy than the rather simple methods that the census mentions in this paper. I have no opinion on whether or not these methods are “sophisticated enough” to maintain privacy, I just think it’s interesting to see what the census is actually doing.)
The paper mentions three techniques that the census currently uses to maintain privacy.
1.) release of data for only a sample of the population
2.) limitation of detail
3.) top/bottom-coding of sensitive continuous fields
Here is an explanation from the census as to why they use these techniques:
“The Census Bureau currently uses several standard techniques to mask microdata sets. The first is a release of data for only a sample of the population. Intruders (i.e., those who query the file for the sole purpose of identifying particular individuals with unique traits) realize that there is only a small probability that the file actually contains the records for which they are looking. The Bureau currently releases three public use samples of the decennial census respondents. One is a 1 percent sample of the entire population, the second a 5 percent sample, and the third a
sample of elderly residents. Each is a systematic sample chosen with a random start. None of these files overlap, so there is no danger of matching to each other. Most demographic surveys are 1-in-1000 and 1-in-1500 “random” samples. Generally the public use file for each survey contains records for each respondent. The second technique involves the limitation of detail. The Census Bureau releases no geographic identifiers which would restrict the record to a sub-population of less than 100,000. It also “recodes” some continuous values into intervals and combines sparse categories. Intruders must have extremely fine detail for other highly sensitive fields in order to positively
identify targets. The third technique protects the detail in sensitive responses in continuous fields. It is referred
to as top/bottom-coding. This method collapses extreme values of each sensitive field into a single value. For example, the record of an individual with an extremely high income would not contain his exact income but rather a code showing that the income was over $100,000. Similarly the low-income records would contain a code signifying the income was less than $0. In this example $0 is a bottom-code and $100,000 a top-code for the sensitive or high visibility field of income.” –CONTROLLED DATA-SWAPPING TECHNIQUES FOR MASKING PUBLIC USE MICRODATA SETS, Richard A. Moore, Jr.
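Two of the techniques in the quote are mechanical enough to sketch in a few lines of Python. This is purely my own illustration, not the Bureau’s actual code: the function names and the 1,000-record dummy population are made up, while the $0 bottom-code and $100,000 top-code come straight from the paper’s income example.

```python
import random

def systematic_sample(records, rate):
    """Take a 1-in-k systematic sample with a random start,
    as the Bureau describes for its public use files."""
    k = round(1 / rate)              # e.g. rate=0.01 -> keep every 100th record
    start = random.randrange(k)      # random start within the first interval
    return records[start::k]

def top_bottom_code(income, bottom=0, top=100_000):
    """Collapse extreme values of a sensitive field into a single code,
    following the paper's income example ($0 bottom, $100,000 top)."""
    if income < bottom:
        return f"<${bottom:,}"       # bottom-coded
    if income > top:
        return f">${top:,}"          # top-coded
    return income                    # in-range values released as-is

population = list(range(1, 1001))              # 1,000 dummy records
sample = systematic_sample(population, 0.01)   # a 1 percent sample
print(len(sample))                             # 10
print(top_bottom_code(250_000))                # '>$100,000'
```

The random start is what keeps a systematic sample from being trivially predictable: an intruder can’t know in advance which 1-in-100 slice of the population landed in the file.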
The rest of the paper goes on to discuss a more sophisticated method of maintaining privacy called data swapping (first proposed by Dalenius and Reiss (1980)). If you’re not bored to tears already, then it’s probably worth reading about. I think it’s interesting.