Oh, Canada / Privacy?
I was reading this article on Slate about a billionaire and his “de-facto spouse”. I came across this paragraph (emphasis added):
In 2002, “Eric” and “Lola” put an end to their decade-long de facto union. (These are pseudonyms used by the media, because Canadian law forbids the publication of the couple’s real names to protect the privacy of their children.) Lola, a Latin American woman, met Eric, a world-famous billionaire, when she was only 17 and he was 32. Although she wanted to get married throughout their relationship, Eric, who claims like many Quebecois that he “doesn’t believe in marriage,” refused.
Now, speaking as someone who wrote a dissertation about statistical disclosure control, this is not good. Authors often give examples of why you can’t release certain characteristics about an individual if you want to maintain privacy. Common examples include releasing occupation information for someone who has a very rare occupation (e.g. Senator, Prime Minister) or releasing salary information on someone who has a very large salary. Knowing this type of information severely limits the potential pool of people who could possibly be the target. Further, the more details you have about an individual (e.g. age, location, gender, birth date), the easier it is to identify that person.
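To make the shrinking-pool idea concrete, here is a minimal Python sketch. The records are invented toy data (not real people), and `candidate_pool` is a hypothetical helper name; the point is only that each extra known attribute can cut the number of matching individuals.

```python
# Hypothetical toy records, invented purely for illustration.
records = [
    {"sex": "M", "province": "QC", "birth_year": 1959, "occupation": "teacher"},
    {"sex": "M", "province": "QC", "birth_year": 1959, "occupation": "billionaire"},
    {"sex": "M", "province": "QC", "birth_year": 1972, "occupation": "teacher"},
    {"sex": "F", "province": "QC", "birth_year": 1959, "occupation": "teacher"},
    {"sex": "M", "province": "ON", "birth_year": 1959, "occupation": "teacher"},
    {"sex": "M", "province": "QC", "birth_year": 1959, "occupation": "teacher"},
]


def candidate_pool(records, **known):
    """Return the records that are consistent with every known attribute."""
    return [r for r in records if all(r[key] == value for key, value in known.items())]


# Reveal one more attribute at each step and watch the pool shrink.
steps = [
    {"sex": "M"},
    {"sex": "M", "province": "QC"},
    {"sex": "M", "province": "QC", "birth_year": 1959},
    {"sex": "M", "province": "QC", "birth_year": 1959, "occupation": "billionaire"},
]
pool_sizes = [len(candidate_pool(records, **known)) for known in steps]
for known, size in zip(steps, pool_sizes):
    print(sorted(known), "->", size, "candidate(s)")
```

In this toy data the pool goes from 5 to 4 to 3 to 1: the rare occupation alone finishes the job, which is the situation the examples above warn about.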
Here is the information from just that one paragraph on Slate (I should be clear here: I am not criticizing Slate at all. If Canada really values privacy, though, they need to think a little bit harder than this):
- “Eric” is a billionaire (a famous one, even).
- “Eric” was 32 in 1992. (This means he was born in 1959 or 1960 and is about 52 now.)
- “Eric” is from Quebec.
- “Eric” is a man.
- “Lola” is Latin American.
- “Lola” was 17 in 1992.
- “Lola” dated “Eric” for ten years.
- “Lola” is a woman.
I’m not going to do this right now, but I bet you can figure out who “Eric” is in about 4 minutes (according to commenter Brandon) with nothing more than Google. (Here is a good place to start.)
Canada, here are some articles you might be interested in:
- “Anonymized” data really isn’t—and here’s why not
- Exposed: The erosion of privacy in the Internet era
- What Information is “Personally Identifiable”?
Cheers.
MLB Rankings – 5/23/2012
StatsInTheWild MLB rankings as of May 23, 2012 at 11:51pm. SOS=strength of schedule
| Team | Rank | Change | Record | ESPN | TeamRankings.com | SOS |
| --- | --- | --- | --- | --- | --- | --- |
| Texas | 1 | – | 27-18 | 3 | 2 | 14 |
| Baltimore | 2 | ↑3 | 28-17 | 2 | 1 | 5 |
| Toronto | 3 | ↑4 | 24-21 | 8 | 5 | 3 |
| TampaBay | 4 | ↑4 | 27-18 | 5 | 4 | 4 |
| LADodgers | 5 | ↓1 | 30-13 | 1 | 3 | 30 |
| Boston | 6 | ↑7 | 22-22 | 16 | 9 | 2 |
| Atlanta | 7 | ↓5 | 26-19 | 4 | 7 | 19 |
| NYYankees | 8 | ↑1 | 23-21 | 10 | 10 | 1 |
| St. Louis | 9 | ↓6 | 24-19 | 6 | 13 | 29 |
| Washington | 10 | ↓4 | 26-18 | 7 | 6 | 17 |
| Cleveland | 11 | ↑10 | 25-18 | 9 | 8 | 12 |
| ChicagoWSox | 12 | ↑7 | 22-22 | 17 | 15 | 13 |
| Miami | 13 | ↓3 | 24-20 | 11 | 12 | 16 |
| Seattle | 14 | ↑6 | 21-25 | 25 | 17 | 8 |
| LAAngels | 15 | ↑7 | 20-25 | 20 | 22 | 6 |
| Cincinnati | 16 | ↓4 | 24-19 | 13 | 11 | 28 |
| Oakland | 17 | ↓3 | 22-23 | 19 | 14 | 7 |
| Detroit | 18 | – | 20-23 | 18 | 18 | 11 |
| Philadelphia | 19 | ↓4 | 22-23 | 14 | 24 | 18 |
| Houston | 20 | ↓6 | 21-23 | 23 | 20 | 25 |
| NYMets | 21 | ↓10 | 24-20 | 15 | 16 | 15 |
| SanFrancisco | 22 | ↓5 | 23-21 | 12 | 19 | 27 |
| Kansas City | 23 | ↑6 | 17-26 | 24 | 23 | 10 |
| Arizona | 24 | ↓1 | 19-25 | 22 | 25 | 23 |
| Pittsburgh | 25 | ↓1 | 20-24 | 21 | 21 | 26 |
| Milwaukee | 26 | ↑1 | 18-26 | 26 | 28 | 24 |
| Colorado | 27 | ↓1 | 16-27 | 27 | 30 | 21 |
| Minnesota | 28 | ↑2 | 15-28 | 29 | 26 | 9 |
| San Diego | 29 | – | 16-28 | 30 | 28 | 20 |
| Chicago Cubs | 30 | ↓5 | 15-29 | 28 | 29 | 22 |
Past Rankings: 5/14/2012 5/7/2012 4/30/2012 4/23/2012 4/16/2012 4/13/2012
Cheers.
The Harvard Sports Analysis Collective
By Ben Blatt
In 1964, Mosteller and Wallace published Inference and Disputed Authorship: The Federalist. The paper used statistical analysis to try to determine if James Madison, Alexander Hamilton, or John Jay was the author of the unaccredited essays that were part of The Federalist Papers. They approached this historical mystery by using differences in word frequencies and Bayesian statistics. While controversial, similar methods have been used to investigate other authorship debates such as Shakespeare’s sonnets and plays. The same can be done for sports articles. While it would certainly be easier to look at the author’s name right underneath the title than to perform a statistical analysis of authors, I thought it would be fun anyways.
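As a rough illustration of the word-frequency half of that approach (not the Bayesian model, and not the code behind the original post), here is a small Python sketch that computes how often a few marker words appear per 1,000 words. The marker words "upon", "while", and "whilst" are ones often cited in discussions of the Federalist study; the two sample sentences and the function name `rate_per_thousand` are made up for this example.

```python
import re
from collections import Counter


def rate_per_thousand(text, markers):
    """Occurrences of each marker word per 1,000 words of `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    return {marker: 1000 * counts[marker] / len(words) for marker in markers}


markers = ["upon", "while", "whilst"]

# Invented sample sentences standing in for two authors' writing.
sample_a = "Upon reflection the team played well, while the bullpen relied upon luck."
sample_b = "The team played well whilst the bullpen struggled, whilst fans waited."

for name, text in [("A", sample_a), ("B", sample_b)]:
    print(name, rate_per_thousand(text, markers))
```

With real articles you would use far longer texts and many more marker words, then compare the rates for a disputed piece against each candidate author's known rates.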
Math teachers have been using the “German Tank Problem” for a while to teach estimators. It goes something like this.
The Allies capture five German tanks. Suppose that the serial numbers on the tanks are 15, 23, 59, 83, and 109. Provide an estimate of the number of tanks that were produced.
You can see that your estimate would be a great deal lower than it would be if the serial numbers were 1015, 2394, and 9438.
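For a concrete number, here is a minimal Python sketch of the standard textbook estimator for this problem, N = m + m/k - 1, where m is the largest serial number seen and k is the number of tanks captured. It assumes the serials run consecutively from 1 and that the captured tanks are a random sample; the excerpt does not show its formula, so treating this as the one it refers to is an assumption, and `estimate_total` is a name chosen for this example.

```python
def estimate_total(serials):
    """Estimate the total produced as m + m/k - 1.

    m is the largest observed serial number and k is the sample size.
    Assumes serials are numbered 1..N and sampled at random.
    """
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one serial number")
    m = max(serials)
    return m + m / k - 1


small = estimate_total([15, 23, 59, 83, 109])
large = estimate_total([1015, 2394, 9438])
print(round(small, 1), round(large, 1))
```

The first sample gives an estimate of roughly 130 tanks, the second roughly 12,583, which matches the intuition that larger observed serial numbers push the estimate up.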
The story and accompanying math problem were written about in the Guardian. My favorite line, which turned out to be true, was this:
The statisticians believed that the Germans, being Germans, had logically numbered their tanks in the order in which they were produced.
It turns out that the statisticians were spot on:
By using this formula, statisticians reportedly estimated that the Germans produced 246 tanks per month between June 1940 and September 1942…
I’d like to see density estimates added to the histograms, but this is interesting.
There are currently 174 books on my Amazon wishlist that I could order directly from Amazon. (My wishlist has a total of 195 books, but 21 are only available from other sellers.) Total price is approximately $3,549 (I rounded all prices to whole dollars), for a mean of approximately $20 per book.
But the median price of a book on my wishlist is (again to the nearest whole dollar) $16; the difference between the median and the mean is a hint that the distribution is skewed. And there are actually two peaks — one centered on $10 and one centered on $16-17. The distribution looks like this:
I’ve cut off the histogram at $100, which omits Mitchell’s Machine Learning at a list price of $168.16. Here’s a zoomed-in version omitting the 23 most expensive (all those over $30):
The two peaks are easy to explain: paperbacks and hardcovers, respectively. The…

