Stats in the Wild

Auto-complete for “Rick Perry is” on the three big search sites on 8/29/2011.

Google	Yahoo	Bing
gay	an idiot	an idiot
an idiot	crazy	good
a rino	a scumbag	a crook
evil	not a conservative	bad
not a conservative	a republican	a scumbag
	evil	running for president
	awesome	right about education
	hot
	horrible
	a joke

Cheers.

Posted in Politics

3 Comments

Google auto-complete and Slate

Aug 28

Posted by statsinthewild

Recently, I posted (“Multidimensional Scaling, Republican Presidential Candidates, and ‘a douchebag’” and “Tracking the Republican candidates via google auto-complete”) about Google auto-complete and potential Republican presidential candidates. Slate.com posted a good piece called “Google’s GOP Search Suggestion” (a day after my original post, I should note) where they look at the auto-complete for candidates’ names using a Google image search rather than a straight Google search.

Cheers.

Posted in Uncategorized

Michele Bachmann: Internet search auto-complete terms

Aug 24

Posted by statsinthewild

The tables below are for the Google and Yahoo search “Michele Bachmann ” (including a space after the last name) on the dates indicated. Each column shows the date of the search and the top five Google or Yahoo auto-complete terms on that date.

Michele Bachmann – Google

8-17-2011	8-22-2011	8-23-2011	8-29-2011	8-31-2011 – present
quotes	quotes	quotes	quotes	quotes
corn dog	husband	husband	husband	husband
husband	bio	bio	bio	wiki
elvis	corn dog	slavery	slavery	husband gay
bio	slavery	husband gay	husband gay	hot

Michele Bachmann – Yahoo

8-24-2011	8-29-2011	9-1-2011	9-2-2011	9-7-2011
hot	hurricane	hurricane	new hair	campaign manager
for president	sarasota	irene	margaret thatcher	hurricane irene
minnesota	hot	hot	hot	hot
bio	for president	for president	for president	for president
feet	minnesota	minnesota	minnesota	minnesota

What does all this mean? I have no idea, but I suspect it will be difficult to win a Republican party nomination and then a general election with terms like “slavery” and “husband gay” attached to your name.

Another thought: I wonder whether the political affiliations of users are constant across the three major search sites, or whether there is, for instance, a greater percentage of liberals on Bing than on Google. Could you use auto-complete terms to gain any insight into this? Or is this type of information perhaps already available?

Cheers.

Posted in Politics

3 Comments

Tags: 2012 Presidential Election, google, Michele Bachmann, Politics, Republican

Duck Duck Go

Aug 23

Posted by statsinthewild

My friend Scot recently sent me a g-chat about a new search engine DuckDuckGo. According to their website “DuckDuckGo is a general purpose search engine like Google or Bing.” They then offer four bullet points:

Get way more instant answers
Less spam and clutter
Lots and lots of goodies
Real Privacy

Those first three sound interesting, but what really piqued my interest was the fourth bullet: Real Privacy. DuckDuckGo will not collect any of your browsing information, which, in turn, could have been used to identify you and potentially reveal what you are searching for. Many people might not have a problem with search engines collecting this information, but DuckDuckGo offers a very nice illustrated example of why it is potentially a problem. They go on to say in their privacy statement:

“It’s sort of creepy that people at search engines can see all this info about you, but that is not the main concern. The main concern is when they either a) release it to the public or b) give it to law enforcement. ”

“Why would they release it to the public? AOL famously released supposedly anonymous search terms for research purposes, except they didn’t do a good job of making them completely anonymous, and they were ultimately sued over it. In fact, almost every attempt to anonymize data has similarly been later found out to be much less anonymous than initially thought.”

That last line is particularly interesting. Two other examples that immediately come to mind are the GIC insurance example and the Netflix Prize example. In all three cases (GIC, AOL, and Netflix), data were released to the public for research purposes. And in all three, the releasing organization realized that it could not simply release the data as-is because of privacy concerns. They needed to do something to anonymize the data, so they did something (I’ve talked about this doing something before). But in all of these cases, the supposedly anonymous data turned out to be, to varying degrees, less anonymous than originally thought. Simple ad hoc procedures like deleting information simply don’t work in protecting privacy unless you live under a rock and have no access to auxiliary information. The only way to be completely safe is to release no data at all. However, releasing nothing to the public prevents valuable research from being done. While GIC, AOL, and Netflix all released data that ended up being less than anonymous, you have to applaud their effort to allow researchers to do what they do: research. The Netflix prize produced plenty of valuable research (Lesson from the Netflix Prize Challenge) and the GIC data had the potential to produce valuable, potentially life-saving public health research. Like most things in life, some balance must be found somewhere between the extremes, and the potential benefits of any research must be weighed against the potential costs of a privacy breach.
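To see why deleting names alone isn’t enough, here is a toy sketch of a linkage attack in R. Every value is invented, and the column names (zip, birthdate, sex, diagnosis) are only assumptions about what such a release might contain; this is not the actual GIC, AOL, or Netflix data.

```r
# "Anonymized" release: names deleted, quasi-identifiers kept (all values invented)
released <- data.frame(
  zip       = c("02138", "02139", "02138"),
  birthdate = c("1950-01-01", "1962-03-04", "1971-07-09"),
  sex       = c("M", "F", "F"),
  diagnosis = c("A", "B", "C")
)

# Auxiliary public data, e.g. something like a voter list (also invented)
public <- data.frame(
  name      = c("Person One", "Person Two"),
  zip       = c("02138", "02139"),
  birthdate = c("1950-01-01", "1962-03-04"),
  sex       = c("M", "F")
)

# Joining on the shared columns re-attaches names to the sensitive field
linked <- merge(public, released, by = c("zip", "birthdate", "sex"))
print(linked[, c("name", "diagnosis")])
```

Anyone holding the auxiliary table can do this join, which is why the "delete a column" approach only protects against people with no outside information.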

I see it like this: If you have something of value in your house, you wouldn’t leave the door wide open; you’d lock the door. But no matter what, if someone really wants to, they could break into your house with enough effort. Either way it’s still illegal/unethical. It’s the job of statisticians to put as many locks on the house as possible while still being able to reasonably use the house; it’s the lawyers’ job to prosecute people who break into the house, whether or not the door is well secured.

Cheers.

Posted in Privacy

Multidimensional Scaling, Republican Presidential Candidates, and “a douchebag”

Aug 17

Posted by statsinthewild

If you don’t want to read this whole thing, just check out the graph: Multidimensional Scaling: Republican Candidates – 8/16/2011

I was having a conversation with some friends today and someone mentioned that Rick Perry might have problems in the election because there were rumors he was gay. So I went to Google and typed in “Rick Perry is” and Google kindly offered me the following auto-complete options: “gay”, “an idiot”, “a rino”, “evil”, “not a conservative”. This got me thinking about how this compared with the other candidates’ Google auto-completes. For instance, if you google “Mitt Romney is” you get suggestions like “a mormon” and “an idiot” as well as three other suggestions. I did this for all of the major candidates (sorry Thaddeus) and recorded the five Google auto-complete suggestions.

Then I created a vector for each candidate based on the google auto-complete words. Each candidate was an observation and each word was a variable. The candidate would get a 5 if the word was first on their list, a 4 if it was second, and so on with a 0 if the word was not mentioned in their auto-complete.

I then used multidimensional scaling (the cmdscale function in R) to allow me to visually display the relative positions of the candidates to each other. This all led to this graphic: Multidimensional Scaling: Republican Candidates – 8/16/2011. The location of the circles is based on multidimensional scaling, the size of the circle is relative to their standings in a national poll taken from fivethirtyeight.com, and the top five google auto-completes are displayed in or near the appropriate circle.
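A minimal sketch of the scoring and scaling steps might look like this. Perry’s terms are the ones listed above; the other two lists are invented placeholders, and Euclidean distance (via dist) is an assumption, not necessarily what went into the graphic.

```r
# Top-five auto-complete terms per candidate, in rank order.
# Perry's list is from the text; CandA and CandB are invented placeholders.
terms <- list(
  Perry = c("gay", "an idiot", "a rino", "evil", "not a conservative"),
  CandA = c("an idiot", "a mormon", "hot", "evil", "a liar"),
  CandB = c("an idiot", "hot", "stupid", "crazy", "a liar")
)

vocab <- sort(unique(unlist(terms)))

# One row per candidate, one column per term:
# 5 for the first term, 4 for the second, ..., 0 if the term is absent
scores <- t(sapply(terms, function(tl) {
  s <- setNames(numeric(length(vocab)), vocab)
  s[tl] <- 5:1
  s
}))

# Classical multidimensional scaling on pairwise distances (Euclidean assumed)
d <- dist(scores)
coords <- cmdscale(d, k = 2)

plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords))
```

Candidates who share highly ranked terms end up with similar rows in scores, so they land close together in the two-dimensional plot.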

Some thoughts:

Every single candidate has the term “an idiot” as either the first or second auto-complete term
3 candidates were listed as “hot” (Palin, Bachmann, and Romney)
“stupid” was only used to describe women
Perry and Santorum (who has a much bigger Google problem than anything I’ve listed here) had “gay” listed in their auto-completes and Pawlenty had “definitely not gay”
Bachmann’s and Palin’s circles are nearly identical in size (11.7% and 11.4%, respectively) and words (they share “an idiot”, “hot”, and “stupid”)
“a douchebag” appears in auto-completes for Santorum, Gingrich, and Pawlenty. I imagine it will be hard to win with this word attached to your name. (John Kerry couldn’t do it.)
The only overwhelmingly positive google auto-complete was for Herman Cain whose fifth auto-complete option was “awesome”

It can’t be good for Perry that he is so close to Pawlenty and Santorum, but he does have a significant amount of support at this point. I’ll be interested to see how these Google auto-completes change over time and with the polls.

For information on how Google auto-complete works, click here.

Cheers.

Posted in Math Pictures, Politics

3 Comments

Tags: 2012 Presidential Election, Bachman, Herman Cain, Multidimensional Scaling, Palin, Pawlenty, Politics, Romney, ron paul, Santorum, Statistics

JSM 2011 – Statistical Disclosure Control via Suppression: Some thoughts

Aug 16

Posted by statsinthewild

I attended a session at JSM (in Miami. In August….) where the big topic was statistical disclosure control (SDC) via suppression in linked tables. As far as I can tell, suppression is the most popular and widely used method of SDC for tables, but it seems that the application of this procedure is extremely ad hoc. Data disseminating organizations all have different rules for when they feel that a cell count or total is too small to be released, and it seems that this is all done by educated guessing as to which cell values are unsafe. I wonder if there are any formal guarantees that can be provided to individuals or organizations whose data is being disseminated in tables and protected via suppression? Certainly tables with sensitive values cannot simply be released to the public and something needs to be done to address the problem. Cell suppression is something and it certainly offers something in the way of protection (what is that something?). But I wonder whether cell suppression is more about an appearance of privacy (which is still very important) than about actually providing privacy.
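A toy sketch of why a bare threshold rule worries me. The table, the threshold of 5, and the assumption that row totals are published alongside the table are all invented for illustration; real agencies’ rules differ.

```r
# Invented two-way table of counts (rows: regions, columns: categories)
counts <- matrix(c(12, 3, 20,
                   15, 9, 11),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("R1", "R2"), c("A", "B", "C")))

threshold <- 5  # hypothetical rule: suppress any cell below 5

# Primary suppression: blank out the small cells
published <- counts
published[counts < threshold] <- NA

# Assume the row totals are still published with the table
row_totals <- rowSums(counts)

# With a single suppressed cell in a row, subtraction recovers it exactly
recovered <- row_totals["R1"] - sum(published["R1", ], na.rm = TRUE)
print(recovered)
```

This is the reason additional (complementary) cells usually get suppressed as well; the sketch only shows the primary step, and says nothing about what guarantee the combined procedure provides.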

Of course, I’d love someone to respond to me and tell me that I am dead wrong and put my mind at ease.

Cheers.

Posted in JSM

JSM 2011 review / Miami in August!

Aug 15

Posted by statsinthewild

I don’t know how many of you have ever been to Miami Beach in August, but it’s not exactly…..comfortable. But that’s where my quest for knowledge took me in the first week of August to the Joint Statistical Meetings (JSM) 2011.

I attended a really interesting session “Statistical Analyses of Judging in Athletic Competitions: The Role of Human Nature”. I missed the first talk about racial bias in Major League Baseball (MLB) umpires, but I caught the last three, which were all very interesting. John Emerson, who organized the session, presented “Statistical Sleuthing by Leveraging Human Nature: A Study of Olympic Figure Skating”. Ryan Rodenburg (Blog: Sports Law Analytics) presented his paper “Perception ≠ Reality: Analyzing Specific Allegations of NBA Referee Bias”. His approach was rather interesting: rather than look for overall biases in NBA referees, he attempted to validate or invalidate specific allegations levelled against specific referees. His talk was followed by Kurt Rothoff who presented his paper “Bias in Sequential Order Judging: Primacy, Recency, Sequential Bias, and Difficulty Bias”, which focused on judging in gymnastics. One of the key findings of this work was, from his abstract, “Contestants who attempt higher difficulty increase their execution score, even when difficulty and execution scores are judged separately.” This makes me a little bit nervous: the better athletes are the ones attempting the more difficult routines, so they may receive higher execution scores simply because they are better athletes to begin with. Is there anyone out there who has ever been a competitive gymnast who has any thoughts on this?
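The worry can be illustrated with a small simulation. Everything here is made up (the variable names, the effect sizes, the sample size); it is only an illustration of the confounding concern, not a claim about how the paper’s analysis was done.

```r
set.seed(1)
n <- 500

# Latent ability drives both the difficulty attempted and the execution quality
ability    <- rnorm(n)
difficulty <- ability + rnorm(n)
execution  <- ability + rnorm(n)   # no direct effect of difficulty on execution

# Naive regression: difficulty appears to raise execution scores
naive <- coef(lm(execution ~ difficulty))["difficulty"]

# Controlling for ability (not directly observable in practice)
# removes the apparent effect
adjusted <- coef(lm(execution ~ difficulty + ability))["difficulty"]

print(c(naive = unname(naive), adjusted = unname(adjusted)))
```

In this setup the judges are perfectly unbiased by construction, yet the naive slope is clearly positive, so a positive association alone can’t distinguish judging bias from better athletes simply being better.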

Cheers!

Posted in JSM

Chernoff Faces from aplpack

Aug 7

Posted by statsinthewild

I’ve been playing around with the faces function from the R package aplpack. I haven’t used it in a while, and there are some features that are either new or that I never noticed before. Color has been added to the faces, and you can now place the faces at chosen positions on a plot. There is also the superfluously fantastic option of displaying the faces as Santa Claus.

Here are some of my examples:

Golf: Statistics from several of my friends collected via oobgolf.com. (I’m SITW on the lower right.) The face is handicap, the mouth is scoring average, the eyes are average putts, the hair is the percentage of fairways hit, nose is greens in regulation (GIR), and ears are the total number of rounds you play. The faces are plotted with fairway percentage on the x-axis and GIR on the y-axis.

Santa_Golf: Same golf data with Santa option.

NFL2010: Final NFL regular season team statistics. The face represents the offense and the defense is represented by hair. The size of the nose indicates sacks, and the ears indicate turnovers (ear width is interceptions; ear height is forced fumbles). The eyes indicate penalties and, finally, the size of the mouth indicates wins, with a smiling face if the team made the playoffs (a really nice touch, if you ask me). The face at the bottom right indicates the league leader.

Some observations on the NFL faces: The two Super Bowl teams last year (Pittsburgh and Green Bay) are both located at the bottom of the graph and their faces look very, very similar. San Diego looks similar to both Green Bay and Pittsburgh (similar face, nose, eyes, and hair), but the big differences are the ears and, of course, the San Diego face is frowning. Another thing that pops out at me is how similar Houston and New England look to each other. They have very similar face shape, eyes, and hair. The big differences are the nose and ears (sacks and turnovers).

Cheers.

##NFL CODE

library(aplpack)

##Reading the data (forward slashes, since "\" starts an escape sequence in an R string)
x<-read.csv("/StatsInTheWild/NFL2010.csv",header=TRUE)

##Adding a 33rd row that will hold the league leader
x[33,]<-x[32,]

##Team abbreviations and labels
x$abbr<-sort(c("NE","NYJ","Mia","Buf","Pit","Bal","Cle","Cin","Ind","Jac","Hou",
"Ten","KC","SD","Oak","Den","Phi","NYG","Dal","Was","Chi","GB","Det","Min","Atl",
"NO","TB","Car","Sea","StL","SF","Ari","ZZ"))
x$abbr[27:28]<-c("SF","Sea")
x$abbr[33]<-"League Leader"
##Do not drop this line: it fixes the position of the lab column, and later code indexes columns by number
x$lab<-paste(x$abbr,x$W,sep=": ")
x$TOP<-as.numeric(substring(x$TOP.x,1,2))

##Playoff Teams: creating a playoff indicator
rows<-c(2,3,6,12,14,16,19,20,22,24,25,28)
x$playoffs<-rep(0,33)
x$playoffs[rows]<-1

##Finding the league leader in all variables
##(max for every numeric column, then min for the columns listed in def)
num<-sapply(x,is.numeric)
x[33,num]<-sapply(x[,num],max)
def<-c(6,22:23,26:29)
x[33,def]<-sapply(x[,def],min)
##Rebuilding the labels now that row 33 has its final values
x$lab<-paste(x$abbr,x$W,sep=": ")

##Defining the names
names(x)[c(2,3)]<-c("Wins","Losses")
names(x)[c(13,14,15,16)]<-c("Off PPG","Off YPG","Off Pass","Off Rush")
names(x)[c(22,23)]<-c("Penalties","Pen Yards")
names(x)[c(26:29)]<-c("Def PPG","Def YPG","Def Pass","Def Rush")
names(x)[c(5:6)]<-c("Points For","Points Against")

pdf("/StatsInTheWild/NFL2010.pdf",width=15,height=10)

##Columns used for plotting
x<- x[order(x[,4]),]
plot.cols<-c(5,6)

##Offense = face, Defense = hair, penalty= eyes, Wins and playoffs = mouth, turnovers = ears
##Columns used for faces: which columns am i going to use for the data
col<-c(15,16,14,2,2,41,22,23,28,29,27,36,36,30,32)

##creating the faces without plotting them.
a<-faces(x[,col],labels=x$lab,face.type=1,plot=FALSE)

##creating text for the legend
g<-paste(a[[2]][,1],a[[2]][,2],sep=": ")

##building the plot
plot(x[,plot.cols],bty="n",xlim=c(200,600),main="2010 NFL Season")
text(rep(540,15),seq(475,325,length.out=15),g)

##plotting the faces
plot.faces(a,x[,plot.cols[1]],x[,plot.cols[2]],width=30,height=30)

dev.off()

Posted in Math Pictures, Sports

1 Comment

Defense

Jul 19

Posted by statsinthewild

Tomorrow I defend my dissertation. If all goes well, you will all have the opportunity to finally call me doctor; I know you are just as excited about this as I am. Hopefully, I will have more time to write blog posts post-defense.

Cheers.

Posted in Uncategorized

1 Comment

Dissertating (in the wild)

May 20

Posted by statsinthewild

Well, I’m almost done with my dissertation, which means I’m almost done with my Ph.D. And when I say done, I mean it in both the senses of “finished” and “sick of”. I have a nearly complete document AND a defense date. Now all I have to do is put the most important skill I learned in grad school to good use: finding and filling out paperwork. Anyone can write a 100+ page dissertation filled with original thoughts, but only the best and brightest can jump through all of the bureaucratic hoops to actually complete the degree.

Anyway, I really enjoyed my dissertation topic, which, I hear, is not something that everyone experiences. I’ll eventually come back to the topic (statistical disclosure limitation), but I really just need some time away from it. I’ll get my wish as I’ll be starting a post-doc this summer researching statistical genetics, which I am probably a little overexcited to start.

Cheers.

Posted in Uncategorized

1 Comment
