Predator X in the Wild

I was watching the show “Predator X” on the History Channel tonight. Apparently, they discovered this fossil of an enormous aquatic predator. It’s pretty awesome.

Here is a description of the bite force of this predator: (from here)
“At St Augustine Alligator Farm and Zoological Park in Florida , Dr. Hurum assisted evolutionary biologist Dr. Greg Erickson from Florida State University in calculating the bite force of this colossal creature. The jaws held in place a set of trihedral teeth, each measuring 12 inches, which clamped down on prey with an estimated 33,000lbs of bite force. The calculation is one of the largest bite forces ever calculated for any creature. Predator X would have had a bite a bite force was more than ten times the bite force of any animal alive today and four times the bite force of a T- Rex.”

(Here is a link of average bite forces for humans and a few selected animals.)

At this point you might be saying, “That IS awesome. But what does it have to do with this blog?” An astute observation. Well…

They estimated that the bite force of the predator was 33,000 pounds. The way they estimated this was by taking measurements of the bite force of different sized crocodiles (or alligators, I can never tell the difference). Then they plotted the data in a scatter plot, weight of crocodile versus bite force. There was a clear positive relationship between bite force and size of the animal. Then they fit a simple regression line through the data and extrapolated how much force a 50 foot long animal that weighed an estimated 45 tons could pack in its bite. That’s how I believe they came up with their estimated bite force. I’ll give them the benefit of the doubt and assume they did more than that to come up with the estimate but they didn’t want to show the details on the History Channel. (If you have details on how they estimated the bite force, please send them my way.)

Let’s assume that all they did was extrapolate this simple regression line. What would be the problem with that? The problem is that they are extrapolating the linear trend outside the domain of their data. There is no guarantee that the bite force trend remains linear as the weight approaches the estimated 45 tons. They collected their data on crocodiles, which can weigh up to 1.5 tons. It seems naive to assume that the linear trend will continue all the way out to an animal 30 times heavier.

Here is a simple example of why extrapolating outside of your domain is a bad idea.
Say you collected data on children’s ages and heights and you fit a regression line through the data. You’ll surely observe a positive relationship between age and height. As children get older, their heights generally increase. This increase can be approximated by a roughly linear trend, say between the ages of 10 and 18. Also, say that we find that children grow an average of 1 inch per year between 10 and 18. If I were then to predict the height of a person by extrapolating out this trend, I would conclude that a 48 year old would be, on average, 30 inches taller than an 18 year old. Clearly, this is not true.

So just because a trend is linear over a certain domain does not mean that that linear trend continues outside of the tested domain.
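If you want to see the height example in numbers, here is a minimal sketch. Everything in it is made up for illustration: the starting height of 54 inches at age 10 and the assumption that growth stops dead at 18 are mine, not real growth data. It fits a least-squares line to ages 10 through 18 and then (wrongly) pushes that line out to age 48.

```python
# Toy illustration only: the "true" growth curve below is an assumption,
# not real data. Heights are in inches.

def true_height(age):
    """Hypothetical height: 54 in. at age 10, +1 in./year until 18, then flat."""
    return 54.0 + 1.0 * (min(age, 18) - 10)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Fit only on the domain where we "collected data": ages 10 through 18.
ages = list(range(10, 19))
heights = [true_height(a) for a in ages]
slope, intercept = fit_line(ages, heights)

# Extrapolate far outside that domain.
predicted_48 = intercept + slope * 48
actual_48 = true_height(48)

print(f"fitted slope: {slope:.2f} inches per year")
print(f"extrapolated height at 48: {predicted_48:.1f} inches")
print(f"toy 'true' height at 48:   {actual_48:.1f} inches")
```

The line fits the 10 to 18 data perfectly and still misses by 30 inches at age 48, because nothing in the fitted data says where the linear trend stops.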

Cheers.

Choosing your bracket in the wild.

Check out page 35 of this old issue of Chance magazine, which has an article about the best way to pick NCAA basketball teams to win your office pool. Of course this probably would have been more of a help if I had posted it five days ago, but it’s still interesting. And you can use the advice next year.

Cheers.

ENAR in review in the wild

So I’m back from ENAR and back from spring break. I’ve been greeted back to grad school by a midterm on Friday night from 6-8pm. What a fun time for a midterm!

Let me start by saying that San Antonio is awesome. The River Walk is great. I ate at two great Mexican restaurants for lunch two days in a row and they were both incredible. And I drank Shiner Bock the whole time, which I highly recommend.

Anyway, on Sunday night I presented a poster at ENAR (they served Shiner Bock during the poster presentations) about a paper that we wrote (and recently published) about synthetic data with binary variables. I met some very interesting people who stopped by my poster. One guy who stopped by informed me that he coined the term predictive mean matching (which I referenced on my poster). So, I asked him who he was, and he told me he was Rod Little (he’s kind of a big deal). He wrote the book on multiple imputation: Little, R.J.A. & Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: John Wiley. So that was kind of neat. (I just visited Rod Little’s website and apparently this is the “most useful of all links”.) (Here is another interesting article called “Calibrated Bayes: A Bayes/Frequentist Roadmap”.)

The next day I spoke with some people from SAS and STATA, as well as some recruiters from Smith-Hanley (who I got my first job through) and Cambridge Group. The SAS people told me about a product called JMP, which I was very impressed by. The STATA people told me that I could buy a student STATA license for like $55 and then use it commercially after I graduate. (As opposed to several thousand dollars for a SAS license that only lasts a year.) And I could use it for as long as I wanted to. The only thing I would have to pay for would be upgrades. So STATA has that going for them. I am definitely going to try it out.

Cheers.

March Madness in the Wild

I was sitting in my office last semester and this really tall guy walked past my door. I kind of did a double take, he was so tall. A few seconds later he came back and knocked on my door. He asked if my office mate was there so he could hand in some homework. I told him that he wasn’t around, but I could take the homework and give it to him when I saw him. He thanked me and left. It was Hasheem Thabeet. So in honor of Mr. Thabeet handing in his STAT 1100 homework to me, I have started a Yahoo! tourney pick’em group.

Join the Stats in the Wild Yahoo! Tournament Challenge. Join group #166142. Good luck.

Every year in March I start to care about college basketball. I even build some half-assed models to do my bracket. I finished in the 99th percentile of all Yahoo! brackets last year by managing to pick the champion, the two finalists, and all four Final Four teams. (My goal for a long time was to pick all four of the Final Four. I’d been stuck on three out of four since high school.) Anyway, in order to make my predictions, I build several models and then I combine the results.

Here are some thoughts on the tournament based on those models:

Most overrated teams in the tournament:
1. LSU: The SEC is a great football conference and a mediocre basketball conference (at least this year), and LSU is leading the charge of mediocre teams in this conference. LSU only wins their first round game because they play Butler. Speaking of Butler…..
2. Butler: They lost to Wisconsin Green Bay, Wisconsin Milwaukee and Loyola Chicago. Then they lost to Cleveland State in their tournament. Although in their defense they did beat IU-South Bend 87-33. They played two top 25 teams all year and they went 1-1: Ohio State (who is no longer in the top 25) and Xavier (another overrated team). Speaking of Xavier…..
3. Xavier: Look. I’m a UMass fan (from the days of Roe, Camby, Dana Dingle, Donta Bright, Derek Kellogg), but the A-10 stinks. Xavier deserves to be in the tourney, but as a 4 seed? I doubt it. This is the team that closed its regular season 4-4 with losses to Duquesne, Dayton, Charlotte, and Richmond, then lost to Temple in their conference tournament. I’m not exactly inspired by the Musketeers. Florida State beats them in the second round by 10.
4. Dayton: An at large bid? So Dayton is better than St. Mary’s, Penn State, Florida, Auburn, Creighton, and Miami? That’s what the selection committee is saying.
5. Boston College: USC should dispose of them in the first round. USC by 12.

Most underrated teams in the tournament:
West Virginia: I’ve got West Virginia as a Sweet Sixteen team. They are going to dismantle Dayton in the first round, then beat Kansas by 2. Then they have a good chance against Michigan State for a shot at the Elite 8.
Wisconsin: I think they got screwed with a 12 seed, and then by drawing a very good FSU 5 seed. If they can get past this game, they should cruise into the Sweet Sixteen past Xavier.

First round upsets: USC over BC.

Most likely 13 seed to win first round: Cleveland State
Most likely 12 seed to win first round: Arizona
Most likely 11 seed to win first round: Utah State
Most likely 10 seed to win first round: USC
Most likely 9 seed to win first round: Butler (Only because LSU stinks too.)

Biggest 3 seed blowout: Missouri
4 seed: Washington
5 seed: Illinois
6 seed: West Virginia
7 seed: Cal
8 seed: Ohio State

First number 1 seed gone: UConn

Good first round bets:
Washington -220
FSU -145
Utah -110
Illinois -200
Arizona St -200
Michigan +190 (This price is fantastic)
Texas A and M +115
Oklahoma St +115 (Seriously? They’re an underdog here?)
Ohio St -160
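
For anyone not used to reading these prices, here is a minimal sketch that converts an American moneyline into the break-even win probability it implies. The function name is my own, and the numbers ignore the bookmaker's cut, so treat them as rough guides rather than true probabilities.

```python
# Convert American moneyline prices to break-even win probabilities.
# implied_probability is a hypothetical helper name, not from any library.

def implied_probability(moneyline):
    """Win probability at which a bet at this price exactly breaks even."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win moneyline.
    return 100 / (moneyline + 100)


# Prices copied from the list above.
prices = {
    "Washington": -220,
    "FSU": -145,
    "Utah": -110,
    "Illinois": -200,
    "Arizona St": -200,
    "Michigan": 190,
    "Texas A and M": 115,
    "Oklahoma St": 115,
    "Ohio St": -160,
}

for team, price in prices.items():
    print(f"{team:14s} {price:+5d}  {implied_probability(price):.3f}")
```

A bet is only "good" if your model thinks the team wins more often than the break-even number, which is why a +190 price on a team you think is close to a coin flip looks fantastic.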

And finally, the official Final Four picks of Stats in the Wild (Drumroll Please):
UNC, Pitt, Memphis, and Louisville with UNC beating Memphis in the finals 76-75.

Also, I am picking San Diego State to win the NIT with a 72-67 win over Miami in the finals.

And for my boys in the NIT…..what? They’re not even in the NIT. Bring back Calipari!
Go.
Go U.
Go U – Mass.
Go UMass!

Cheers.

ENAR in the wild

I’m sitting at Bradley International Airport waiting for a flight to Raleigh/Durham. Then I’ll hop on a connector to Memphis, followed by another connector to San Antonio, Texas. What’s in San Antonio, you might ask? The Alamo? Tim Duncan? Well yes, but why would I post that on a stats blog?

The reason I am headed to the Lone Star state is for the Eastern North American Region (ENAR) (pronounced EE-NAR) of the International Biometric Society (IBS). (IBS is an unfortunate acronym in itself, but I should note that the Western North American Region (WNAR) is pronounced WEE-NAR. That is unfortunate. I am also extremely childish.) I’ll be there from tonight until Monday afternoon, so I can attend a few talks in the morning before I leave.

On Sunday night from 8-11pm, I’m giving a poster presentation about synthetic data related to the paper I just published. It’s like a student poster bonanza. And they serve alcohol. So it should be interesting.

I’d love to post the abstracts to some of the talks I am interested in attending, but Bradley’s internet connection is not letting me view the abstract page.

If Raleigh/Durham or Memphis has a better connection, I’ll post them when I get there.

Cheers.

Probability in the Wild

This isn’t really “statistics” or “in the wild” since it’s a problem from an advanced probability class I am taking, but it’s a neat problem so you’ll just have to live with it.

Here is the game:
I get a die and the opponent gets a die. (These dice are not necessarily numbered 1,2,3,4,5,6. The sides could be numbered 1,1,1,3,3,3, for example, but each die has six sides.) We both roll. If your number is higher than my number, you win a dollar. If my number is higher than your number, I win a dollar.

Can you design three six-sided dice in such a way that, if the opponent chooses his die first, I can always choose a die from the remaining two that makes my expected winnings positive?

STOP HERE IF YOU WANT TO TRY THIS YOURSELF!

You probably didn’t even try. So here’s an answer.
Here are three possible dice:
A: 0 0 6 6 6 6
B: 4 4 4 4 4 4
C: 2 2 2 2 8 8

Die A will beat Die B with probability 2/3.
Die B will beat Die C with probability 2/3.
Die C will beat Die A with probability 5/9.
(P(C>A) = P(C>A|A=0)*P(A=0) + P(C>A|A=6)*P(A=6) = 1*(1/3) + (1/3)*(2/3) = 5/9)

So if I let my opponent choose first, I can always make a choice that will make my expectation positive.
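
If you don't trust my arithmetic, here is a short sketch that checks it by brute force, enumerating all 36 equally likely face pairs for each matchup. The function and variable names are just ones I picked for the example.

```python
from fractions import Fraction
from itertools import product

# The three dice from the post.
DICE = {
    "A": [0, 0, 6, 6, 6, 6],
    "B": [4, 4, 4, 4, 4, 4],
    "C": [2, 2, 2, 2, 8, 8],
}


def p_beats(x, y):
    """Exact probability that die x rolls strictly higher than die y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))


# Whatever the opponent picks (second name), I pick the die that beats it (first name).
matchups = [("A", "B"), ("B", "C"), ("C", "A")]

for mine, theirs in matchups:
    p = p_beats(DICE[mine], DICE[theirs])
    # No ties are possible with these faces, so I win $1 with probability p
    # and lose $1 with probability 1 - p.
    expected = 2 * p - 1
    print(f"{mine} beats {theirs} with probability {p}; expected winnings ${expected} per roll")
```

Every matchup in the cycle comes out better than a coin flip for the second chooser, so the expected winnings are positive no matter which die the opponent grabs.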

The moral:
Make these dice, head to a bar, and fleece some drunks. (Just don’t tell them I sent you.) I guess that’s the “in the wild” part.

Cheers.

Shark, the Global Financial Crisis, and Correlation (in the wild)

I was reading the fantastic blog The Bernoulli Trial today (the Stock Market Seismometer is very interesting), and I stumbled onto one of his links, which brought me to The Mr Science Show. All of the posts are very interesting, but one post seemed particularly relevant to this page.

One of the topics I have talked about in the past is correlations between two variables when the correlation is just a coincidence (like me being in grad school and the Red Sox winning the World Series). Here is a post on Mr. Science Show which relates the global financial crisis to the number of fatal shark attacks in the world. In fact, this correlation has won the prestigious Mr. Science Show Correlation of the Week. Congratulations.

Cheers.

Shirts in the Wild

Having trouble dressing yourself and love statistics? Solve your problems with T-Shirts with witty statistics slogans from SassyStatistics.com (The link was sent in to statsinthewild by avid reader Kevin S. Bagge.)

My favorite would be: “One way or ANOVA, I’m gonna getcha.”

(Note: I am in no way financially involved with SassyStatistics.com)
(Note: If SassyStatistics.com wants to pay me, then I will gladly be involved with SassyStatistics.com)

Cheers.

Data Privacy in the Wild

I’m currently working on a systematic literature review of statistical issues of privacy. One of the papers I was reading today was from the Census. This paper discusses how the Census maintains the privacy of individuals when it releases microdata to the public.

(I’ve been reading some papers which propose methods that are much more sophisticated and (hopefully) provide more privacy than the rather simple methods that the Census mentions in this paper. I have no opinion on whether or not these methods are “sophisticated enough” to maintain privacy, I just think it’s interesting to see what the Census is actually doing.)

The paper mentions three techniques that the Census currently uses to maintain privacy.
1.) release of data for only a sample of the population
2.) limitation of detail
3.) top/bottom-coding

Here is an explanation from the census as to why they use these techniques:
“The Census Bureau currently uses several standard techniques to mask microdata sets. The first is a release of data for only a sample of the population. Intruders (i.e., those who query the file for the sole purpose of identifying particular individuals with unique traits) realize that there is only a small probability that the file actually contains the records for which they are looking. The Bureau currently releases three public use samples of the decennial census respondents. One is a 1 percent sample of the entire population, the second a 5 percent sample, and the third a sample of elderly residents. Each is a systematic sample chosen with a random start. None of these files overlap, so there is no danger of matching to each other. Most demographic surveys are 1-in-1000 and 1-in-1500 “random” samples. Generally the public use file for each survey contains records for each respondent. The second technique involves the limitation of detail. The Census Bureau releases no geographic identifiers which would restrict the record to a sub-population of less than 100,000. It also “recodes” some continuous values into intervals and combines sparse categories. Intruders must have extremely fine detail for other highly sensitive fields in order to positively identify targets. The third technique protects the detail in sensitive responses in continuous fields. It is referred to as top/bottom-coding. This method collapses extreme values of each sensitive field into a single value. For example, the record of an individual with an extremely high income would not contain his exact income but rather a code showing that the income was over $100,000. Similarly the low-income records would contain a code signifying the income was less than $0. In this example $0 is a bottom-code and $100,000 a top-code for the sensitive or high visibility field of income.” –CONTROLLED DATA-SWAPPING TECHNIQUES FOR MASKING PUBLIC USE MICRODATA SETS, Richard A. Moore, Jr.
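
To make the top/bottom-coding idea from the quote concrete, here is a minimal sketch using the $0 and $100,000 cutoffs from the income example. The function name, the string codes, and the sample incomes are all my own inventions for illustration; this is not the Census Bureau's actual implementation.

```python
# Toy top/bottom-coding of an income field. Names, codes, and sample values
# are assumptions made up for this example.

def top_bottom_code(income, bottom=0, top=100_000):
    """Replace extreme incomes with a code; leave everything else unchanged."""
    if income > top:
        return f">{top}"      # top-code: only reveal "over the top cutoff"
    if income < bottom:
        return f"<{bottom}"   # bottom-code: only reveal "under the bottom cutoff"
    return income


# Made-up records: a large loss, two ordinary incomes, one very high income.
incomes = [-12_500, 38_000, 99_999, 2_750_000]
masked = [top_bottom_code(x) for x in incomes]
print(masked)
```

The person with the very high income is exactly the kind of record an intruder could pick out, and after coding they look like everyone else above the cutoff.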

The rest of the paper goes on to discuss a more sophisticated method of maintaining privacy called data swapping (first proposed by Dalenius and Reiss (1980)). If you’re not bored to tears already, then it’s probably worth reading about. I think it’s interesting.

Cheers.

R in the wild

Here is an interesting blog post about how Google and Facebook are using R (free statistical software) at their companies.

Cheers.