Stats in World War II (in the wild)
A friend of mine told me about this problem, so I went and looked it up. This is stats in the wildest of the wild.
So, during World War II, the Allies were trying to estimate the number of a certain kind of German tank. They needed this information to better plan their attacks and invasions. There were two sets of estimates made, one by intelligence and another by a group using statistical methods.
Estimates made using statistical methods in June 1940, June 1941 and August 1942 of the number of a certain type of German tank were, respectively, 169, 244, and 327. The intelligence estimates for each of those same three periods of time were, respectively, 1000, 1550, and 1550. (From: Number of German tanks)
These estimates are drastically different, and depending on which estimate was believed, it is possible that battle plans may have been significantly affected. So who made the better estimates?
In most situations when we estimate something, we can never actually know what the true value is. However, as it turns out, after the war was over, German records became available, and with them the actual number of tanks the Germans had at each of those three points in time. For June 1940, June 1941 and August 1942, the actual numbers were, respectively, 122, 271, and 342. (Recall that the statistical estimates were 169, 244, and 327 for those three time periods.) The statistical estimates are astonishingly close, and the intelligence estimates are alarmingly inaccurate. So how did they do it?
The statistical group looked at the serial numbers of tanks that had been captured or destroyed by Allied troops, and they assumed that the serial numbers of the tanks ran in order from 1 to T, where T is the number of tanks that the Germans had. So if the Allies found a tank with serial number 200, they could conclude that the Germans had at least (and almost surely more than) 200 tanks.
So if we assume that each serial number has equal probability of being observed, our maximum likelihood estimate (our best guess) of T is simply the maximum serial number that we encounter on a captured or destroyed tank. However, the maximum encountered serial number turns out to be a biased estimator of T. (If we always used the largest serial number as our estimate of T, we would be systematically underestimating T, because our largest observation is usually not the actual largest value.) So what we need is an unbiased estimator for T.
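The claim that the largest observed serial number tends to fall short of T can be checked with a quick simulation. This is only a sketch: the function name `mean_sample_maximum` and the values T = 300 and n = 10 are made up for illustration, and it assumes serial numbers are drawn without replacement, each equally likely.

```python
import random


def mean_sample_maximum(total_tanks, sample_size, trials=2000, seed=0):
    """Average the largest observed serial number over many simulated samples."""
    rng = random.Random(seed)
    serials = range(1, total_tanks + 1)
    total = 0
    for _ in range(trials):
        # Draw distinct serial numbers, each equally likely to be observed.
        total += max(rng.sample(serials, sample_size))
    return total / trials


if __name__ == "__main__":
    T, n = 300, 10  # hypothetical values, chosen only for illustration
    avg = mean_sample_maximum(T, n)
    print(f"true T = {T}, average sample maximum = {avg:.1f}")
```

The sample maximum can never exceed T, so its average has to sit below T; the simulation just shows how far below it lands for a small sample.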
As it turns out, the expected value of our estimator of T (the maximum observed serial number) is approximately n/(n+1)*T, where n is the number of serial numbers observed (hence biased). So on average the largest observed value will be smaller than the actual T. To correct for this we simply multiply the largest observed value by (n+1)/n. This will give us an unbiased estimate for the number of tanks the Germans had, and this is how they reached their statistical estimates.
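For readers who want to see where the factor n/(n+1) comes from, here is a short derivation. It rests on a simplifying assumption that is not spelled out above: the serial numbers are treated as n independent draws X_1, ..., X_n from a continuous uniform distribution on (0, T), with M denoting their maximum.

```latex
\begin{align*}
P(M \le x) &= \left(\frac{x}{T}\right)^{n}, \qquad 0 \le x \le T, \\
f_M(x) &= \frac{n\,x^{n-1}}{T^{n}}, \\
E[M] &= \int_0^T x \cdot \frac{n\,x^{n-1}}{T^{n}}\,dx
      = \frac{n}{T^{n}} \cdot \frac{T^{n+1}}{n+1}
      = \frac{n}{n+1}\,T, \\
E\!\left[\frac{n+1}{n}\,M\right] &= T .
\end{align*}
```

If the serial numbers are instead treated as distinct integers from 1 to T drawn without replacement, the exact expectation is n(T+1)/(n+1), and the corresponding unbiased estimator is M(n+1)/n - 1. For the sample sizes used here, the two versions differ by only one tank.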
Say we observe 50 tank serial numbers and the largest observed serial number is 245. With all of the above assumptions, our unbiased estimate as to the number of tanks is 51/50*245=249.9.
If we observe 25 serial numbers and the largest is 110, our best guess is 26/25*110=114.4.
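The two calculations above can be reproduced with a few lines of Python. The function name `corrected_estimate` is just a label chosen for this sketch; it applies the (n+1)/n correction and nothing more.

```python
def corrected_estimate(sample_size, max_serial):
    """Scale the largest observed serial number by (n+1)/n."""
    if sample_size < 1:
        raise ValueError("need at least one observed serial number")
    return max_serial * (sample_size + 1) / sample_size


# The two worked examples from the text.
print(f"{corrected_estimate(50, 245):.1f}")  # 249.9
print(f"{corrected_estimate(25, 110):.1f}")  # 114.4
```

Notice that the correction shrinks as the sample grows: with 50 observations the estimate is only 2% above the largest serial number, while with 25 it is 4% above.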
Here is a link to another blog post about the German tank problem.
Modern note: I saw online that someone was using this approach to try to estimate the number of servers that Google has. (More to come on that)
Ruggles, R., and Brodie, H. (1947), “An Empirical Approach to Economic Intelligence in World War II,” Journal of the American Statistical Association, 42, 72–91.
Goodman, L. A. (1954), “Some Practical Techniques in Serial Number Analysis,” Journal of the American Statistical Association, 49, 97–112.