I’ve blogged about this before but it’s worth mentioning again as a good teaching example. The site How Many of Me purports to estimate how many people in the U.S. have any particular name. But it can give wrong results; as “the other Craig Newmark” noted, it said there was only one of him, and there are actually at least two.
What the site actually does is plug in estimates of the frequency of the first name and the frequency of the last name and assume independence, that is, multiply the two frequencies together. The results can be wrong.
This could be a great example for teaching probability. Three questions: first, how can you check that the site really is assuming independence; second, how many people does the site assume are in the U.S.; third, how could you do better?
1. How can you check that the site really is assuming independence? We’ll check four names and see how many it says there are of each:
Rebecca Schwartz: 171
Rebecca Smith: 6600
Mary Schwartz: 1047
Mary Smith: 40941
Calculate the ratios: 6600/171=39, 40941/1047=39. Check.
Actually, to one more digit, the ratios are 38.6 and 39.1. Why the difference? Shouldn’t they be exactly the same? Playing around with the last digits reveals that it can’t be simple rounding error. Maybe some internal rounding error in the calculations? (Perhaps another good lesson for the class?) Hmm, let me go back and check. Number of Mary Schwartzes: 1047. Check. Number of Mary Smiths? 40491. Uh oh, I’d transposed the digits when copying the number. Now the ratios agree (to within rounding error).
The website is definitely assuming independence. I have no doubt that there are some Mary Schwartzes out there but no way that the frequencies of Marys among Smiths and Schwartzes are exactly identical.
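Here is a minimal Python sketch of the check above. It uses only the four counts quoted in this post. The helper names (ratio_interval, intervals_overlap) are made up for illustration, and the idea that the site rounds each displayed count to the nearest integer is my assumption, not something the site documents.

```python
# Counts as quoted in the post, using the corrected Mary Smith figure.
counts = {
    ("Rebecca", "Schwartz"): 171,
    ("Rebecca", "Smith"): 6600,
    ("Mary", "Schwartz"): 1047,
    ("Mary", "Smith"): 40491,
}


def ratio_interval(numerator, denominator):
    """Range of numerator/denominator if each displayed count could be off
    by up to 0.5 (assumption: the site rounds to the nearest integer)."""
    low = (numerator - 0.5) / (denominator + 0.5)
    high = (numerator + 0.5) / (denominator - 0.5)
    return low, high


def intervals_overlap(a, b):
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Smith-to-Schwartz ratio for each first name.
ratio_rebecca = counts[("Rebecca", "Smith")] / counts[("Rebecca", "Schwartz")]
ratio_mary = counts[("Mary", "Smith")] / counts[("Mary", "Schwartz")]

# The same ratio with the transposed (mistyped) Mary Smith count.
ratio_mary_typo = 40941 / 1047

rebecca_range = ratio_interval(6600, 171)
mary_range = ratio_interval(40491, 1047)
mary_typo_range = ratio_interval(40941, 1047)

print("Rebecca:", round(ratio_rebecca, 1), rebecca_range)
print("Mary:", round(ratio_mary, 1), mary_range)
print("Mary (typo):", round(ratio_mary_typo, 1), mary_typo_range)
print("consistent:", intervals_overlap(rebecca_range, mary_range))
print("typo consistent:", intervals_overlap(rebecca_range, mary_typo_range))
```

Under that rounding assumption, the corrected counts give overlapping ranges and the mistyped count does not, which is the "it can't be simple rounding error" step done by machine rather than by fiddling with last digits.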
2. How many people does the site assume are in the U.S.? The site says there are 4,024,977 people in the U.S. with the first name Mary, 3,069,846 people in the U.S. with the last name Smith, and 40,491 Mary Smiths. 4024977*3069846/40491 = 305 million. So that’s what they’re assuming.
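The arithmetic, as a short Python sketch. Under independence the site's count for a full name is (number with first name) * (number with last name) / N, so N can be backed out from the three numbers quoted above. The variable names are mine, and the bounds assume the displayed Mary Smith count was rounded to the nearest integer.

```python
# Figures quoted from the site in the post.
n_mary = 4_024_977        # first name Mary
n_smith = 3_069_846       # last name Smith
n_mary_smith = 40_491     # full name Mary Smith

# Independence: n_mary_smith = n_mary * n_smith / N, so solve for N.
implied_population = n_mary * n_smith / n_mary_smith

# Assumption: the displayed joint count is rounded to the nearest integer,
# which leaves a small band of populations consistent with it.
lowest = n_mary * n_smith / (n_mary_smith + 0.5)
highest = n_mary * n_smith / (n_mary_smith - 0.5)

print(f"Implied population: {implied_population / 1e6:.1f} million")
print(f"Consistent range: {lowest:,.0f} to {highest:,.0f}")
```

The rounding band is only a few thousand people wide, so "305 million" is as precise as this back-of-the-envelope calculation needs to be.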
3. How could you do better? Phone books are an obvious start. They don’t have everybody and there are other sampling difficulties involved (for example, a telephone that’s under the name of only one person in the family, leaving the others unlisted) but they would give you some clear information about how large the discrepancies from independence are.
And, a bonus:
4. A bad idea (which might be tried by a naive instructor who doesn’t get the point): Using this to teach the chi-squared test for statistical independence. This is a bad idea for two reasons: first, the data in HowManyofMe.com are not a sample under statistical independence; they are exactly statistically independent (a/b=c/d) and so a chi-squared test is beside the point. Second, for real data the point is not whether they could be explained by statistical independence–they can’t–but how large the discrepancy is. This can be expressed using probabilities or odds ratios or whatever but not by the magnitude or the p-value of a chi-squared test. (If you want to use this example to illustrate chi-squared, this is the point you’d have to make.)
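To make the "how large is the discrepancy" point concrete, here is a minimal sketch of the odds ratio for the 2x2 table of the four names above. The function name is hypothetical and the only inputs are the site's own counts. Because those counts are built from independence, the odds ratio comes out at 1 up to rounding; with real data (say, from phone books) the same calculation would tell you how far from 1 you are.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]].

    Equals 1 exactly when a/b == c/d, i.e. under exact independence.
    """
    return (a * d) / (b * c)


# Rows: Rebecca, Mary. Columns: Schwartz, Smith. Counts from the site.
rebecca_schwartz, rebecca_smith = 171, 6600
mary_schwartz, mary_smith = 1047, 40491

site_odds_ratio = odds_ratio(
    rebecca_schwartz, rebecca_smith, mary_schwartz, mary_smith
)
print(f"Odds ratio from the site's counts: {site_odds_ratio:.3f}")
```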
P.S. I’ve never met the other Andrew Gelman, but I did once meet someone who lives down the street from him (in New Jersey).