## Where’s the other Craig Newmark, or, the statistical dependence of first and last names

I’ve blogged about this before but it’s worth mentioning again as a good teaching example. The site How Many of Me purports to estimate how many people in the U.S. have any particular name. But it can give wrong results; as “the other Craig Newmark” noted, it said there was only one of him, and there are actually at least two.

What the site actually does is to plug in estimates of the frequency of the first name and the frequency of the last name and assume independence. The results can be wrong.

This could be a great example for teaching probability. Three questions: first, how can you check that the site really is assuming independence; second, how many people does the site assume are in the U.S.; third, how could you do better?

1. How can you check that the site really is assuming independence? We’ll check four names and see how many it says there are of each:

Rebecca Schwartz: 171
Rebecca Smith: 6600
Mary Schwartz: 1047
Mary Smith: 40941

Calculate the ratios: 6600/171=39, 40941/1047=39. Check.

Actually, to one more digit, the ratios are 38.6 and 39.1. Why the difference? Shouldn’t they be exactly the same? Playing around with the last digits reveals that it can’t be simple rounding error. Maybe some internal rounding error in the calculations? (Perhaps another good lesson for the class?) Hmm, let me go back and check. Number of Mary Schwartzes: 1047. Check. Number of Mary Smiths? 40491. Uh oh, I’d transposed the digits when copying the number. Now the ratios are 38.6 and 38.7, which agree to within rounding error.

The website is definitely assuming independence. I have no doubt that there are some Mary Schwartzes out there, but there is no way that the frequencies of Marys among Smiths and among Schwartzes are exactly identical.
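For a class, the check can be written out as a short Python sketch. The counts are the ones quoted from the site (using the corrected Mary Smith figure); the dictionary layout and the function name `smith_to_schwartz` are just choices made for this illustration.

```python
# Counts reported by the site (Mary Smith uses the corrected 40491).
counts = {
    ("Rebecca", "Schwartz"): 171,
    ("Rebecca", "Smith"): 6600,
    ("Mary", "Schwartz"): 1047,
    ("Mary", "Smith"): 40491,
}

def smith_to_schwartz(first):
    """Ratio of Smiths to Schwartzes among people with the given first name."""
    return counts[(first, "Smith")] / counts[(first, "Schwartz")]

ratio_rebecca = smith_to_schwartz("Rebecca")
ratio_mary = smith_to_schwartz("Mary")

# Under independence the two ratios coincide, so their quotient
# (the odds ratio of the 2x2 table) should be 1 up to rounding.
odds_ratio = ratio_mary / ratio_rebecca

print(f"Rebecca: {ratio_rebecca:.1f}, Mary: {ratio_mary:.1f}, "
      f"odds ratio: {odds_ratio:.3f}")
```

An odds ratio this close to 1 is what you would expect if the site multiplies first-name and last-name frequencies and then rounds to whole people.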

2. How many people does the site assume are in the U.S.? The site says there are 4,024,977 people in the U.S. with the first name Mary, 3,069,846 people in the U.S. with the last name Smith, and 40,491 Mary Smiths. 4024977*3069846/40491 = 305 million. So that’s what they’re assuming.
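The arithmetic follows from the independence assumption: if N is the assumed population, then the number of Mary Smiths is N × (Marys/N) × (Smiths/N), which rearranges to N = Marys × Smiths / Mary Smiths. A minimal sketch, using only the three figures quoted above (the variable names are mine):

```python
# Figures quoted from the site.
n_mary = 4_024_977       # people with first name Mary
n_smith = 3_069_846      # people with last name Smith
n_mary_smith = 40_491    # people named Mary Smith

# Under independence: n_mary_smith = N * (n_mary / N) * (n_smith / N),
# so N = n_mary * n_smith / n_mary_smith.
implied_population = n_mary * n_smith / n_mary_smith

print(f"Implied U.S. population: {implied_population / 1e6:.0f} million")
```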

3. How could you do better? Phone books are an obvious start. They don’t have everybody and there are other sampling difficulties involved (for example, a telephone that’s under the name of only one person in the family, leaving the others unlisted) but they would give you some clear information about how large the discrepancies from independence are.

And, a bonus:

4. A bad idea (which might be tried by a naive instructor who doesn’t get the point): Using this to teach the chi-squared test for statistical independence. This is a bad idea for two reasons: first, the data in HowManyofMe.com are not a sample under statistical independence; they are exactly statistically independent (a/b=c/d) and so a chi-squared test is beside the point. Second, for real data the point is not whether they could be explained by statistical independence–they can’t–but how large the discrepancy is. This can be expressed using probabilities or odds ratios or whatever but not by the magnitude or the p-value of a chi-squared test. (If you want to use this example to illustrate chi-squared, this is the point you’d have to make.)
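To make the second reason concrete, here is a minimal sketch with made-up counts (the table is hypothetical, not real name data). It computes the odds ratio and the Pearson chi-squared statistic for a 2x2 table, then multiplies every cell by 100. The odds ratio, which measures the size of the discrepancy, stays the same, while the chi-squared statistic grows in proportion to the sample size.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def chi_squared(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction), 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts, for illustration only:
# rows = two first names, columns = two last names.
small = (30, 70, 20, 80)
large = tuple(100 * x for x in small)  # same proportions, 100x the sample

for label, table in [("small", small), ("large", large)]:
    print(label,
          "odds ratio:", round(odds_ratio(*table), 3),
          "chi-squared:", round(chi_squared(*table), 2))
```

With enough data, any real departure from independence produces a huge chi-squared value, so the statistic mostly tells you how many names you counted.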

P.S. I’ve never met the other Andrew Gelman, but I did once meet someone who lives down the street from him (in New Jersey).

1. The first thing I'd try would be to build a mixture model which generates a first name and last name conditioned on the mixture component. The full model would draw a name type (mixture component) and then you'd draw first names and last names independently from type-specific distributions. I'm guessing the types would cluster socially and by root language. To be trendy (in natural language processing circles), I'd use Dirichlet process priors on the types and tokens to avoid having to fix their sizes a priori.
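A toy version of this idea, as a rough sketch only: it uses a fixed, finite set of two name types rather than Dirichlet process priors, and every type label and probability below is invented for illustration. It shows the key property of the model, that first and last names are independent within a type but dependent in the population as a whole.

```python
import random

# Hypothetical name types with made-up probabilities, for illustration only.
TYPE_WEIGHTS = {"A": 0.5, "B": 0.5}
FIRST_GIVEN_TYPE = {
    "A": {"Mary": 0.8, "Rebecca": 0.2},
    "B": {"Mary": 0.4, "Rebecca": 0.6},
}
LAST_GIVEN_TYPE = {
    "A": {"Smith": 0.9, "Schwartz": 0.1},
    "B": {"Smith": 0.3, "Schwartz": 0.7},
}

def joint_prob(first, last):
    """P(first, last) = sum over types of P(type) P(first|type) P(last|type)."""
    return sum(
        w * FIRST_GIVEN_TYPE[t][first] * LAST_GIVEN_TYPE[t][last]
        for t, w in TYPE_WEIGHTS.items()
    )

def marginal(table, name):
    """Population-wide probability of a name, averaging over types."""
    return sum(w * table[t][name] for t, w in TYPE_WEIGHTS.items())

def draw(rng, dist):
    """Draw one key from a {value: probability} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def sample_name(rng):
    """Draw a type, then first and last names independently given the type."""
    t = draw(rng, TYPE_WEIGHTS)
    return draw(rng, FIRST_GIVEN_TYPE[t]), draw(rng, LAST_GIVEN_TYPE[t])

p_joint = joint_prob("Mary", "Smith")
p_indep = marginal(FIRST_GIVEN_TYPE, "Mary") * marginal(LAST_GIVEN_TYPE, "Smith")
print(f"joint: {p_joint:.2f}, product of marginals: {p_indep:.2f}")
print(sample_name(random.Random(0)))
```

In this toy setup the joint probability of the pair differs from the product of the marginals, which is exactly the kind of discrepancy the independence assumption misses.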

There was another Robert Lee Carpenter at Michigan State who was also a math major and at one point we were both teaching assistants for the same algebra/trig class during the same quarter! But my biggest competition for #1 Google rank for the query <Bob Carpenter> is the Bob Carpenter Center in Delaware.

2. Frizzled says:

Andrew,

May this biology postdoc tell you how great he thinks your "Bayesian Data Analysis" (2nd ed) book is. I've worked through the first six chapters so far and feel that it's helped me learn a great deal. It's embarrassing how statistically ignorant many scientists are; perhaps we need to humble ourselves and start learning data analysis from the social scientists. I had a couple of questions…

1. For model selection, what do you think about using the Bayesian Information Criterion as compared to DIC (for example as described for spatial statistics by Dasgupta and Raftery, JASA 1998)? You don't seem to mention this in your model selection chapter. I'm interested in spatial statistics, for example model selection for feature matching in image analysis – is there a good reference for understanding how to implement this in a statistically-kosher way, with the possibility of extending the model to different kinds of features, etc?

2. Any chance of a hint for Exercise 3.8? The one about the bicycles – it's not in your solved problems list and I don't know how to form the posterior.

Thanks!

3. Lee Sigelman says:

So your Andrew Gelman number is, hmm, I guess the answer depends on which Andrew Gelman you're talking about.

4. Lee Sigelman says:

P.S. I just checked the website and it says that there be fewer than one Lee Sigelman. I'm not likin' the sound of that.

5. Andrew,

I am very interested in your point #4. In a recent field experiment on local primary elections, I gave voters varying amounts of information. I used a chi-squared test to see if voters with more information (like issue and biographical info) make different decisions than those with less information (e.g., just name and contact info).

Are you suggesting that this isn't the right test? If it isn't appropriate, what would you suggest?

Recall that these are dozens of races with two or more candidates in each, and there isn't any easy way to pool the candidate data because they are partisan primaries or non-partisan races.

6. vak says: