## Using Mister P to get population estimates from respondent driven sampling

From one of our exams:

A researcher at Columbia University’s School of Social Work wanted to estimate the prevalence of drug abuse problems among American Indians (Native Americans) living in New York City. From the Census, it was estimated that about 30,000 Indians live in the city, and the researcher had a budget to interview 400. She did not have a list of Indians in the city, and she obtained her sample as follows.

She started with a list of 300 members of a local American Indian community organization, and took a random sample of 100 from this list. She interviewed these 100 persons and asked each of these to give her the names of other Indians in the city whom they knew. She asked each respondent to characterize him/herself and also the people on the list on a 0-10 scale, where 10 is “strongly Indian-identified,” 5 is “moderately Indian-identified,” and 0 is “not at all Indian-identified.” Most of the original 100 people sampled characterized themselves near 10 on the scale, which makes sense because they all belong to an Indian community organization. The researcher then took a random sample of 100 people from the combined lists of all the people referred to by the first group, and repeated this process. She repeated the process twice more to obtain 400 people in her sample.

Describe how you would use the data from these 400 people to estimate (and get a standard error for your estimate of) the prevalence of drug abuse problems among American Indians living in New York City. You must account for the bias and dependence of the nonrandom sampling method.

There are different ways to attack this problem but my preferred solution is to use Mister P:

1. Fit a regression model to estimate p(y|X). Here y represents some measure of drug abuse problems at the individual level, and X includes demographic predictors along with a measure of Indian identification (necessary because the survey design oversamples people who are strongly Indian-identified) and a measure of gregariousness (necessary because the referral design oversamples people with more friends and acquaintances);

2. Estimate the distribution of X in the population (in this case, all American Indian adults living in New York City); and

3. Take the estimates from step 1, and average these over the distribution in step 2, to estimate the distribution of y over the entire population or any subpopulations of interest.
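The three steps can be illustrated with a minimal, self-contained sketch in Python. Everything here is assumed for illustration: the simulated sample, the plain logistic regression standing in for what would in practice be a multilevel model, and the poststratification cells and counts, which in a real analysis are exactly the hard modeling work of step 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Step 1: fit p(y | X) on the (biased) sample. ---
# Simulated 400-person sample: X = [identification (0-10), gregariousness].
n = 400
ident = rng.uniform(5, 10, n)        # oversampled toward strong identification
greg = rng.normal(1.5, 0.5, n)       # oversampled toward high gregariousness
X = np.column_stack([np.ones(n), ident, greg])
true_beta = np.array([-1.0, -0.15, 0.3])      # made-up "truth" for simulation
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# Plain logistic regression fit by Newton's method (keeps the sketch
# dependency-free; a real analysis would use a multilevel model in Stan).
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))

# --- Step 2: distribution of X in the target population. ---
# Hypothetical poststratification cells (identification, gregariousness)
# with assumed population counts summing to 30,000; in a real analysis
# these must be modeled, accounting for the sampling bias.
cells = np.array([[2.0, 0.5], [5.0, 1.0], [8.0, 1.2], [10.0, 1.5]])
counts = np.array([12000, 9000, 6000, 3000])

# --- Step 3: average cell-level predictions over the population. ---
Xc = np.column_stack([np.ones(len(cells)), cells])
p_cell = 1 / (1 + np.exp(-Xc @ beta))
prevalence = np.sum(counts * p_cell) / np.sum(counts)
print(f"estimated population prevalence: {prevalence:.3f}")
```

A standard error would come from propagating the posterior (or sampling) uncertainty in beta, and ideally also the uncertainty in the cell counts, through the same weighted average.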

The hard part here is step 2, as I’m not aware of many published examples of such things. You have to build a model, and in that model you must account for the sampling bias. It can be done, though; indeed I’d like to do some examples of this to make these ideas more accessible to survey practitioners.

There’s some literature on this survey design—it’s called “respondent driven sampling”—but I don’t think the recommended analysis strategies are very good. MRP should be better, but, again, I should be able to say this with more confidence and authority once I’ve actually done such an analysis for this sort of survey. Right now, I’m just a big talker.

1. awm says:

I would think the tricky part would be fitting a model to account for the network structure of dependencies between the observations in step 1. That and knowing how “gregariousness” is distributed in the target population.

• Andrew says:

Awm:

Yes, I agree, some modeling is needed; it won’t work without additional assumptions. My intuition is that it would be possible to make lots of progress in these applications by adding some reasonable assumptions, recognizing that the resulting analyses will not be perfect.

2. Calum says:

I agree that the challenge is step 2, since respondent-driven sampling is usually used with populations for whom it isn’t easy to find (non-RDS!) estimates of any variable. As for estimating gregariousness, I find this from Katherine McLaughlin very appealing: http://www.stat.ucla.edu/~katherine.mclaughlin/JSMpaper_mclaughlin.pdf Once you have a good measure, Gile et al. have borrowed from oil and gas prospecting to weight the participants and estimate population prevalences.
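For readers unfamiliar with the weighting approach mentioned here: a simpler relative of Gile's successive-sampling estimator is the Volz-Heckathorn (RDS-II) estimator, which weights each respondent inversely to their reported network size, on the model that inclusion probability is proportional to degree. A minimal sketch with made-up data:

```python
import numpy as np

# Hypothetical reported network sizes (degrees) and drug-problem
# indicators for the 400 respondents; real data would come from the survey.
rng = np.random.default_rng(1)
degree = rng.integers(1, 30, 400)
y = rng.binomial(1, 0.2, 400)

# RDS-II (Volz-Heckathorn) estimator: weight each respondent by 1/degree,
# since under the RDS model inclusion probability is proportional to degree.
w = 1.0 / degree
prevalence_hat = np.sum(w * y) / np.sum(w)
print(f"degree-weighted prevalence estimate: {prevalence_hat:.3f}")
```

Gile's successive-sampling estimator refines this by accounting for the without-replacement depletion of high-degree members of a finite population, which is where the oil-and-gas analogy comes in.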

3. Peter Dorman says:

I’m going to go out on a limb, since I’ve usually recommended against this sampling methodology (also called snowball, yes?), and I’m just guessing about how to adjust for it—but here goes.

It strikes me that we have a form of clustering here, where people referred by individual X are socially clustered. This might be important in a substance abuse study: if X is a substance abuser, acquaintances of X are more likely to be as well. If this is true, we ought to make the appropriate adjustment.

As for gregariousness, this arises for two reasons. (1) More gregarious individuals will suggest more names, biasing further rounds of the sample in the direction of their social circles, and gregariousness may covary with other factors that tend to bias the sample of referrals. (2) The individuals more likely to be referred are more likely to be gregarious, i.e. not socially isolated. (This is what I think Andrew had in mind.) Is there a literature on addiction and social isolation? It’s not obvious to me which way this would bend, but it may well go one way or the other.

Anyway, (1) can be dealt with by changing the sampling strategy: ask each of the original 100 to provide a fixed number of names. Then we have only (2).

I would also worry a lot about measurement error and nonresponse in a study like this. (a) Respondents with drug problems might not reveal them. (b) People may be more likely to refuse participation in the study if they have drug problems. I’ve dealt with revelation issues (sensitive topics) in the past by asking both individual and ecological questions: do you do x, and what proportion of your community do you think does x? (b) is standard nonresponse bias.

4. Greg Snow says:

This is great, I would love to see some “case study” style examples that have data, Stan code, and a step by step description of what assumptions are being made and how they affect the likelihood used.

I often use an example in teaching (frequentist classes) about random sampling: if you are interested in the mean height of students at the university, then using the heights of the basketball team would be convenient, but not very representative. I have thought for a while that there may be a Bayesian approach that would allow estimation of all students’ heights based on the heights of the basketball team, depending heavily on assumptions and a likelihood that takes the sampling into account. This example has some similarities (but is probably more sensible than the basketball example).

It seems that a simple starting place for measuring gregariousness would be the number of people that each person refers (in the 1st 3 samples) and the number of times they were referred (in the last 3).
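The proxy suggested above can be computed directly from the referral records. A small sketch with hypothetical referrer/referred pairs (all names are made up for illustration):

```python
from collections import Counter

# Hypothetical referral records: (referrer_id, referred_id) pairs
# collected across the sampling waves.
referrals = [
    ("a", "d"), ("a", "e"), ("a", "f"),
    ("b", "e"), ("c", "f"), ("d", "g"),
]

out_degree = Counter(r for r, _ in referrals)   # names each person gave
in_degree = Counter(s for _, s in referrals)    # times each person was named

# A crude gregariousness score: referrals made plus times referred.
gregariousness = {p: out_degree[p] + in_degree[p]
                  for p in set(out_degree) | set(in_degree)}
print(gregariousness["a"])   # a named three people and was never named -> 3
```

Out-degree is only observed for the earlier waves (whose members were asked for names) and in-degree only for people already reachable by referral, so in practice the two components would cover different subsets of the sample, as noted above.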

• I’d also love to see an example of how to do this! So far, the only approach I can see would be to use some (hopefully reasonable) estimate of what the population might be. I can’t really see how we could build a model to, e.g., estimate the population of Indians / taxi drivers / sushi restaurants / whatever based on interviewing 400 of them.
Cheers, Daniel