PPS in Georgia

Lucy Flynn writes:

I’m working at a non-profit organization called CRRC in the Republic of Georgia.

I’m having a methodological problem and I saw the syllabus for your sampling class online and thought I might be able to ask you about it?

We do a lot of complex surveys nationwide; our typical sample design is as follows:

– stratify by rural/urban/capital
– sub-stratify the rural and urban strata into NE/NW/SE/SW geographic quadrants
– select voting precincts as PSUs
– select households as SSUs
– select individual respondents as TSUs

I’m relatively new here, and past practice has been to sample voting precincts with probability proportional to size. It’s desirable because it’s not logistically feasible for us to vary the number of interviews per precinct with precinct size, so it makes the selection probabilities for households more even across precinct sizes. However, I have a complex sampling textbook (Lohr 1999), and it explains how complex it is to calculate selection probabilities when sampling with probability proportional to size and without replacement. In fact, it only presents examples with n=2, because beyond that the formulas get so complex. However, I’ve read published papers where more than 2 clusters per stratum are selected with PPS and without replacement, so I’m wondering how people calculate the sampling weights. Is there a software package we can get where I can input the sizes of the clusters and the desired sample size and get without-replacement selection probabilities for each cluster? We use the program STATA and I think it may have an add-in that will do this, but I’m not sure because I haven’t been able to download it.
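The claim that PPS evens out household selection probabilities when the number of interviews per precinct is fixed can be checked with a little arithmetic. The sketch below is only an illustration: the precinct sizes, the function names, and the values of n and m are made up, and it uses the expected number of selections under with-replacement PPS as a stand-in for the selection probability.

```python
import math

def household_rate_pps(sizes, n, m):
    """Expected selections per household: n draws with probability size/total,
    then m households taken out of `size` in each drawn precinct."""
    total = sum(sizes.values())
    return {i: n * (s / total) * (m / s) for i, s in sizes.items()}

def household_rate_srs(sizes, n, m):
    """Same quantity when precincts are drawn with equal probability n/N."""
    n_precincts = len(sizes)
    return {i: (n / n_precincts) * (m / s) for i, s in sizes.items()}

# Hypothetical precinct sizes (number of households).
sizes = {"small": 200, "medium": 800, "large": 3000}
pps_rates = household_rate_pps(sizes, n=2, m=10)
srs_rates = household_rate_srs(sizes, n=2, m=10)

# Under PPS the precinct size cancels, leaving n * m / total for every household.
# Under SRS a household in a small precinct is far more likely to be picked.
for name in sizes:
    print(name, round(pps_rates[name], 5), round(srs_rates[name], 5))
```

With a simple random sample of precincts, the household rate is inversely proportional to precinct size, which is why the weights become uneven.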

My reply:

1. Your sampling plan seems reasonable to me.

2. When sampling with probability proportional to size, my recommendation is to sample with replacement. If you end up picking a particular unit twice, just gather a double-size sample from that unit. This makes all the formulas much easier to use. When computing estimates, standard errors, etc., just treat any multiple samples from within a cluster as multiple clusters, and all should work out fine. (On the off chance that you get multiple samples from a tiny cluster and have to sample more people than actually exist in the cluster, you can just do a complete sample for that cluster and correct with weighting. But in practice I doubt that will happen.)
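Here is a minimal sketch of what that could look like, assuming Python rather than Stata. The function names, precinct sizes, and the 10 interviews per hit are hypothetical. The estimator shown is the standard with-replacement (Hansen-Hurwitz) form, which averages the estimated cluster total divided by the draw probability over the draws, so a precinct drawn twice simply appears twice.

```python
import random
from collections import Counter

def draw_pps_wr(sizes, n, seed=None):
    """Draw n PSUs with replacement; each draw picks PSU i with probability size_i / total."""
    rng = random.Random(seed)
    ids = list(sizes)
    total = sum(sizes.values())
    draw_prob = {i: sizes[i] / total for i in ids}
    draws = rng.choices(ids, weights=[sizes[i] for i in ids], k=n)
    return draws, draw_prob

def hansen_hurwitz_total(draws, draw_prob, est_totals):
    """Estimate a population total. est_totals[k] is the estimated cluster total
    from the k-th draw; repeated PSUs are treated as separate clusters."""
    return sum(t / draw_prob[i] for i, t in zip(draws, est_totals)) / len(draws)

# Hypothetical precinct sizes (number of households).
sizes = {"A": 100, "B": 300, "C": 600}
draws, draw_prob = draw_pps_wr(sizes, n=4, seed=1)

# A precinct hit h times gets h times the usual number of interviews.
hits = Counter(draws)
interviews = {i: 10 * h for i, h in hits.items()}

# Sanity check: if the quantity being totalled is the household count itself,
# every draw gives size_i / draw_prob_i = total, so the estimate is exact.
estimate = hansen_hurwitz_total(draws, draw_prob, [sizes[i] for i in draws])
print(hits, interviews, estimate)
```

The weight for each draw is proportional to 1 / (n * draw_prob), which depends only on the precinct's own size and the stratum total, so none of the joint-probability formulas from the without-replacement case are needed.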

3. I don’t use Stata myself but I’ve heard that it’s the way to go with surveys. You could try the Stata help list with your question. But if you follow my suggestion above, maybe everything will be simple enough that you won’t have to worry.

5 thoughts on “PPS in Georgia”

  1. I generally get better results with urban/suburban/rural. This should be particularly true with Georgia, where ring cities are notably different from both the urban and rural areas.

  2. Andrew, thank you so much for your response, and thank you also to Mark for your comment.

    I drew the sample for one survey right before I left town two weeks ago, and I tried drawing a PPS sample with replacement. There are so many precincts that the selection probabilities are quite low, and only one precinct ended up in the sample twice. It was a precinct in the capital, so it would have had plenty of households. However, from what I understood at the time from my textbook (Lohr 1999), when sampling with PPS with replacement, re-sampling of any PSU needed to be done independently and in exactly the same way that the first sample from that PSU was taken.

    This is what I thought we would have to do to sample with replacement: We select households within the voting precinct using a systematic sample that begins at the precinct's polling station and takes a step size of 10 households in the capital (5 for rural and 7 for urban). If we sample a precinct twice, then the second sample needs to be performed independently, so that the interviewer starts at the polling station again, selects a new random number between 1 and 10, starts at the randomly selected household, and goes from there. If a different random number was selected then the samples would contain none of the same households, but if the same random number happened to be selected then we would have to go back to the same houses. Since we choose the individual respondent within the household according to whoever had the last birthday, this would mean that we would be interviewing the same individuals more than once.
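    The overlap problem described above can be seen in a short sketch. Everything here is illustrative: the function names are made up, households are simply numbered by their position along the route from the polling station, and 5 interviews per pass is an assumed figure.

```python
import random

def systematic_positions(start, step, take):
    """Positions along the route: start, start + step, start + 2*step, ..."""
    return [start + k * step for k in range(take)]

def independent_pass(step, take, rng):
    """One independent pass: a fresh random start between 1 and step."""
    start = rng.randint(1, step)
    return start, systematic_positions(start, step, take)

step, take = 10, 5  # capital step size; 5 interviews per pass is an assumption

first = systematic_positions(3, step, take)        # households 3, 13, 23, 33, 43
second_diff = systematic_positions(7, step, take)  # different start: 7, 17, ...
second_same = systematic_positions(3, step, take)  # same start drawn again

overlap_diff = set(first) & set(second_diff)  # no shared households
overlap_same = set(first) & set(second_same)  # every household shared

# The alternative from the reply: one pass with double the number of interviews.
doubled = systematic_positions(3, step, 2 * take)

rng = random.Random(2024)
random_start, random_pass = independent_pass(step, take, rng)
print(overlap_diff, sorted(overlap_same), doubled, random_start)
```

    With a step of 10, two independent passes draw the same start with probability 1/10, in which case the two passes coincide completely. A single pass with twice as many interviews never revisits a household.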

    Just to be on the safe side, I went ahead and drew an SRS of precincts for this last survey. However, if I understand correctly that we can "re-sample" a doubly-selected precinct by simply doubling its sample size, then that's completely feasible and we can switch to that for the next survey.

    Mark, regarding your comment:

    We need to keep the capital as a separate stratum because many client organizations are interested in saying things specifically about the capital. However, I think that you're right that the single urban stratum isn't ideal.

    I spoke with my boss about it and he said that the towns classified as "urban" vary widely in how "urban" they actually are. Any town that's a regional administrative center is considered urban, yet many of them are essentially small rural towns. His idea was to re-classify the precincts in the urban and rural strata into a stratum where less than X% of total income is from agriculture (which would be our new definition of "urban") and a stratum where more than X% of the total income is from agriculture (which would be our new definition of "rural"). He said that we should be able to get sufficient data to do this from the Georgian Statistics Department. It will be a major undertaking because GeoStat's data is collected by city/village rather than by voting precinct and we'll have to match them up one by one, but he thinks it would be a worthwhile investment of time because we'll use the sampling frame so many times before they redistrict. The next election is probably 2 years off.

  3. I should have been more clear. I meant to suggest adding the suburban category.

    I generally used census data matched on nine digit zip, but that was mainly because I was building mail models. An experienced demographer could probably come up with something better.
