A question came in that relates to an important general problem in survey weighting. Connie Wang from Novartis pharmaceutical in Whippany, New Jersey, writes:
I read your article “Struggles with survey weighting and regression modeling” and I have a question on sampling weights.
My question is on a stratified sample design to study revenues of different companies. We originally stratified the population (about 1000 companies) into 11 strata based on 9 attributes (large/small Traffic Volume, large/small Size of the Company, with/without Host/Remote Facilities, etc.) which could impact revenues, and created a sampling plan. In this sampling plan, samples were selected from within each stratum in PPS method (probabilities proportionate to size), and we computed the sampling weights (inverse of probability of selection) for all samples in all strata. In this sampling plan, sampling weights for different samples in the same stratum may not be the same since samples were drawn from within each stratum not in SRS (simple random sample) but in PPS/census.
After all samples were drawn, we want to estimate the revenue of large/medium/small size of companies (the above sampling plan was created for other purpose) respectively. We poststratified all samples based on size of the company only. Obviously, this re-classification was totally different from the original stratifying based on 9 attributes. Let’s assume that there are x samples falling into large company group under this re-classification. Obviously, these x samples could be from any original stratum or several original strata since this poststratification was regardless of the original stratifying. E.g., some samples in original stratum h could fall into large company group under this re-classification and other samples in original stratum h could fall into medium company group under this re-classification. Now my question is: are the original stratified sampling weights valid for the large company group from poststratifying? E.g., can we multiply the revenue of each sample by its original stratified sampling weight for all x samples in the large company group poststratified and add up all products to get the total revenue of the large company group (subtotal)?
I guess the sample weights are not valid for the post-stratified large company group, noting that the check rule on sample weights: add up all sample weights in a stratum (poststratified) and see if the sum is equal to the stratum size (subpopulation). So, we cannot multiply the revenue of each sampling unit by its original sampling weight to get the total revenue.
My thinking is that the original sampling weights need to be adjusted (rescaled) to get new sampling weights for the post-stratified new strata, and then the revenue of each sampling unit can be multiplied by its new sampling weight to get the total revenue. Another reason to adjust for sampling weights is non-responses.
My reply:

1. You use PPS sampling. Usually PPS is used in cluster sampling, where the first stage is PPS and the second stage samples a fixed number of units per cluster, so that the unit-level sampling is equal probability of selection. (See, e.g., these references for more on these issues.) So if you’re just doing PPS, I don’t know that you’ll need unit-level weights (or, if you do, they shouldn’t vary much if the survey was designed well).
2. The full solution to your problem is to poststratify on the 2-way table of the 11 design strata crossed with the 3 size categories–that’s 33 post-strata in all. If you have enough data, you can simply do full poststratification–that is, get a separate estimate for each of the 33 cells, and then add up the rows of the table to get estimates for each of the 3 size categories.
3. Another option, which might work better if the sample size is smaller, is to rake: poststratify on your 11 strata, then on your 3 size categories, and iterate this a couple times so that your weighted sample matches the population in both these dimensions.
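To make options 2 and 3 concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the two strata (A, B), two size categories, design weights, revenues, and population counts are toy numbers, not Connie's data, and the function names `poststratify`, `rake`, and `weighted_total` are my own. It assumes you know the population count in every cell (for full poststratification) or along each margin (for raking), and that no cell or margin category is empty in the sample.

```python
from collections import defaultdict

def poststratify(weights, groups, pop_counts):
    """Rescale weights so they sum to the known population count in each group.
    Relative weights within a group (e.g. from PPS) are left unchanged."""
    totals = defaultdict(float)
    for w, g in zip(weights, groups):
        totals[g] += w
    return [w * pop_counts[g] / totals[g] for w, g in zip(weights, groups)]

def rake(weights, margins, n_iter=50):
    """Poststratify on each margin in turn, repeated n_iter times.
    margins is a list of (groups, pop_counts) pairs, one per dimension."""
    for _ in range(n_iter):
        for groups, pop_counts in margins:
            weights = poststratify(weights, groups, pop_counts)
    return weights

def weighted_total(values, weights, groups, target):
    """Weighted total of values over the units whose group equals target."""
    return sum(v * w for v, w, g in zip(values, weights, groups) if g == target)

# Toy sample (hypothetical): design stratum, size category,
# original design weight, and revenue for six sampled companies.
strata   = ["A", "A", "A", "B", "B", "B"]
sizes    = ["large", "small", "small", "large", "large", "small"]
design_w = [5.0, 20.0, 25.0, 4.0, 6.0, 30.0]
revenue  = [100.0, 10.0, 12.0, 90.0, 110.0, 8.0]

# Hypothetical population counts, by cell and along each margin.
cell_counts = {("A", "large"): 20, ("A", "small"): 40,
               ("B", "large"): 10, ("B", "small"): 30}
stratum_counts = {"A": 60, "B": 40}
size_counts = {"large": 30, "small": 70}

# Option 2: full poststratification on the stratum-by-size cells.
cells = list(zip(strata, sizes))
w_full = poststratify(design_w, cells, cell_counts)
total_large_full = weighted_total(revenue, w_full, sizes, "large")

# Option 3: raking, which needs only the two sets of marginal counts.
w_rake = rake(design_w, [(strata, stratum_counts), (sizes, size_counts)])
total_large_rake = weighted_total(revenue, w_rake, sizes, "large")

print("large-company total, full poststratification:", total_large_full)
print("large-company total, raking:", total_large_rake)
```

With the real data, the cell table would be 11 by 3 rather than 2 by 2. Raking only needs the marginal counts, which is why it can be the more practical choice when some of the 33 cells have few or no sampled companies.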
You certainly shouldn’t need to take a new sample (unless the original sample has too small a sample size in one or more of your post-strata of interest).
P.S. Connie sent me a new version of her question so I altered her wording above as requested.