Hey! Here’s what to do when you have two or more surveys on the same population!

This problem comes up a lot: We have multiple surveys of the same population and we want a single inference. The usual approach, applied carefully by news organizations such as Real Clear Politics and Five Thirty Eight, and applied sloppily by various attention-seeking pundits every two or four years, is “poll aggregation”: you take the estimate from each poll separately, if necessary correct these estimates for bias, then combine them with some sort of weighted average.

But this procedure is inefficient and can lead to overconfidence (see discussion here, or just remember the 2016 election).

A better approach is to pool all the data from all the surveys together. A survey response is a survey response! Then when you fit your model, include indicators for the individual surveys (varying intercepts, maybe varying slopes too), and include that uncertainty in your inferences. Best of both worlds: you get the efficiency from counting each survey response equally, and you get an appropriate accounting of uncertainty from the multiple surveys.

OK, you can’t always do this: To do it, you need all the raw data from the surveys. But it’s what you should be doing, and if you can’t, you should recognize what you’re missing.
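To make the pooling idea concrete, here is a minimal sketch in Python. It is not the full multilevel model described above: it has no covariates, and it swaps in a crude method-of-moments estimate of the between-survey variance for a proper varying-intercept fit. The function name `pooled_estimate`, the `(survey_id, response)` row format, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def pooled_estimate(rows):
    """rows: iterable of (survey_id, response) pairs, response coded 0 or 1.

    Returns (p_hat, naive_se, se_with_surveys).
    """
    by_survey = defaultdict(list)
    for survey_id, y in rows:
        by_survey[survey_id].append(y)

    sizes = [len(v) for v in by_survey.values()]
    means = [sum(v) / len(v) for v in by_survey.values()]
    n_total = sum(sizes)
    k = len(sizes)

    # Pooled point estimate: every response counts equally.
    p_hat = sum(n * m for n, m in zip(sizes, means)) / n_total

    # Naive standard error: pretends all responses are one simple random sample.
    naive_se = math.sqrt(p_hat * (1 - p_hat) / n_total)

    # Within-survey sampling variance of each survey mean.
    within = [m * (1 - m) / n for n, m in zip(sizes, means)]

    # Between-survey variance. Assumption: crude method-of-moments estimate,
    # truncated at zero; it needs at least two surveys to be defined.
    if k > 1:
        grand = sum(means) / k
        spread = sum((m - grand) ** 2 for m in means) / (k - 1)
        tau2 = max(0.0, spread - sum(within) / k)
    else:
        tau2 = 0.0

    # Variance of the pooled estimate when each survey mean carries its own
    # survey effect (variance tau2) plus ordinary sampling noise.
    var = sum((n / n_total) ** 2 * (tau2 + w) for n, w in zip(sizes, within))
    return p_hat, naive_se, math.sqrt(var)


# Hypothetical toy data: two surveys of 100 respondents, 60% and 40% "yes".
rows = [("A", 1)] * 60 + [("A", 0)] * 40 + [("B", 1)] * 40 + [("B", 0)] * 60
p_hat, naive_se, se = pooled_estimate(rows)
print(f"estimate={p_hat:.3f} naive_se={naive_se:.3f} se_with_surveys={se:.3f}")
# prints: estimate=0.500 naive_se=0.035 se_with_surveys=0.100
```

In this toy case the two surveys disagree by much more than sampling noise would explain, so accounting for survey-level variation nearly triples the standard error. With only two surveys the between-survey variance is itself very poorly estimated, which is one reason a real analysis would fit a multilevel model with informative priors and respondent-level predictors instead of relying on this shortcut.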

14 thoughts on “Hey! Here’s what to do when you have two or more surveys on the same population!”

  1. Might aggregating individual polls’ data count overlapping respondents more than once, and hence lead to another level of overconfidence? Also, it is debatable whether different polls can ever have exactly the same population. That shift might induce further uncertainty, as intuitively the more independent the individual polls are, the more variance reduction the final aggregation could gain.

    • Yuling:

      If anyone responds to both surveys, I’d include their response just once; I would not include duplicate responses. But above I’m thinking of surveys of large populations, in which duplicates would be so rare as to be essentially irrelevant in practice.

      • For duplicates, including all of them leads to overconfidence because the responses are no longer independent Bernoulli draws, but deleting them without further adjustment will lead to bias. Sure, duplication is rare in large-scale polls, but I think it represents an extreme case of how data concentration/dependence will influence the aggregation.

    • Ideally you create a model for the data collection process that links the underlying population to the data collected through the sampling method, and then use different models for the different surveys, with a single underlying population you do inference on.

  2. Different polls and/or surveys are worded differently, have different sampling strategies, and were conducted at different times. Aggregation done properly can adjust for quality, recency, and the historical bias of the pollster, which can come from language, selection, or other rules used for polling (how many times to retry a number, etc.).

    Pooling is great when these are not present, and probably beats naive or badly done aggregation, but do you think that pooling would enable better predictive accuracy for a site like Five Thirty Eight than they manage now with aggregation?

    • Not a polling expert at all, but I would say yes. All of those factors will be accounted for by including indicators as Andrew suggested, at least as well as they are now with aggregation methods. You have additional external information about each poll? Fine, include it in the group-level model for the indicators.

  3. Doesn’t this approach assume the availability of individual response level data for each poll? What if the only thing available to you is tabular information, e.g., breakouts by geo-demo, or mixtures of response and tabular information?
