Survey weighting and that 2% swing

Nate Silver agrees with me that much of that shocking 2% swing can be explained by systematic differences between sample and population: survey respondents included too many Clinton supporters, even after corrections from existing survey adjustments.

In Nate’s words, “Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees.” Last time we looked carefully at this, my colleagues and I found that pollsters weighted for sex x ethnicity and age x education, but not by ethnicity x education.

I could see how this could be an issue. It goes like this: Surveys typically undersample less-educated people, I think even relative to their proportion of voters. So you need to upweight the less-educated respondents. But less-educated respondents are more likely to be African Americans and Latinos, so this will cause you to upweight these minority groups. Once you’re through with the weighting (whether you do it via Mister P or classical raking or Bayesian Mister P), you’ll end up matching your target population on ethnicity and education, but not on their interaction, so you could end up with too few less-educated white voters.
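To see the problem concretely, here is a tiny illustration with made-up numbers: a sample that over-represents college-educated whites is raked (iterative proportional fitting) to the correct ethnicity and education margins, yet the white-without-college cell still comes up short.

```python
# Hypothetical illustration: raking to the ethnicity and education margins
# matches both margins exactly, but can still miss the ethnicity x education
# interaction, leaving too few less-educated white respondents.
import numpy as np

# rows = ethnicity (white, nonwhite); columns = education (no college, college)
population = np.array([[0.40, 0.30],
                       [0.20, 0.10]])   # made-up target joint distribution
sample     = np.array([[0.25, 0.40],
                       [0.20, 0.15]])   # survey over-represents college-educated whites

row_target = population.sum(axis=1)     # ethnicity margin: 0.70 / 0.30
col_target = population.sum(axis=0)     # education margin: 0.60 / 0.40

raked = sample.copy()
for _ in range(100):                    # classical raking (iterative proportional fitting)
    raked *= (row_target / raked.sum(axis=1))[:, None]
    raked *= (col_target / raked.sum(axis=0))[None, :]

print(np.allclose(raked.sum(axis=1), row_target),
      np.allclose(raked.sum(axis=0), col_target))   # True True: both margins match
print(population[0, 0], raked[0, 0].round(3))       # 0.40 vs ~0.384: white x no-college still short
```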

There’s also the gender gap: you want the right number of white male and female voters in each education category. In particular, we found that in 2016 the gender gap increased with education, so if your sample gets some of these interactions wrong, your estimates could be biased.

Also a minor thing: Back in the 1990s the ethnicity categories were just white / other and there were 4 education categories: no HS / HS / some college / college grad. Now we use 4 ethnicity categories (white / black / hisp / other) and 5 education categories (splitting college grad into college grad / postgraduate degree). Still just 2 sexes though. For age, I think the standard is 18-29, 30-44, 45-64, and 65+. But given how strongly nonresponse rates vary by age, it could make sense to use more age categories in your adjustment.
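For concreteness, the categories just listed imply a poststratification grid of 4 x 5 x 2 x 4 = 160 adjustment cells; a quick sketch of enumerating them (the labels simply restate the categories above):

```python
# Enumerate the adjustment cells implied by the categories above:
# 4 ethnicity x 5 education x 2 sex x 4 age = 160 cells.
from itertools import product

ethnicity = ["white", "black", "hispanic", "other"]
education = ["no HS", "HS", "some college", "college grad", "postgrad"]
sex       = ["female", "male"]
age       = ["18-29", "30-44", "45-64", "65+"]

cells = list(product(ethnicity, education, sex, age))
print(len(cells))   # 160
```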

Anyway, Nate’s headline makes sense to me. One thing surprises me, though. He writes, “most pollsters apply demographic weighting by race, age and gender to try to compensate for this problem. It’s less common (although by no means unheard of) to weight by education, however.” Back when we looked at this, a bit over 20 years ago, we found that some pollsters didn’t weight at all, some weighted only on sex, and some weighted on sex x ethnicity and age x education. The surveys that did very little weighting relied on the design to get a more representative sample, either using quota sampling or using tricks such as asking for the youngest male adult in the household.

Also, Nate writes, “the polls may not have reached enough non-college voters. It’s a bit less clear whether this is a longstanding problem or something particular to the 2016 campaign.” All the surveys I’ve seen (except for our Xbox poll!) have massively underrepresented young people, and this has gone back for decades. So no way it’s just 2016! That’s why survey organizations adjust for age. There’s always a challenge, though, in knowing what distribution to adjust to, as we don’t know turnout until after the election—and not even then, given all the problems with exit polls.

P.S. The funny thing is, back in September, Sam Corbett-Davies, David Rothschild, and I analyzed some data from a Florida poll and came up with the estimate that Trump was up by 1 in that state. This was a poll where the other groups analyzing the data estimated Clinton up by 1, 3, or 4 points. So, back then, our estimate was that a proper adjustment (in this case, using party registration, which we were able to do because this poll sampled from voter registration lists) would shift the polls by something like 2% (that is, 4% in the differential between the two candidates). But we didn’t really do anything with this. I can’t speak for Sam or David, but I just figured it was one poll and I didn’t take it so seriously.

In retrospect maybe I should’ve thought more about the idea that mainstream pollsters weren’t adjusting their numbers enough. And in retrospect Nate should’ve thought of that too! Our analysis was no secret; it appeared in the New York Times. So Nate and I were both guilty of taking the easy way out, looking at poll aggregates and not doing the work to get inside the polls. We’re doing that now, in December, but we should’ve been doing it in October. Instead of obsessing about details of poll aggregation, we should’ve been working more closely with the raw data.

P.P.S. Could someone please forward this email to Nate? I don’t think he’s getting my emails any more!

31 thoughts on “Survey weighting and that 2% swing”

  1. Here’s another idea–don’t do polls to try to predict the election at all! What do such polls REALLY add? They help to satisfy our curiosity about the election, but is that really a contribution? Just have the election–that’s the only survey that matters.

    • Tom:

      It is helpful to understand public opinion. I agree there are way too many polls—I’ve said so for a long time, and I’m sure you can find a few such statements on this blog. But I think zero polls would be too few.

      Also, public opinion is of interest to news readers. You don’t want polls; other people might not want celebrity news, or sports news, or movie reviews, or whatever. It would be perverse for news organizations not to report on a topic that is of interest to so many readers.

    • Has anyone written something about replacing polling and elections with a system whereby citizens can register their approval or disapproval of policies and politicians continuously? It seems to me that the technology now exists for every citizen to have a secure account in which they can register their preferences for candidates, and even legislative proposals, on a running basis, updating their “profiles” as their opinions change. The vote totals could be made public daily (the account would be wholly private of course), but the office would only change hands every four years according to the vote on some specified date. There would no doubt be increased activity up to “election day”; but I don’t think we’d poll people to try to predict this activity. We’d just look at the actual trends. (This could work for all elected officials. Your profile/account would let you state your preference for all officials you are able to vote on.) I guess I’m imagining a system of polling where the sample = the population.

      • I’ve thought of something like this too, but then (perhaps reading more into my idea than yours) you remember that our system of government is not a pure democracy. Which doesn’t make it a bad idea, just a fundamental change.

        • If by “not pure” you mean “representative”, I should clarify that that’s mainly also what I have in mind. It’s just a system to elect representatives (and give them some guidance about what the population thinks about particular policies). If by “not pure” you mean that there are also nondemocratic powers in the society, then it’s still interesting for the mandate of the people’s representatives to be transparent. I agree that the effects could be quite fundamental, but I really just see it as a technical change to the voting system.

        • I meant more the former. I think it would be valuable for the population to have a better grasp on what their fellow citizens think, and if you added a feature where it compared the polling to your local/state/federal representative’s voting that would be pretty cool. But to the extent that we are a representative democracy in part to distance the populace from decision making, this level of transparency could actually be counter-productive.

      • I’ve been saying for years and years (to those family members who roll their eyes) that all elections to office should be run using score voting, with integer scores from 0 to 10 inclusive. Put a score next to each candidate, no score = 0, sum the scores, and elect the one with the highest score. (A tiny tally sketch follows at the end of this comment.) It still has strategic voting issues, but it DOESN’T have some of the perversity of our current system and it DOES get around Arrow’s Theorem because it’s a cardinal voting system not an ordinal one (it expresses strength of preference not just order of preference).

        This puts things on a “more like continuous” approval scale. I think you’re talking about a continuous-in-time system. That’s not a bad idea either.

        Of course, there’s no reason the govt has to do this. You could set up a non-profit to create a comprehensive low-bias randomly selected panel of people and pay them to answer questions every week or so. A panel of just say 10,000 people paid to respond (so non-response isn’t a major issue) if chosen well would be a substantial improvement over current systems. You could have say 10% turnover per year to deal with people who become sick, move, drop out for other reasons, die etc. This probably partly exists already somewhere, but a single comprehensive source with online website for people to participate and frequent questions of various topical types could be quite useful for policy makers and the public to have as a reference.
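        Here, as promised above, is a minimal sketch of the score-voting tally described in this comment; candidate names and ballots are made up.

```python
# Minimal sketch of the score-voting rule described above: each ballot maps
# candidates to integer scores 0-10, a missing score counts as 0, and the
# candidate with the highest total wins. All names and ballots are hypothetical.
from collections import Counter

def score_voting_winner(ballots, candidates):
    totals = Counter({c: 0 for c in candidates})
    for ballot in ballots:
        for c in candidates:
            totals[c] += ballot.get(c, 0)   # no score = 0
    return totals.most_common(1)[0], totals

ballots = [
    {"Alice": 10, "Bob": 3},                # this voter left Carol unscored
    {"Alice": 2, "Bob": 7, "Carol": 9},
    {"Bob": 5, "Carol": 8},
]
winner, totals = score_voting_winner(ballots, ["Alice", "Bob", "Carol"])
print(winner, dict(totals))   # ('Carol', 17) {'Alice': 12, 'Bob': 15, 'Carol': 17}
```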

    • Tom – one benefit to polls is addressing coverage, maybe most specifically participation in the debates. One could certainly argue about where to place the thresholds, but I think it’s hard to argue that literally everyone who could get a vote for president should be involved in the same debate. This general idea could carry down to media coverage in general, but I admit that I start to become more sympathetic to giving people a platform even if they are little-known.

  2. Kind of disagree. Or rather, I think this analysis is correct as a slice, but it seems not to tell the whole story, though it is presented as doing so. I keep checking the published data (I don’t have pollster data, etc., nor the inclination to look at it), and I keep seeing that turnout in the key states wasn’t great, that Trump’s numbers there really didn’t show much improvement over Romney – his numbers overall were up, but mostly in the red states, apparently because Romney’s Mormonism cost something over 2 million votes – and that Hillary fell far short of Obama’s numbers.

    I’ve noted this in other comments: the drop in Dem voters in key cities, meaning the drop in the black vote, was more than the entire state difference. So what I see is, yeah, Trump “won” because of less-educated white voters, but Hillary would have won if the Dems had turned out more black votes. One can argue the former was somewhat unexpected, though my reading is that it was bad work to ignore it. The latter I have seen discussed with regard to exit polls, and now in terms of numbers of votes, but not in the context of pre-election polls.

    Where I grew up, for example (Wayne County, Michigan, which includes Detroit), Hillary got nearly 80,000 fewer votes. If you look at the GOP/Dem splits, there was a total drop of about 61,000 votes: the GOP vote rose by about 15k, the rest of the 20k+ difference went to the 3rd parties, and about 61k votes simply disappeared from the Dem pile. So in a county with a substantial less-educated white population, the GOP vote was up somewhat, but the absolute loss in Dem votes was 4x that.

    Note as well that voter registrations in the county were also down substantially, by about 27k, which I would guess is black votes lost to the Dems in another slice: they didn’t register voters and they didn’t turn them out. That says the only way Hillary would have won MI is if white people had failed to vote too; the percentage turnout was the same, but of a lower number of registered voters.
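    Taking the commenter’s round numbers at face value (they are approximate, not official totals), a quick check of the bookkeeping:

```python
# Rough arithmetic check of the approximate Wayne County figures cited above.
dem_drop   = 80_000   # ~ fewer Clinton votes than Obama 2012
total_drop = 61_000   # ~ drop in total votes cast
gop_gain   = 15_000   # ~ increase in the GOP vote
third_party_gain = dem_drop - total_drop - gop_gain
print(third_party_gain)   # ~4,000: the remainder of the ~19-20k that shifted rather than vanished
```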

    • But remember too that the population of Detroit is not stable; it is declining pretty rapidly for a city, so you would not expect the same absolute numbers as you got 4 years ago even if there were no other change. I have been wondering if there was an issue with overweighting respondents in shrinking cities.

      • But also remember that Wayne County is not just Detroit — it includes 33 other cities and 9 unincorporated townships. So one needs to consider the demographics of all of those, not just Detroit. Departures from Detroit have partly been to other areas within the county.

        • I’ve been thinking that it’s somewhat of a Katrina effect: some are nearby, some are in Baton Rouge (i.e., elsewhere in Michigan), and some are in Houston, etc. (out of state).

        • I don’t think Katrina is a good analogy — Katrina was a one-time event, whereas Detroit’s population decline in the past 4 years is a continuation of a process that’s been going on for a long time.

          In any event, one would need to look at Wayne County population shifts rather than Detroit population shifts to be able to say anything relevant to Jonathan’s arguments.

        • Interesting. A study of the population growth and decline in Detroit and Wayne County could be quite interesting. (Or maybe I’m prejudiced — I grew up in Detroit when it was still pretty much in its heyday; my grandparents moved there in around 1910 and 1917 because that’s where the jobs were then; and my parents moved to Florida in around 1965 because that’s where the jobs were then — Detroit was definitely in decline by then!)

  3. “In retrospect maybe I should’ve thought more about the idea that mainstream pollsters weren’t adjusting their numbers enough.”

    Told ya.

    http://statmodeling.stat.columbia.edu/2016/09/23/trump-1-in-florida-or-a-quick-comment-on-that-5-groups-analyze-the-same-poll-exercise/

    Terry says:
    October 5, 2016 at 3:05 am
    Andrew,

    This sounds like a very big deal. Shouldn’t you be looking at all the polling results to see if there is a systematic bias?

  4. I am confused. I had thought that a key part of what made Mr. P different from simple weighting was using cross-tabs to address the interaction among demographic covariates: Kastellec, Lax, and Phillips (http://www.princeton.edu/~jkastell/MRP_primer/mrp_primer.pdf) say “Be careful here. MRP requires knowing not just the simple state-level statistics reported in the Statistical Abstract, such as the number of females or African Americans in a state. If your model treats opinion as a function of gender, race, age, and education you will need to know, for instance, the number of African American females aged 18 to 29 years who are college graduates”

    But here, you say, “Once you’re through with the weighting (whether you do it via Mister P or classical raking or Bayesian Mister P), you’ll end up matching your target population on ethnicity and education, but not on their interaction, so you could end up with too few less-educated white voters.”

    To me, these two statements seem to conflict. I must be misunderstanding something, but I don’t know where my error is.

    • My problem is reading comprehension. I missed the connection to the previous paragraph, where you wrote, “Last time we looked carefully at this, my colleagues and I found that pollsters weighted for sex x ethnicity and age x education, but not by ethnicity x education.”

      That clears it up. D’oh!

  5. On an individual poll with a sample size of, say, 1500, there are limits to how much weighting you can do without ending up like that USC poll that gave a large weight to one African-American respondent. This is your basic bias-variance tradeoff.

    And the poll aggregators don’t solve this problem because they are aggregating after whatever mysterious weighting was done.

    And in THIS election, THIS particular weighting would have done better. But in future elections?

    Somebody smarter than me once wrote “Survey weighting is a mess.”
    (first sentence here: http://www.stat.columbia.edu/~gelman/research/published/STS226.pdf )

    Maybe we need some sort of random-forest-like sampling of weighting schemes for survey data? Like the Florida poll example, but with many more different weightings done, resulting in a distribution of weighting outcomes??
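    A minimal sketch of that idea (not any pollster’s actual procedure; the data structures and the simple cell-weighting step are hypothetical): reweight the same respondents under many randomly chosen subsets of adjustment variables and look at the spread of the resulting estimates.

```python
# Sketch of sampling over weighting schemes: each scheme weights the sample
# to the population cell shares of a random subset of adjustment variables,
# giving a distribution of estimates rather than a single number.
import random
from collections import Counter

VARS = ["sex", "age", "ethnicity", "education"]

def cell_weights(respondents, pop_shares, use_vars):
    """weight = population share of the respondent's cell / sample share of that cell."""
    key = lambda r: tuple(r[v] for v in use_vars)
    counts = Counter(key(r) for r in respondents)
    n = len(respondents)
    return [pop_shares[key(r)] / (counts[key(r)] / n) for r in respondents]

def weighted_mean(ys, ws):
    return sum(y * w for y, w in zip(ys, ws)) / sum(ws)

def scheme_distribution(respondents, pop_share_tables, n_schemes=200, seed=0):
    """pop_share_tables maps a sorted tuple of variable names to that table's cell shares."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_schemes):
        use_vars = tuple(sorted(rng.sample(VARS, rng.randint(1, len(VARS)))))
        ws = cell_weights(respondents, pop_share_tables[use_vars], use_vars)
        estimates.append(weighted_mean([r["y"] for r in respondents], ws))
    return estimates   # the spread of these is the "distribution of weighting outcomes"
```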

    • Zbicyclist:

      You can do multilevel regression and poststratification in a poll of 1500. Actually, that Florida poll we analyzed had only 800 or so respondents. The multilevel modeling partially pools so that when data are sparse the estimate does not overfit. The key is to forget about “weighting” and go directly to the problem of estimating attitudes in the population.
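      To illustrate the partial-pooling idea, here is a crude stand-in (the function name is made up, and this is not the actual multilevel regression, which would be fit in something like Stan): shrink each cell’s mean toward the overall mean in proportion to how little data the cell has, then average the pooled cell estimates using the population cell shares.

```python
# Crude stand-in for MRP (the real thing fits a multilevel regression):
# partial pooling pulls sparse cells toward the grand mean, then
# poststratification averages cells by their population shares.
def partial_pool_and_poststratify(cell_ys, pop_shares, prior_n=10.0):
    """cell_ys: {cell: list of 0/1 responses}; pop_shares: {cell: population share}."""
    all_ys = [y for ys in cell_ys.values() for y in ys]
    grand_mean = sum(all_ys) / len(all_ys)

    estimate = 0.0
    for cell, share in pop_shares.items():
        ys = cell_ys.get(cell, [])
        n = len(ys)
        cell_mean = sum(ys) / n if n else grand_mean
        pooled = (n * cell_mean + prior_n * grand_mean) / (n + prior_n)  # shrinkage
        estimate += share * pooled                                       # poststratify
    return estimate
```

      With roughly 800 respondents spread over something like 160 cells, many cells are sparse or empty; the shrinkage keeps the estimate from being driven by a handful of heavily weighted respondents.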

  6. One curiosity I’d like to see addressed is that the statewise residuals of the ballots vs the polls for 2016 match, point for point, the residuals of 2016 ballots predicted by 2012 ballots. (See articles earlier at this blog for the pictures of residuals I have in mind). That is, the 2% difference (results – polls) is informative of (results2016 – results2012) – meaning it is a real world effect.

    Why would the polls miss a real change in the electorate between 2012 and 2016 at such a fine level of detail across all states? It seems too coincidental that the pattern of residuals from a demographic calibration error would match, state by state, the changes in the electorate — *unless the calibration errors were themselves exactly correlated with the demographics*. This, by the way, was the signature of the problems with the Bell Curve for IQ.

  7. As far as I can tell you have studiously avoided one possible reason the polls and vote tallies differ: the official tallies are wrong. The results aren’t only different from the pre-election polls, they differ from the exit polls. One could say that vote tampering was a crazy hypothesis, but that’s name-calling, not science. American elections have been tampered with in the past (e.g., LBJ’s election to the Senate–both his first run when he lost and the second when he won); the Republican party has been very visibly tampering with all stages of the election process before the tallying by throwing people off the rolls, adding requirements, and making it harder to vote. Voting involves computers, and they are hackable, and the voting is administered by partisans, often on equipment manufactured by partisans. Some of the voting restrictions have been found illegal in the courts; is it really so hard to imagine that other illegal behaviors, like messing with the vote total, might occur?

    The best way to find out whether the official tallies are accurate is to audit or recount the actual votes where that is possible, and to fix things so that such audits are possible everywhere (for next time). Fair elections require much more than accurate vote-counting, but it is an essential part.

    • With specific reference to election polls, it is unlikely they could be informative about official tallies being wrong. If you model the counting process as a random walk (for the difference between the counts for the two leading candidates in a given state), then to call an election you must be able to predict the final excursion after the last zero crossing. The relevant theorems in the theory of random walks are ‘the ballot theorem’ and ‘the arcsin principle’ and ‘Lagrange’s formula for last crossing’. I urge anyone interested to look those formulas up — they do not look promising for 1% or even 5% of the data, leaving aside measurement error.

      The arcsin principle (for ultimate ruin of one candidate or the other) is especially fun – it predicts that, of the possible counting paths, most of the uncertainty is in a bathtub-shaped curve that is exactly mirrored about a coin-toss. For every uncertainty in the first percent of the data, there is an exactly mirrored uncertainty in the last percent of the data. Most of the information comes from the first few votes, which establish the trend — all the arguments are here — *and* the last few, which determine the winner.
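      A small simulation of the bathtub shape being described, with a symmetric random walk standing in for a dead-even count (all parameters arbitrary): the last tie tends to fall either very early or very late, rarely in the middle.

```python
# Simulate the arcsine law for the last zero crossing of a symmetric random
# walk (a dead-even race): the last tie is most likely very early or very late.
import random

def last_tie_fraction(n_steps, rng):
    pos, last_tie = 0, 0
    for t in range(1, n_steps + 1):
        pos += 1 if rng.random() < 0.5 else -1
        if pos == 0:
            last_tie = t
    return last_tie / n_steps

rng = random.Random(1)
fracs = [last_tie_fraction(1000, rng) for _ in range(5000)]
print(sum(f < 0.10 for f in fracs) / len(fracs),         # ~0.20 in the first tenth of the count
      sum(f > 0.90 for f in fracs) / len(fracs),         # ~0.20 in the last tenth
      sum(0.45 < f < 0.55 for f in fracs) / len(fracs))  # only ~0.06 near the middle
```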

  8. Thanks for the references. However strong your argument is that exit polls can’t prove there was a problem (and I don’t think it’s as strong as you say, since the exit polls are informative, even if not probative), it doesn’t address my main point: it’s a mistake to assume the vote tallies are reliable.
