Election surprise, and Three ways of thinking about probability

Background: Hillary Clinton was given a 65% or 80% or 90% chance of winning the electoral college. She lost.

Naive view: The poll-based models and the prediction markets said Clinton would win, and she lost. The models are wrong!

Slightly sophisticated view: The predictions were probabilistic. 1-in-3 events happen a third of the time. 1-in-10 events happen a tenth of the time. Polls have nonsampling error. We know this, and the more thoughtful of the poll aggregators included this in their model, which is why they were giving probabilities in the range 65% to 90%, not, say, 98% or 99%.

More sophisticated view: Yes, the probability statements are not invalidated by the occurrence of a low-probability event. But we can learn from these low-probability outcomes. In the polling example, yes an error of 2% is within what one might expect from nonsampling error in national poll aggregates, but the point is that nonsampling error has a reason: it’s not just random. In this case it seems to have arisen from a combination of differential nonresponse, unexpected changes in turnout, and some sloppy modeling choices. It makes sense to try to understand this, not to just say that random things happen and leave it at that.

This also came up in our discussions of the betting markets’ failures on Trump in the Republican primaries, Leicester City, and Brexit. Dan Goldstein correctly wrote that “Prediction markets have to occasionally ‘get it wrong’ to be calibrated,” but, once we recognize this, we should also, if possible, do what the plane-crash investigators do: open up the “black box” and try to figure out what went wrong that could’ve been anticipated.

Hindsight gets a bad name but we can learn from our failures and even from our successes—if we look with a critical eye and get inside the details of our forecasts rather than just staring at probabilities.

64 thoughts on “Election surprise, and Three ways of thinking about probability”

    • Spotted:

      For reasons discussed in our comment thread (for example, Republican Senate candidates outperforming the polls just as the Republican Presidential candidate did), I doubt that social desirability bias in survey responses had much effect here. I think it’s much more about people not responding to the polls in the first place, and to other people not turning out to vote.

      • “In this case it seems to have arisen from a combination of differential nonresponse, unexpected changes in turnout, and some sloppy modeling choices.”

        +1.

        People fault pollsters and forecasters for getting the election wrong, but it really comes down to the increasing difficulty of acquiring representative samples. Nonresponse, cell-phone-only households, the growing reliance of pollsters on Internet samples (especially in the UK), and so on mean that we simply don’t get accurate snapshots of the actual pool of voters. If we did, then predicting the election should be rather straightforward (assuming that voters don’t shift their preferences all that much during the campaign, especially as election day draws near).

        Andrew: What effect do you think the adjustments that pollsters, aggregators, and/or forecasters make at multiple points have on the accuracy of the prediction? For example, suppose a pollster weights the data to account for factors like census demographics, prior turnout, and estimated partisanship. Then a forecaster comes along and also adjusts the data using the so-called “special sauce.” To what extent might some of these recent failures be a product of adjustments like this working in the same (wrong) direction?

  1. I think what people tend to forget as well is:

    1. undecided voters are a large bloc and introduce more uncertainty than is estimated,
    2. people change their minds on election day.

    • Ulrich:

      Yes. The advantage of our purely empirical effort was that we could add up all sources of error in the wild and see how bad things can ordinarily be, hence the lack of surprise of a 2% error. Then the next step is to go into the details of this particular election and look at all the different ways the error can have happened.

  2. Can you explain probability to me like I’m a 5-year-old? What does it mean when you say HRC has an 80% chance of winning the election? If this election had been run 100 times, HRC would win 80ish of them, and DJT would win 20ish of them. And now we end up in one of the 20ish parallel universes?

    • Xname:

      Hmm, that’s a tough one. I’ve talked with 5-year-olds about probability, but I can’t recall the discussion being numerical. I’ll say that something is probably going to happen, or probably not going to happen. Sometimes Sophie will ask me which of two options will happen and I’ll say I don’t know, but she won’t accept that as an answer. She’ll say, insistently, But what do you think. It’s a kind of forced-choice response. Next time maybe I’ll tell her I’m 80% sure and I’ll see how she reacts.

      When speaking about the foundations of probability to college-educated adults, I point them to chapter 1 of Bayesian Data Analysis, which discusses probability in the context of several examples including sports betting, record linkage, and spell checking. The short answer is that I typically think of any probability in the context of some “reference set” of similar events. In this case, that would be other elections. Before the recent election I was asked what it meant to say that Hillary Clinton had a 90% chance of winning. I said that she had a small but persistent lead in the polls, and in 10 elections (thus, a 40-year period), one might expect the polls to be far enough off that a candidate polling at 50% of the two-party vote could end up losing.

    • > If this election has been done 100 times, HRC would win 80ish of them,

      Surely not. Except perhaps at the quantum level, the world is deterministic.
      The exact same thing would happen every time. For something to change, you’d have to intervene and do things differently.

      It’s just like flipping a coin. If you could go back in time and flip the coin again, doing it exactly the same way you did the first time, you’d get the same result.

        • In 80% of the possible scenarios consistent with the information gathered from polls and our interpretation of how things work (ie. the model) Clinton would have won.

          If our interpretation of how things work is wrong, then it’s not surprising that something unusual under that interpretation happened. In this case, our interpretation of how things worked was that although individual polls could have errors, after pollsters did their modeling, a time series of those polls would on average give us exactly the correct answer.

          In fact, we knew, and should have included in our models, that it could only give us the correct answer to within 1-5 percent, due to the modern environment for polling (response rates around 10%, people with lots of different telephone technology, etc.). That is, there could have been a bias in all of the polls, and we didn’t know precisely how big or in what direction.

      • >> If this election has been done 100 times, HRC would win 80ish of them,

        >Surely not. Except perhaps at the quantum level, the world is deterministic.

        The interpretation that I like is to consider a set of outcomes in which the data we use (i.e., polling data, or aggregated polling data) is the same as the data we have in this case. In one of the elements of that set, we have HRC running against DJT, but since these names are not a part of the data we use to make our model, our theoretical set has a lot of other candidates running against each other who are not HRC and DJT; they just have the same polling data. In 80 percent of the elements of the set, the candidate with HRC’s polling data wins. But HRC deterministically was not one of those.

        • Think of it more like this, because I think it’s a better, more “deterministic” viewpoint:

          There are around 300M people in the US; x percent were going to vote in the election, and of that x percent, y percent vote for HRC and z percent vote for DJT.

          Then there are a variety of “probes” of this state of the universe that we call polls. The polls work in a certain way, in terms of random digit dialing and then people ignoring or responding to the polls, etc.

          The information gained from the polls, together with our knowledge of how the polls work, is *consistent* with a large number of possible states of the vector (x, y, z), and relatively inconsistent with other states of that vector.

          If you were to break the (x, y, z) vectors out into a dense set of discrete possibilities (a lattice on x, y, z) and keep those that were consistent with the polling data and our understanding of the polling methods, 80% of the lattice points would have y > z.

          Now, this is a heuristic, not exactly how things work: there’s no need for lattice points to be equally weighted, so “80 percent of them” isn’t quite the right way to think about it, and “consistent” versus “not consistent” isn’t really binary. But if you start with this idea, you can get farther without getting mixed up than if you imagine “repeated elections with other candidates” or alternative universes or some such idea. (A rough numerical sketch of this heuristic follows below.)
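
          To make the heuristic concrete, here is a minimal sketch with made-up numbers of my own, not anything from the comment above. It collapses the (x, y, z) lattice to a one-dimensional grid over y (HRC’s share of the two-party vote), weights each grid point by its consistency with a hypothetical poll aggregate, and reports the weighted fraction of points with y > z:

          ```python
          # Toy "lattice of possible states" sketch; all numbers are illustrative.
          import numpy as np

          poll_share = 0.52   # hypothetical poll aggregate: HRC's two-party share
          poll_n = 2500       # hypothetical effective sample size
          extra_sd = 0.02     # assumed possible shared (nonsampling) poll bias

          # Lattice over y = HRC's true share of the two-party vote; z = 1 - y.
          y = np.linspace(0.40, 0.60, 2001)
          z = 1.0 - y

          # "Consistency" weight: sampling error plus the shared-bias allowance.
          sd = np.sqrt(poll_share * (1 - poll_share) / poll_n + extra_sd**2)
          weight = np.exp(-0.5 * ((y - poll_share) / sd) ** 2)
          weight /= weight.sum()

          print("weighted fraction of lattice points with y > z:",
                round(weight[y > z].sum(), 3))
          ```

          With these inputs the answer comes out near 0.8; shrinking the assumed shared bias pushes it toward 1, which is the sharpening-versus-calibration trade-off discussed further down the thread.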

        • +1

          The probability was 80% **conditional on the information we had**. If we had had more/better information, modelers would have predicted a lower probability. Given enough unbiased information (e.g. if we could peer directly inside everybody’s mind), we could have pushed the probability all the way down to zero, which was the correct probability **conditional on all possible information**.

          In other words, at the human level the world is deterministic (all probabilities **conditional on all possible information** are zero or one). However, we do not have all the possible information. Therefore we have to make educated guesses about the information we don’t have. This leads to probabilities between zero and one simply because we don’t have enough info to know if zero or one is correct.

        • “The interpretation that I like is consider a set of outcomes in which the data we use (i.e. polling data, or aggregated polling data) is the same as the data we have in this case.”

          You can “like” what you want. When you stubbornly use corrupted data, your preferences & your outputs become unimportant.

          GIGO

  3. Since I’m the one who kicked off what you describe as the “slightly sophisticated view” (which badge I will wear with pride!), I agree with you completely as to the more sophisticated view. But since we agree that essentially all the error is non-sampling error, and since, with notable exceptions, few pollsters will open up their black boxes to show exactly how they have reweighted their raw data to account for nonresponse, likelihood of voting, and (if any of them do) lying of one sort or another (this includes the euphemistically named social desirability bias as well as other strategic concerns), it’s not clear how you hope to learn anything. Your post on the Florida experiment makes that pretty clear, doesn’t it, where a bunch of polls with identical raw data saw Florida differently but it was impossible to pry open the results when most of the sauce was secret sauce. I mean, I guess you could try to reverse-engineer the adjusted polls, but hoo-boy, that sounds impossible.

    • Jonathan:

      In this election, at the very least I think we’ve learned that we need to be a bit more serious about turnout modeling. All the discussion of people lying to pollsters (which I don’t think was happening) distracted people from the difficulties of predicting turnout.

      • Agreed again. But predicting turnout sounds to me…ummm… impossible? First, what we’re trying to predict is not actually turnout, but differential turnout with respect to measurable characteristics. What gives us even the vaguest hope that these things are stable in any way that allows past elections to serve as a guide? I certainly grant that we can get better at estimating the standard error *due* to turnout error, but while increasing the standard error due to faulty turnout estimation would’ve raised Trump’s ex ante probability of winning, it would have done so simply through our realization that polls aren’t as valuable as we thought. What I’m saying is, what evidence do we have that a (say) 15% probability for Trump wasn’t *exactly* correct, owing to fundamental, one-off, unestimable aspects of turnout error? When a quantum experiment yields a 50% chance of a particle going through each slit, we can study why our model was this uncertain all we want, but taking our modeling more seriously isn’t going to accomplish anything.

        • Just to be clearer, turnout is to a large extent a choice variable of the candidates. Both campaigns have polls giving relatively precise estimates of the preferences of left-handed truck drivers and have data showing the prevalence of left-handed truck drivers by state. At this point, assuming there are enough of them, each campaign does what it can to either increase or decrease (as the case may be) turnout among left-handed truck drivers. The extent to which they do so is *entirely* dependent on these observed relative differences, and their success is a function of their particular relative skills, on which we have no really interesting data, since each side attempts to learn from its experience in all previous elections. So if turnout is the result of a sui generis campaign, the pollster has the wrong data to estimate turnout.

        • Predicting turnout seems difficult (at least until the weather forecast is available). But could the problem be in qualifying “likely voters”?

          Is there a differential there, e.g. “likely voter” for Trump had a 90% chance of showing up, whereas a “likely voter” for Hillary had a 75% chance?

          It was noticed even by TV pundits that the size and enthusiasm at Trump’s rallies were Yuge relative to Hillary’s.

        • (trying to clarify my previous comment) i.e. the problem/solution might be in changing how we think about qualifying people as “likely voters”.

        • @zbicyclist, I applaud this thinking. But I am suspicious of our ability to do this effectively. In my (large, very rural) voting district, there are two polling places, each expecting to serve ½ of 3,000 registered voters. One polling place had 1,050 voters; the other had 1,205; the highest % turnout in memory — by 30% or more. What could have permitted me to predict likelihood without brain scans of these registered voters?

        • There were claims early in the campaign that Trump hired “extras” to attend his rallies and be enthusiastic. Is there any evidence that this was indeed true, either in early rallies and/or in later ones?

        • Hey Martha, did you hear that claim on the Clinton News Network? Did you bother to attend a Trump rally? Honestly, most of you need to spend less time thinking about correlations etc etc etc & more time thinking (& participating in) Cognitive behavioral therapy.

          Note: I never attended a Trump rally, but I know people who did & I know 1000% that their comments carry more weight than anything on the Clinton News Network in the NYT or in jeff bezos’ blog, aka WaPo.

        • The only stories I saw were about hiring extras to cheer during his initial announcement. That was back when hardly anybody took him seriously.

        • “it would have done so simply through our realization that polls aren’t as valuable as we thought.”

          Bingo. Realization is what all of you need. Realize these simple realities:
          The media is completely corrupt.
          Pollsters are completely corrupt (being controlled by & in debt to the media).

      • One thing we found when looking into the 2015 UK polling miss was that what initially appears to be differential turnout error can actually be a problem of sample representativeness. In particular, when you have too many voters and not enough non-voters in your sample, you end up believing that the electorate will look too much like the general population (or at least too much like all registered voters, if that’s your population). Essentially, over-engaged samples substitute non-voters for voters from low-turnout demographics. (A toy simulation of this mechanism is sketched below.)

        We’ve got a blogpost with a bit more detail on this:
        http://www.britishelectionstudy.com/bes-impact/what-the-2015-polling-miss-can-tell-us-about-the-trump-surprise
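
        As a rough illustration of that mechanism (my own invented numbers, not the British Election Study analysis), here is a small simulation in which actual voters are three times as likely to respond to a survey as non-voters, every respondent is treated as a likely voter, and the sample is weighted back to census demographics only:

        ```python
        # Over-engaged sample masquerading as a turnout miss; numbers are invented.
        import numpy as np

        rng = np.random.default_rng(0)
        N = 1_000_000

        # Group A: 60% of adults, 70% turnout, 45% prefer candidate X.
        # Group B: 40% of adults, 40% turnout, 65% prefer candidate X.
        group_a = rng.random(N) < 0.6
        turnout_p = np.where(group_a, 0.70, 0.40)
        prefers_x_p = np.where(group_a, 0.45, 0.65)
        voted = rng.random(N) < turnout_p
        prefers_x = rng.random(N) < prefers_x_p
        true_share = prefers_x[voted].mean()

        # Engaged (actual) voters respond three times as often as non-voters.
        respond_p = np.where(voted, 0.03, 0.01)
        sampled = rng.random(N) < respond_p

        # Weight respondents back to census demographics (60/40) only, and
        # treat everyone sampled as a likely voter.
        share_a = group_a[sampled].mean()
        weights = np.where(group_a[sampled], 0.6 / share_a, 0.4 / (1 - share_a))
        est_share = np.average(prefers_x[sampled], weights=weights)

        print(f"X's true share among actual voters: {true_share:.3f}")
        print(f"survey estimate:                     {est_share:.3f}")
        ```

        The survey overstates candidate X by a couple of points: the low-turnout group gets its full census weight even though it is underrepresented in the actual electorate, which is exactly the “electorate looks like the general population” failure described above.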

    • It’s pathetic to claim bragging rights for a so-called ‘less sophisticated’ view, but screw it, I’m going to. Jonathan, I think I was actually the first to kick it off, even slipping in the term ‘black swan’. This was on November 9, I think; not sure, the last few days have been a blur.

      • I was joking, but in the battle of the Jonathans, I would point out that I don’t think there’s anything even close to a “Black Swan” event here. Guys like Sam Wang aside, I don’t see that anyone got standard errors or expected values wrong.

  4. I think you’re really missing the cognitive bias. We had some reasons to distrust the polls (the Florida example of adjustment, the predictions of fundamentals-based models such as the ones used by Bartels), but we thought Trump was extremist enough that he would lose some percentage points. And the polls fit that. I remember a post of yours in which you estimated that an extreme candidate would lose one to two percentage points in the general election. And what about it? Does the result mean that the candidate’s position doesn’t matter? That campaigns don’t matter?

      • IDK, he is an extremist in that he advocates extreme measures and solutions (e.g., the Muslim ban). However, he doesn’t appear to be too particular with regard to what he is extreme about.

        From Wikipedia:
        Laird Wilcox identifies 21 alleged traits of a “political extremist”, ranging from behaviour like “a tendency to character assassination”, through hateful behaviour like “name calling and labelling”, to general character traits like “a tendency to view opponents and critics as essentially evil”, “a tendency to substitute intimidation for argument” or “groupthink”.

        • Calling Trump an ideological extremist seems like substituting character assassination, name calling, and moral judgement for argument. I don’t think anyone who is intellectually honest can take Trump’s “proposed” Muslim ban seriously. But it does, of course, serve those who want to engage in character assassination, name calling, and moral judgement well.

        • I don’t think anyone who is intellectually honest can take Trump’s “proposed” Muslim ban seriously.

          “Taking it seriously” is a vague term. And Trump’s campaign has been inconsistent on this point. (This USA Today article reviews some of the wavering.)

          But Trump did clearly propose a total ban on Muslim immigration. Not in an off-the-cuff manner, but in a December 2015 press release on his own website. It begins with this sentence: “Donald J. Trump is calling for a total and complete shutdown of Muslims entering the United States until our country’s representatives can figure out what is going on.”

        • I guess this is all in the eye of the beholder, but I think Ian was also referring to the left-right ideology spectrum. Proposing to seriously ban an entire religion of people from immigrating to the country, and additionally entertaining the possibility of databasing their American coreligionists, sounds awfully extremist and right-wing to me, at least in the vein of nativism and right-wing populism.

        • Nearly all of you are blinded by bias. 1 OBVIOUS example:
          “Laird Wilcox identifies 21 alleged traits of a “political extremist”, …”

          That is describing Hillary Clinton & if you weren’t blind, you’d see that.

  5. Humm, upon further reflection, I think I was not clear. Let me rephrase what I wrote. I have 2 points: 1. We wanted to believe that Trump wouldn’t be elected, and thus we dismissed contradictory evidence and looked for what confirmed our bias.

    2. How on earth could Trump be elected? Does it mean that extremism doesn’t matter?

  6. As far as I know, most popular election models do not use “ideology” or “extremism” as a predictive factor so a candidate ‘being out of the norm’ doesn’t directly contribute to the election forecast.

    We would probably expect ideology and perception or misperception of ideology to affect turnout though, but it could drive both sides in strange asymmetric ways. How are you going to try to measure ideology or perception of ideology, especially for a candidate that has no voting record on legislation?

    But here’s the question of the day, what factors do we use to model potential turnout? Does only the candidate at the top of the ticket drive turnout? What about Governors or referendum or the extremist dog catcher in your neighborhood? Sounds like it’s time for a multi-level model!

    • Sean:

      An influential book in political science is Steven Rosenstone’s Forecasting Presidential Elections, from 1983. Rosenstone explicitly discusses the electoral benefits of moderation. There’s other work on this topic, including from me. The short answer is that theory suggests an electoral benefit of moderation and empirical evidence shows a benefit too, but a small benefit, small enough that it can often make sense for parties to run more ideologically extreme candidates if the potential policy benefit is worth a slight risk of losing the election.

  7. I like the perspective in Gneiting, Balabdaoui, and Raftery’s paper “Probabilistic forecasts, calibration and sharpness.” The basic point is that for two predictively calibrated models (a model is calibrated if 10% of its 10% predictions come true, 20% of its 20% predictions come true, and so on), the one with sharper predictions is better (a sharp prediction is one near 0% or 100%).

    Sophie’s simply rejecting your dull model and asking you to build a sharper one! But be careful: if it’s not calibrated, she may be disappointed. (A toy illustration of calibration versus sharpness is sketched below.)
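
    To make that distinction concrete, here is a minimal sketch with simulated data of my own (not from the Gneiting et al. paper): two forecasters are both calibrated, but one is sharp and the other always issues the base rate.

    ```python
    # Calibration vs. sharpness on simulated races; all data are simulated.
    import numpy as np

    rng = np.random.default_rng(1)
    p_true = rng.beta(2, 2, size=10_000)       # true win probabilities
    outcome = rng.random(10_000) < p_true      # realized outcomes

    def report(name, forecast):
        print(name)
        bins = np.clip((forecast * 10).astype(int), 0, 9)
        for b in range(10):
            mask = bins == b
            if mask.any():
                print(f"  forecasts near {forecast[mask].mean():.2f} "
                      f"came true {outcome[mask].mean():.2f} of the time")
        print(f"  sharpness (mean |forecast - 0.5|): "
              f"{np.abs(forecast - 0.5).mean():.3f}\n")

    report("Sharp and calibrated (uses the true probabilities):", p_true)
    report("Dull but calibrated (always forecasts the base rate):",
           np.full(10_000, outcome.mean()))
    ```

    Both forecasters pass the calibration check (each bin’s forecasts come true about as often as stated), but only the first is sharp, which is why it is the more useful model.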

    • This is what I was ineptly trying to explain the other day. Some models optimize for predicting close to 0/1 and other models optimize for managing a betting operation (where you do not want to wrongly offer long odds but you still want people to bet).

  8. Just want to say that as a PhD student in statistics, I found your blog to be very clear and informative. I discovered it in the past few days and have been sharing your explanations and posts with a variety of people. Thanks for taking the time to write on these topics and dig into a variety of issues, including the lead and crime research.

  9. I’m late to this discussion.
    If a roulette wheel comes up black 15, I think that’s just the fall of the ball. If black 15 comes up twelve times in fifty spins, then I think the wheel is out of whack. Looking at recent political outcomes, I am entertaining the hypothesis that political polling is akin to astrology, tossing bones, and inspecting the bowels of animals.

    • Slugger:

      I don’t think the polls came up “15” twelve times in fifty spins. Consider some recent surprise elections:

      – Trump wins in Republican primaries. Trump was ahead in polls in Republican primaries. Polls did not fail. Blame the pundits who set aside the polls, don’t blame the polls.

      – Brexit wins with 52% of the vote. Polls predicted a close election. Even when polls showed Leave in the lead, prediction markets favored Remain. Pundits followed the prediction markets. Blame the pundits and market players who couldn’t handle the uncertainty indicated by the polls.

      – Trump wins presidential election with 50% of the two-party vote. Polls predicted 52%. Some poll aggregators were giving Trump a 2% chance of winning. Blame these guys who had no sense of history.

      I’m not saying the polls were perfect: as discussed in my various posts, a 2% error is still an error, and it looks like there was a misalignment between who was responding to the polls, and the population of voters. But to say that polling is like astrology, tossing bones, inspecting the bowels of animals, or papers in Psych Sci and PPNAS . . . I don’t think so.

      • And, oh by the way, Hillary Clinton was repudiated by the Democrats eight years ago in favor of an unknown black Senator from Illinois. She had “too much baggage”. In 2016 the baggage is greater. Why would we be surprised that the non-Democrats took their turn at repudiating her? What would a Bayesian make of this?

        • Randall:

          Our usual story is that what matters in the primary election is different from what matters in the general election. Donald Trump and Hillary Clinton both had lots of baggage, and the assumption was that this would be captured in the polls. But it does seem that we were not thinking seriously enough about turnout, hence that 2%. In many settings, a 2% error doesn’t matter much. But this time it did.

  10. On Brexit, I remember the Economist admitting their model showed Brexit winning, but they didn’t believe it, so they altered the model at a British equivalent of the precinct level. From the few things I’ve read from actual pollsters this time, it appears they weighted responses in a way that adjusted their determination of turnout. That completely fits the reality of experience-based models: they project based on past performance, and that ties them to a series of assumptions that can fail.

    One hopes they fail in a far-out tail event – like the arguments about what is or isn’t a “black swan”, etc., etc. – but they failed here because they’d succeeded recently, which says to me they’re terrific models as long as they work! Rolling averages and other momentum models are really great when they work, too, and the same is true of asset-correlation models: they work great as long as the correlations estimated from past correlations continue to hold!

    One problem: people won’t accept a broader range of error and clearly prefer a series of apparently definite statements about the range of outcomes rather than hear “well it could be this”.

    I do have a general question, which is that, looking at the urban black turnout, I have to wonder what they did to adjust from Obama’s huge numbers to any candidate not identified as black. That to me is interesting because we have historical data and more recent data, and it’s a complete mystery how one could say “this is what black turnout will be”: does it revert? Are people motivated differently now? And so on. As David Plouffe notes, the drop in black urban turnout was greater than the vote differentials in MI and WI.

    • I would love to see what would happen if the polling models for this 2016 election were applied to the 2012 election (instead of comparing election to election, which is of course how reality has changed). I suspect, looking at the residuals, they would do very, very well in (retro)predicting 2012. It would at least show where the demographic models are out of date.

  11. What about examining the quality of our inputs? Are polls good quality? Can we improve our measurements? Are we trying to predict outcomes from measurements too noisy to call?

    I think there’s too much focus on models but not as much on the quality of the measurements. Aren’t we committing the same blunder as the Psych Researchers?

  12. The Huffington Post gave Hillary a 98% chance of winning and then called out Nate Silver as a hack for giving Trump a double-digit probability. I assume that your post is targeted at certain models and that HuffPost’s must be excluded somehow. But in that vein, were there “conservative” models out there that were dismissed? Why were they dismissed? If they were dismissed over their nonresponse-bias adjustments, did the authors try to support those adjustments, or were they wishful thinking? If the “conservative” models still had Trump losing, then that suggests there really wasn’t data out there that could have improved the models (unless there was a herding effect – they ignored their own private signal). As you have already pointed out, this is really on the margin – the models generally showed a tight race, and they were right. It’s just unclear which models you are comparing and which you are throwing out.

  13. …we can learn from our failures and even from our successes…

    I agree wholeheartedly with this. I wonder what we would all be talking about if HRC had won by *more* than a few percent, and won a few states that we didn’t expect? That is, if the outcome had been opposite, but just as far from the consensus predictions? I am not sure we would be saying that the polls were all wrong and things need to be checked! Let’s try always to be critical of predictions and try to learn from them by “opening the black box” ex post facto.

    • David:

      Along those lines, I saw a post-election item somewhere that was, approvingly, quoting someone who, before the election, had said that Trump would win easily. But of course that didn’t happen either: from a quantitative perspective that was a poor prediction.

  14. Also, 538 had a Plan B since they forecast the race being tight. They turned off their model for close races and, I believe, recalibrated the signal on the remaining wins. Their blog post Tuesday evening was brief, but perhaps you have more insight on what Plan B entailed. Also, it is similar to your exercise that evening of updating probabilities as results came in. Perhaps a discussion of election forecasts and election nowcasts? Or whether or not political science has refined the nowcast methods?

  15. What I find interesting about this election is that while the forecasters were aware of their models’ errors in accurately predicting percentages for the candidates in swing states, they didn’t fully consider the possibility of their models being wrong. 538 did the best job of this, but they still failed. I don’t know if any model that would be applicable to other election years could have predicted all the states, but they certainly could have done better. What really bugs me is that some models use the statistical methods of meta-analysis or Monte Carlo methods, but they aren’t doing the analysis well. I tried my hand at predicting the election (I am an undergraduate student, so I kept it very basic) using a Bayesian approach assuming a normal posterior. I know my model is far from ideal, but I wanted to try to understand the process. The point is that there isn’t some simple answer to predicting an election using basic methods. The data are imperfect, and people took the previous success of these models (which wasn’t really success, considering the error in the predicted mean) as an indication of future success. Perhaps the use of non-Gaussian distributions and other methods should be considered.

    • Brittany:

      I disagree with your statement that modelers such as Nate “didn’t fully consider the possibility of their models being wrong” and that they “failed.” Including a variance term in the model for nonsampling error is how to fully consider the possibility of the model being wrong. See the last quoted bit here. That said, one should try to figure out where the nonsampling error is coming from, in any particular case. (A toy numerical illustration of the variance-term point follows below.)
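
      As a rough numerical illustration of that point (my own made-up numbers, not any forecaster’s actual model), here is how an extra variance term for shared nonsampling error moderates a win probability under a simple normal approximation:

      ```python
      # Effect of a nonsampling-error variance term on a win probability.
      from math import erf, sqrt

      def win_prob(mean, sd):
          """P(two-party vote share > 50%) under a normal approximation."""
          return 0.5 * (1 + erf((mean - 0.50) / (sd * sqrt(2))))

      mean_share = 0.52      # hypothetical poll-aggregate share for the candidate
      sampling_sd = 0.01     # sampling-based standard error
      nonsampling_sd = 0.02  # assumed possible shared polling bias, sign unknown

      print("ignoring nonsampling error: ",
            round(win_prob(mean_share, sampling_sd), 3))
      print("including nonsampling error:",
            round(win_prob(mean_share, sqrt(sampling_sd**2 + nonsampling_sd**2)), 3))
      ```

      With these numbers the win probability drops from roughly 98% to roughly 81%: the forecast still favors the leading candidate, but a 2% polling error is no longer treated as nearly impossible.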
