
Updating the Forecast on Election Night with R

Pierre-Antoine Kremp made this cool widget that takes his open-source election forecaster (it aggregates state and national polls using a Stan program that runs from R) and computes conditional probabilities.

Here’s the starting point, based on the pre-election polls and forecast information:

[Screenshot: the widget’s starting forecast, captured 2016-11-08 at 12:09 pm]

These results come from the fitted Stan model which gives simulations representing a joint posterior predictive distribution for the two-party vote shares in the 50 states.

But what happens as the election returns come in? Kremp wrote an R program that works as follows. He takes the simulations and approximates them by a multivariate normal distribution. (Not a bad approximation, I think, given that we’re not going to be using this procedure to estimate extreme tail probabilities; also remember that the 50 state outcomes are correlated, and that correlation is implicitly included in our model.) Then, when he wants to condition on any outcome (for example, Trump winning Florida, New Hampshire, and North Carolina), the program draws a bunch of simulations from the multivariate normal, keeps only the simulations that satisfy the condition, and computes an electoral college distribution from there.
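Here is a minimal sketch of that procedure in base R, to make the steps concrete. It is not Kremp's code: the three states, their electoral votes, and the fake "posterior simulations" are all made up for illustration, and the proposals are drawn with a Cholesky factor rather than with mvtnorm so the snippet has no dependencies.

```r
set.seed(2016)

# Fake "posterior simulations" of Clinton's two-party share (in %) for three
# hypothetical states. A shared national swing makes the states correlated.
n_post   <- 4000
national <- rnorm(n_post, 0, 2.5)
sims <- cbind(A = 51 + national + rnorm(n_post, 0, 1.5),
              B = 50 + national + rnorm(n_post, 0, 1.5),
              C = 53 + national + rnorm(n_post, 0, 1.5))
ev <- c(A = 29, B = 15, C = 20)   # hypothetical electoral votes

# Step 1: approximate the simulations by a multivariate normal.
mu    <- colMeans(sims)
Sigma <- cov(sims)

# Step 2: draw proposals from that normal (base R, via the Cholesky factor).
n_draw    <- 100000
z         <- matrix(rnorm(n_draw * length(mu)), nrow = n_draw)
proposals <- sweep(z %*% chol(Sigma), 2, mu, "+")
colnames(proposals) <- names(mu)

# Step 3: keep only the draws that satisfy the condition (Trump wins A).
keep <- proposals[, "A"] < 50
kept <- proposals[keep, , drop = FALSE]

# Step 4: Clinton's electoral votes in each draw, before and after conditioning.
ev_total <- function(draws) as.vector((1 * (draws > 50)) %*% ev)
p_before <- mean(ev_total(proposals) > sum(ev) / 2)
p_after  <- mean(ev_total(kept) > sum(ev) / 2)
cat(sprintf("Pr(Clinton majority): %.2f unconditional, %.2f given Trump wins A\n",
            p_before, p_after))
```

The conditional probability is just a proportion among the kept draws, which is why its precision depends on how many draws survive the filter.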

Hey, let’s try it out! It’s all in R:

> source("update_prob.R")
> update_prob()
Pr(Clinton wins the electoral college) = 90%
[nsim = 100000; se = 0.1%]
> update_prob(trump_states = c("FL", "NH", "NC"))
Pr(Clinton wins the electoral college) = 4%
[nsim = 3284; se = 0.3%]

OK, if Trump wins Florida, New Hampshire, and North Carolina, then Clinton’s toast. This isn’t just from the “electoral college math”; it’s also because the votes in different states are correlated in the predictive distribution. Trump winning these 3 states is not just a bunch of key electoral votes for him; it would also be an indicator that he’s doing much better than expected nationally.
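To see how much work the correlation is doing, here is a toy comparison with two hypothetical states, A and B. The means, standard deviations, and the correlation of 0.8 are assumptions chosen for illustration, not values from the actual model.

```r
set.seed(42)
n <- 200000

# Hypothetical forecast: Clinton's two-party share (%) in states A and B,
# each with mean 52 and sd 3. Only the correlation differs between scenarios.
draw_pair <- function(rho) {
  a <- rnorm(n)
  b <- rho * a + sqrt(1 - rho^2) * rnorm(n)   # corr(a, b) = rho
  cbind(A = 52 + 3 * a, B = 52 + 3 * b)
}

# Pr(Clinton wins B | Trump wins A), estimated by keeping draws with A < 50.
p_b_given_trump_a <- function(draws) {
  kept <- draws[draws[, "A"] < 50, , drop = FALSE]
  mean(kept[, "B"] > 50)
}

p_indep <- p_b_given_trump_a(draw_pair(rho = 0))
p_corr  <- p_b_given_trump_a(draw_pair(rho = 0.8))
cat(sprintf("independent: %.2f   correlated: %.2f\n", p_indep, p_corr))
```

With independent states, learning that Trump won A tells you nothing about B; with correlated states, the same news drags Clinton's chances in B down sharply.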

What if Trump wins Florida and North Carolina but loses New Hampshire?

> update_prob(trump_states = c("FL", "NC"), clinton_states=c("NH"))
Pr(Clinton wins the electoral college) = 58%
[nsim = 15582; se = 0.4%]

Then it could go either way.
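A note on the se values in these printouts: the post doesn't say how update_prob() computes them, but they are consistent with the usual binomial standard error of a proportion, sqrt(p(1-p)/nsim). Treat that as my assumption. A quick check, using the rounded probabilities and nsim values from the three runs above:

```r
# Binomial standard error of a proportion p estimated from nsim draws.
sim_se <- function(p, nsim) sqrt(p * (1 - p) / nsim)

# Probabilities and retained-simulation counts copied from the runs above.
runs <- data.frame(p = c(0.90, 0.04, 0.58), nsim = c(100000, 3284, 15582))
runs$se_pct <- round(100 * sim_se(runs$p, runs$nsim), 1)
print(runs)
```

This also shows why the standard error grows as conditions pile up: fewer simulations survive, so nsim shrinks.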

Hmmm, let’s try some more things. Slate’s “Votecastr” project says here that “Based on the 1.66 million early votes VoteCastr has run through its model, Clinton leads Trump by 2.7 points, 46.3 percent to 43.6 percent.” That’s .463/(.463+.436) = 51.5% of the 2-party vote.

Let’s take this as a guess of the outcome in Colorado, with an uncertainty of +/- 2 percentage points, so that Clinton gets between 49.5% and 53.5% of the two-party vote:

> update_prob(clinton_scores_list = list("CO" = c(49.5, 53.5)))
Pr(Clinton wins the electoral college) = 88%
[nsim = 73403; se = 0.1%]

That barely moves the forecast (88%, compared to 90% with no conditions): she’s expected to get around 53% of the vote in Colorado, so a close vote in that state would, by itself, not convey much additional information. (The computations in Kremp’s program are not saved from one call to the next; that is, my inference just above does not condition on my earlier supposition of Trump winning Florida, New Hampshire, and North Carolina.)

Hey, here’s some news! It says that, based on the early vote in Florida, Clinton leads Trump 1,780,573 votes to 1,678,848. Early vote isn’t the same as total vote but let’s take it as a starting point, so that’s a 2-party vote share of 1780573/(1780573 + 1678848) = .515 for Clinton in Florida. Let’s put that in the program too, again assuming it could be off by 2 percentage points in either direction:

> update_prob(clinton_scores_list = list("CO" = c(49.5, 53.5), "FL" = c(49.5, 53.5)))
Pr(Clinton wins the electoral college) = 96%
[nsim = 57108; se = 0.1%]

Hey, this looks like good news for Hillary.

Now let’s feed in some more information from that same page:

Iowa: Clinton 273,188, Trump 244,739. Clinton share 273188/(273188 + 244739) = 52.7%

Nevada: Clinton 276,461, Trump 269,255. Clinton share 50.7%

Ohio: Clinton 632,433, Trump 579,916. Clinton share 52.2%

Pennsylvania: Clinton 85,367, Trump 99,286. Clinton share 46.2%

Wisconsin: Clinton 295,302, Trump 225,281. Clinton share 56.7%
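The same two-party arithmetic can be done for all of these states at once. This is just a convenience sketch; the data frame and column names are my own, and the counts are the ones quoted above.

```r
# Early-vote counts quoted above (Clinton, Trump).
early <- data.frame(
  state   = c("FL", "IA", "NV", "OH", "PA", "WI"),
  clinton = c(1780573, 273188, 276461, 632433, 85367, 295302),
  trump   = c(1678848, 244739, 269255, 579916, 99286, 225281)
)

# Clinton's share of the two-party (Clinton + Trump) vote, in percent.
early$clinton_share <- round(100 * early$clinton / (early$clinton + early$trump), 1)
print(early)
```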

Hmmm . . . I don’t really trust those Pennsylvania numbers, based on such a small fraction of the vote. Wisconsin also seems a bit extreme. But let’s just run it and see what we get:

> update_prob(clinton_scores_list = list("CO" = 51.5+c(-2,2), "FL" = 51.5+c(-2,2), "IA" = 52.7+c(-2,2), "NV" = 50.7+c(-2,2), "OH" = 52.2+c(-2,2), "PA" = 46.2+c(-2,2), "WI" = 56.7+c(-2,2)))
Error in draw_samples(clinton_states = clinton_states, trump_states = trump_states,  : 
  rmvnorm() is working hard... but more than 99.99% of the samples are rejected; you should relax some contraints.

That’s kind of annoying; I’d rather have a more graceful message. The short version is that there are so many constraints here, and some of them are so inconsistent with the model, that the rejection sampling algorithm fails: almost none of the draws satisfy all the conditions at once.
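To get a feel for how quickly rejection sampling runs out of draws, here is a toy experiment. The forecast is made up (every state gets mean 52, sd 3, and a between-state correlation of about 0.7), so the numbers will not match Kremp's program; only the interval centres are taken from the early-vote shares above.

```r
set.seed(7)
n_states <- 7
n_draw   <- 200000

# Hypothetical forecast: every state has mean 52 and sd 3, with a shared
# national swing inducing a correlation of about 0.7 between states.
swing <- rnorm(n_draw, 0, 3 * sqrt(0.7))
noise <- matrix(rnorm(n_draw * n_states, 0, 3 * sqrt(0.3)), nrow = n_draw)
draws <- 52 + swing + noise   # swing[i] is added to every state in row i

# Intervals of +/- 2 points around the early-vote shares used above.
# The 6th and 7th centres pull in opposite directions, far from the forecast.
centre <- c(51.5, 51.5, 52.7, 50.7, 52.2, 46.2, 56.7)
lower  <- centre - 2
upper  <- centre + 2

# Acceptance rate when only the first k constraints are imposed.
accept_rate <- sapply(seq_len(n_states), function(k) {
  ok <- rep(TRUE, n_draw)
  for (j in seq_len(k)) ok <- ok & draws[, j] > lower[j] & draws[, j] < upper[j]
  mean(ok)
})
print(signif(accept_rate, 2))
```

Each added interval can only shrink the set of surviving draws, and an interval that disagrees with the forecast (or with the other intervals, given the correlation) shrinks it drastically.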

Let’s redo, making these +/- 4% rather than +/- 2% for each state. For convenience I’ll write a function:

partial_outcomes <- function(states, clinton_guess, error_range){
  # One interval per state: guess +/- error_range, in percentage points
  stopifnot(length(states) == length(clinton_guess))
  output <- lapply(clinton_guess, function(g) g + error_range*c(-1,1))
  names(output) <- states
  return(output)
}

Now I'll run it:

> update_prob(clinton_scores_list = partial_outcomes(c("CO", "FL", "IA", "NV", "OH", "PA", "WI"), c(51.5, 51.5, 52.7, 50.7, 52.2, 46.2, 56.7), 4))
Error in draw_samples(clinton_states = clinton_states, trump_states = trump_states,  : 
  rmvnorm() is working hard... but more than 99.99% of the samples are rejected; you should relax some contraints.

Still blows up. OK, let's remove Pennsylvania and Wisconsin:

> update_prob(clinton_scores_list = partial_outcomes(c("CO", "FL", "IA", "NV", "OH"), c(51.5, 51.5, 52.7, 50.7, 52.2), 4))
Pr(Clinton wins the electoral college) = 100%
[nsim = 16460; se = 0%]

And here are the projected results for all 50 states, under these assumptions:

> update_prob(clinton_scores_list = partial_outcomes(c("CO", "FL", "IA", "NV", "OH"), c(51.5, 51.5, 52.7, 50.7, 52.2), 4), show_all_states = TRUE)
Pr(Clinton wins) by state, in %:
     AK AL AR AZ  CA  CO  CT  DE FL GA  HI IA ID  IL IN KS KY LA  MA  MD  ME  MI  MN MO MS MT NC ND NE
[1,]  0  0  0 27 100 100 100 100 96 17 100 21  0 100  0  0  0  0 100 100 100 100 100  0  0  0 94  0  0
      NH  NJ  NM NV  NY OH OK  OR  PA  RI SC SD TN TX UT  VA  VT  WA  WI WV WY ME1 ME2  DC NE1 NE2 NE3
[1,] 100 100 100 93 100 66  0 100 100 100  4  0  0  0  0 100 100 100 100  0  0 100  95 100   0  19   0
--------
Pr(Clinton wins the electoral college) = 100%
[nsim = 16523; se = 0%]

OK, you get the idea. If you think these early voting numbers are predictive, Clinton's in good shape.

How to run the program yourself

Just follow Kremp's instructions (I fixed one typo here):

To use update_prob(), you only need 2 files, available from my [Kremp's] GitHub repository:

- update_prob.R, which loads the data and defines the update_prob() function,

- last_sim.RData, which contains 4,000 simulations from the posterior distribution of the last model update.

Put the files into your working directory or use setwd().

If you don’t already have the mvtnorm package installed, you can do it by typing install.packages("mvtnorm") in the console.

To create the functions and start playing, type source("update_prob.R"), and the update_prob() function will be in your global environment.

The function accepts the following arguments:

- clinton_states: a character vector of states already called for Clinton;

- trump_states: a character vector of states already called for Trump;

- clinton_scores_list: a list of elements named with 2-letter state abbreviations; each element should be a numeric vector of length 2 containing the lower and upper bound of the interval in which Clinton’s share of the Clinton + Trump score is expected to fall.

- target_nsim: an integer indicating the minimum number of samples that should be drawn from the conditional distribution (set to 1000 by default).

- show_all_states: a logical value indicating whether to output the state by state expected win probabilities (set to FALSE by default).
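Since clinton_scores_list has the fussiest format of these arguments, it can help to check a list before passing it in. The helper below is hypothetical (check_scores_list() is my name, and it is not part of update_prob.R); it only encodes the description above, plus a check for repeated state names and for shares outside 0 to 100.

```r
# Hypothetical helper (not part of update_prob.R): check that a list has the
# shape described for clinton_scores_list before passing it on.
check_scores_list <- function(x) {
  stopifnot(is.list(x), !is.null(names(x)))
  bad_name  <- !grepl("^[A-Z]{2}$", names(x))
  dup_name  <- duplicated(names(x))
  # Each element must be a numeric (lower, upper) pair of percentages.
  bad_value <- !vapply(x, function(v) {
    is.numeric(v) && length(v) == 2 && !anyNA(v) &&
      v[1] <= v[2] && all(v >= 0 & v <= 100)
  }, logical(1))
  problems <- c(
    if (any(bad_name))  paste("not a 2-letter code:", paste(names(x)[bad_name], collapse = ", ")),
    if (any(dup_name))  paste("duplicated state:", paste(unique(names(x)[dup_name]), collapse = ", ")),
    if (any(bad_value)) paste("bad interval for:", paste(names(x)[bad_value], collapse = ", "))
  )
  if (length(problems) > 0) stop(paste(problems, collapse = "; "))
  invisible(TRUE)
}

# Passes silently for a well-formed list.
check_scores_list(list("CO" = c(49.5, 53.5), "FL" = c(49.5, 53.5)))
```

A repeated name matters more than it looks: two different intervals for the same state can be impossible to satisfy together, which shows up only as a rejection-sampling failure.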

It really works (as demonstrated by my examples above).

Running the program

Kremp just whipped up this program in a couple hours and it's pretty good. Had we had more time, we would've built some sort of online app for it. Also, on the technical level, the rejection sampling is crude, and as you start getting information from more and more states, the program breaks down.

The funny thing is, the whole thing would be trivial to implement in Stan! Had I thought of this yesterday, I could've already done it. Even now, maybe there's time.

But, for now, you can run the script as is, and it will work given data for a few states.

12 Comments

  1. Keith O'Rourke says:

    > the rejection sampling is crude, and as you start getting information from more and more states, the program breaks down.
    Not necessarily a bad thing if you don’t have time to do adequate model checking.

    > This is a very simple and arguably dumb sampling algorithm. (from Kremp’s GitHub repository).
    So it is a good first scaffolding to erect.

    Though the algorithm is not simple and obvious to some (many private comments).

    An animation of how such a dumb sampling algorithm implements Bayes logic and samples from the posterior: https://galtonbayesianmachine.shinyapps.io/GaltonBayesianMachine/

    An allegory to explain that https://phaneron0.files.wordpress.com/2012/11/orangetree.ppt

  2. Mark says:

    This would be a perfect case for a shiny app :-).

  3. Thanks Andrew!

    If anybody has time to work on a Stan model and hook it with the update_prob() function, please send me a pull request on GitHub and I will merge it!

  4. caewok says:

    With a fast computer, you can get away with a lot more samples from the posterior in a reasonable time, which lets you put in more complex restrictions. Try changing the rmvnorm number in update_prob.R to 1e6, and change the stop parameter to 99.999: nrow(sim)/(nrow(proposals)*n) < 1-99.999/100.

  5. When I type source("update_prob.R"), I get the error message:

    Error in source("update_prob.R") : update_prob.R:5:1: unexpected '<'
    4:
    5: <
    ^

    If this is due to an error of mine (as I suspect), please disregard … I'll figure it out when I get a chance.

  6. Thank you, Andrew. I tried that and got the same error–but then saw that there were some packages I hadn’t downloaded (which Kremp listed). That’s my next step.

    • I finally got this working (and figured out what was going on)! The file update_prob.R was being converted to HTML during download; to prevent this, I downloaded and unpacked the zip file. That solved the problem, and everything worked from there.

      Now I have installed RStan as well and look forward to playing with it.

  7. Could the reported “glitch” that occurred on election night for the Slate “Votecastr” project that required the data to be pulled have been caused by the BOT-HACK of the fake online BOT generated Trump “polling” that was HACKED into the VOTING MACHINES as actual Election Night votes to HIJACK the election when Hillary was so consistently far ahead? If so can you explain the process?
