
David Rothschild and Sharad Goel called it (probabilistically speaking)


David Rothschild and Sharad Goel write:

In a new paper with Andrew Gelman and Houshmand Shirani-Mehr, we examined 4,221 late-campaign polls — every public poll we could find — for 608 state-level presidential, Senate and governor’s races between 1998 and 2014. Comparing those polls’ results with actual electoral results, we find the historical margin of error is plus or minus six to seven percentage points. . . .

Systematic errors imply that these problems persist, to a lesser extent, in poll averaging, as shown in the above graph.
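To see why systematic errors survive averaging, here is a minimal simulation sketch in R. Every number in it (polls per race, sampling error, size of the shared error) is an assumption chosen for illustration, not an estimate from the paper. Each poll's error is modeled as a race-level error shared by all polls in that race plus independent sampling noise.

```r
# Toy simulation: averaging polls shrinks sampling noise but not shared error.
# All numbers below are illustrative assumptions, not values from the paper.
set.seed(1)
n_races <- 2000      # hypothetical number of races
n_polls <- 10        # hypothetical number of polls per race
sd_sampling <- 3.5   # assumed per-poll sampling error, percentage points
sd_shared <- 2       # assumed error shared by every poll in a race

shared <- rnorm(n_races, 0, sd_shared)
noise <- matrix(rnorm(n_races * n_polls, 0, sd_sampling), n_races, n_polls)
poll_error <- shared + noise       # row i gets shared[i] added to each of its polls
avg_error <- rowMeans(poll_error)  # error of each race's poll average

rmse_single <- sqrt(mean(poll_error^2))
rmse_average <- sqrt(mean(avg_error^2))
rmse_if_independent <- sd_sampling / sqrt(n_polls)  # what averaging would give with no shared error

print(round(c(single = rmse_single, average = rmse_average,
              independent_only = rmse_if_independent), 2))
```

In this setup the error of the poll average stays near the size of the shared error instead of shrinking like 1/sqrt(n_polls).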

David and Sharad conclude:

This November, we would not be at all surprised to see Mrs. Clinton or Mr. Trump beat the state-by-state polling averages by about two percentage points. We just don’t know which one would do it.

Yup. 2 percentage points. They wrote this on October 6, 2016.

Can a census-tract-level regression analysis untangle correlation between lead and crime?


Daniel Hawkins pointed me to a post by Kevin Drum entitled, “Crime in St. Louis: It’s Lead, Baby, Lead,” and the associated research article by Brian Boutwell, Erik Nelson, Brett Emo, Michael Vaughn, Mario Schootman, Richard Rosenfeld, Roger Lewis, “The intersection of aggregate-level lead exposure and crime.”

The short story is that the areas of St. Louis with more crime and poverty had higher lead levels (as measured from kids in the city who were tested for lead in their blood).

Here’s their summary:

[Image: screenshot of the authors’ summary]

I had a bit of a skeptical reaction—not about the effects of lead, I have no idea about that—but on the statistics. Looking at those maps above, the total number of data points is not large, and those two predictors are so highly correlated, I’m surprised that they’re finding what seem to be such unambiguous effects. In the abstract it says n=459,645 blood measurements and n=490,433 crimes, but for the purpose of the regression, n is the number of census tracts in their dataset, about 100.

So I contacted the authors of the paper and one of them, Erik Nelson, did some analyses for me.

First he ran the basic regression—no Poisson, no spatial tricks, just regression of log crime rate on lead exposure and index of social/economic disadvantage. Data are at the census tract level, and lead exposure is the proportion of kids’ lead measurements from that census tract that were over some threshold. I think I’d prefer a continuous measure but that will do for now.

In this simple regression, the coefficient for lead exposure was large and statistically significant.

Then I asked for a scatterplot: log crime rate vs. lead exposure, indicating census tracts with three colors tied to the terciles of disadvantage.

And here it is:


He also fit a separate regression line for each tercile of disadvantage.

As you can see, the relation between lead and crime is strong, especially for census tracts with less disadvantage.
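For readers who want to try the same steps, here is a rough sketch in R on simulated data. The data frame, its column names, and every coefficient used to generate it are made up for illustration; this is not the St. Louis data and not Erik Nelson's code. It runs the basic regression, splits tracts into terciles of disadvantage, fits one line per tercile, and draws the colored scatterplot.

```r
# Simulated stand-in for tract-level data; none of these numbers come from the paper.
set.seed(2)
n_tracts <- 100
disadvantage <- rnorm(n_tracts)
# Make lead exposure correlated with disadvantage, as in the discussion above.
lead <- plogis(-1 + 0.8 * disadvantage + rnorm(n_tracts, 0, 0.5))
log_crime <- 4 + 1.5 * lead + 0.3 * disadvantage + rnorm(n_tracts, 0, 0.4)
tracts <- data.frame(log_crime, lead, disadvantage)

# Basic regression: log crime rate on lead exposure and disadvantage index.
fit <- lm(log_crime ~ lead + disadvantage, data = tracts)
print(summary(fit)$coefficients)

# Terciles of disadvantage, then a separate regression line for each tercile.
breaks <- quantile(tracts$disadvantage, probs = c(0, 1/3, 2/3, 1))
tracts$tercile <- cut(tracts$disadvantage, breaks = breaks,
                      labels = c("low", "middle", "high"), include.lowest = TRUE)
fits_by_tercile <- lapply(split(tracts, tracts$tercile),
                          function(d) lm(log_crime ~ lead, data = d))

# Scatterplot with three colors tied to the terciles, plus the three fitted lines.
cols <- c(low = "blue", middle = "darkgray", high = "red")
plot(tracts$lead, tracts$log_crime, col = cols[as.character(tracts$tercile)], pch = 19,
     xlab = "Lead exposure (proportion over threshold)", ylab = "Log crime rate")
for (k in names(fits_by_tercile)) abline(fits_by_tercile[[k]], col = cols[k])
```

With only about 100 tracts and two correlated predictors, it is worth looking at the standard errors in the coefficient table as well as the point estimates.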

Erik also made separate plots for violent and non-violent crimes. They look pretty similar:



In summary: the data are what they are. The correlation seems real, not just an artifact of a particular regression specification. It’s all observational so we shouldn’t overinterpret it, but the pattern seems worth sharing.

How effective (or counterproductive) is universal child care? Part 2

This is the second of a series of two posts.

Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing children at each age, those who were and were not eligible for this program.

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.


The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really that horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results are similar for 6-11 year olds as for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

[Image: screenshot of the motor-social skills results from the paper’s appendix]

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpretation of a non-rejection of a null hypothesis.

And here’s their table of key results:

[Image: screenshot of the paper’s table of key results]

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

I see the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (choosing these ideally, though not usually, using preregistration), present that as the main finding and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) was what was done again, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how do you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.
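The comparison can be put in numbers with a few lines of R. The estimates and standard errors are the ones quoted above; treating the two estimates as independent when computing the standard error of their difference is an assumption made here for simplicity, and it may not hold for estimates from the same survey.

```r
# Estimates and standard errors as quoted in the text above.
est_6_11 <- 0.308; se_6_11 <- 0.080   # anxiety, 6-11-year-olds
est_2_4  <- 0.234; se_2_4  <- 0.068   # anxiety, 2-4-year-olds

# How many standard errors each estimate is from zero.
z_6_11 <- est_6_11 / se_6_11
z_2_4  <- est_2_4 / se_2_4

# Difference between the two estimates.
# ASSUMPTION: the two estimates are independent, so their variances add.
diff_est <- est_6_11 - est_2_4
diff_se  <- sqrt(se_6_11^2 + se_2_4^2)
z_diff   <- diff_est / diff_se

print(round(c(z_6_11 = z_6_11, z_2_4 = z_2_4,
              diff = diff_est, diff_se = diff_se, z_diff = z_diff), 2))
```

Both estimates are more than three standard errors from zero, and under the independence assumption they differ from each other by less than one standard error. So there is little statistical basis for taking one at face value and dismissing the other.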

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better: just pick one major outcome and go with it? But then you’re throwing away lots of data, and that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses—getting the p-value right, or getting the posterior distribution, or whatever, and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

What to do?

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how should one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And lots of graphs, which would allow us to see more in one place, that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.

Explanations for that shocking 2% shift


The title of this post says it all. A 2% shift in public opinion is not so large and usually would not be considered shocking. In this case the race was close enough that 2% was consequential.

Here’s the background:

Four years ago, Mitt Romney received 48% of the two-party vote and lost the presidential election. This year, polls showed Donald Trump with about 48% of the two-party vote. When the election came around, Trump ended up with nearly 50% of the two-party vote; according to the latest count, he lost the popular vote to Hillary Clinton by only 200,000 votes. Because of the way the votes were distributed in states, Trump won the electoral college and thus the presidency.

In this earlier post I graphed the Romney-Trump swing by state and also made this plot showing where Trump did better or worse than the polls in 2016:


Trump outperformed the polls in several key swing states and also in lots of states that were already solidly Republican.

The quantitative pundits

Various online poll aggregators were giving pre-election probabilities of a Clinton win ranging from 66% to 99%. These probabilities were high because Clinton had been leading in the polls for months; the probabilities were not 100% because it was recognized that the final polls might be off by quite a bit from the actual election outcome. Small differences in how the polls were averaged corresponded to large apparent differences in win probabilities; hence we argued that the forecasts that were appearing were not so different as they seemed based on those reported odds.

The final summary is that the polls were off by about 2% (or maybe 3%, depending on which poll averaging you’re using), which, again, is a real error of moderate size that happened to be highly consequential given the distribution of the votes in the states this year. Also we ignored correlations in some of our data, thus producing illusory precision in our inferences based on polls, early voting results, etc.

What happened?

Several explanations have been offered. It’s hard at this point to adjudicate among them, but I’ll share what thoughts I have:

– Differential voter turnout. It says here that voter turnout was up nearly 5% from the previous election. Voter turnout could have been particularly high among Trump-supporting demographic and geographic groups. Or maybe not! It says here that voter turnout was down this year. Either way, the story would be that turnout was high for Republicans relative to Democrats, compared to Obama’s elections.

– Last-minute change in vote. If you believe the exit poll, Trump led roughly 48%-41% among the 14% of voters who say they decided in the past week. That corresponds to a bump in Trump’s 2-party vote percentage of approximately .14*(.48-.41)/2 = 0.005, or 1/2 a percentage point. That’s something.

– Differential nonresponse. During the campaign we talked about the idea that swings in the polls mostly didn’t correspond to real changes in vote preferences but rather came from changes in nonresponse patterns: when there was good news for Trump, his supporters responded more to polls, and when there was good news for Clinton, her supporters were more likely to respond. But this left the question of where was the zero point.

When we analyzed a Florida poll last month, adjusting for party registration, we gave Trump +1 in the state, while the estimates from the others ranged from Clinton +1 to Clinton +4. That gives you most of the shift right there. This was just one poll so I didn’t take it too seriously at the time but maybe I should’ve.

– Trump supporters not admitting their support to pollsters. It’s possible, but I’m skeptical of this mattering too much, given that Trump outperformed the polls the most in states such as North Dakota and West Virginia where I assume respondents would’ve had little embarrassment in declaring their support for him, while he did no better than the polls’ predictions in solidly Democratic states. Also, Republican candidates outperformed expectations in the Senate races, which casts doubt on the model in which respondents would not admit they supported Trump; rather, the Senate results are consistent with differential nonresponse or unexpected turnout or opposition to Hillary Clinton.

– Third-party collapse. Final polls had Johnson at 5% of the vote. He actually got 3%, and it’s a reasonable guess that most of this 2% went to Trump.

– People dissuaded from voting because of long lines or various measures making it more difficult to vote. I have no idea how big or small this one is. This must matter a lot more in some states than in others.
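The back-of-the-envelope number in the last-minute-change item above can be reproduced in R. The first calculation is the approximation used in the text. The second is an alternative added here, assuming the right comparison is against an even two-party split among late deciders; it is a check on the order of magnitude, not a better estimate.

```r
late_share   <- 0.14   # share of voters who say they decided in the past week
trump_late   <- 0.48   # Trump's share among those late deciders
clinton_late <- 0.41   # Clinton's share among those late deciders

# Approximation from the text: half the late-decider margin, scaled by their share.
bump_approx <- late_share * (trump_late - clinton_late) / 2

# Alternative (an assumption, not from the text): Trump's two-party share among
# late deciders compared with an even 50/50 split.
trump_two_party_late <- trump_late / (trump_late + clinton_late)
bump_two_party <- late_share * (trump_two_party_late - 0.5)

# Both expressed in percentage points.
print(round(100 * c(approx = bump_approx, two_party = bump_two_party), 2))
```

Either way the bump comes out at about half a percentage point.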

I’m sure there are some other things I missed. Let me just emphasize that the goal in this exercise is to understand the different factors that were going on, not to identify one thing or another that could’ve flipped the election outcome. The election was so close that any number of things could’ve swung enough votes for that.

P.S. Two other parts of the story:

– Voter enthusiasm. The claim has been made that Trump’s supporters had more enthusiasm for their candidate. They were part of a movement (as with Obama 2008) in a way that was less so for Clinton’s supporters. That enthusiasm could transfer to unexpectedly high voter turnout, with the twist that this would be hard to capture in pre-election surveys if Trump’s supporters were, at the same time, less likely to respond to pollsters.

– The “ground game” and social media. One reason the election outcome came as a surprise is that we kept hearing stories about Hillary Clinton’s professional campaign and big get-out-the-vote operation, as compared to Donald Trump’s campaign which seemed focused on talk show appearances and twitter. But maybe the Trump campaign’s social media efforts were underestimated.

P.P.S. One more thing: I think one reason for the shock is that people are reacting not just to the conditional probability, Pr (Trump wins | Trump reaches Election Day with 48% of two-party support in the polls), but to the unconditional probability, Pr (Trump becomes president of the United States | our state of knowledge two years ago). That unconditional probability is very low. And I think a lot of the stunned reaction is in part that things got so far.

To use a poker analogy: if you’re drawing to an inside straight on the river, the odds are (typically) against you. But the real question is how you got to the final table of the WSOP in the first place.

A 2% swing: The poll-based forecast did fine (on average) in blue states; they blew it in the red states

The big story in yesterday’s election is that Donald Trump did about 2% better than predicted from the polls. Trump got 50% of the two-party vote (actually, according to the most recent count, Hillary Clinton won the popular vote, just barely) but was predicted to get only 48%.

First let’s compare the 2016 election to 2012, state by state:



Now let’s look at how the 2016 election turned out, compared to the polls:



This is interesting. In the blue states (those won by Obama in 2012), Trump did about as well as predicted from the polls (on average, but not in the key swing states of Pennsylvania and Florida). But in the red states, Trump did much better than predicted.

P.S. More here.

How effective (or counterproductive) is universal child care? Part 1

This is the first of a series of two posts.

We’ve talked before about various empirically-based claims of the effectiveness of early childhood intervention. In a much-publicized 2013 paper based on a study of 130 four-year-old children in Jamaica, Paul Gertler et al. claimed that a particular program caused a 42% increase in the participants’ earnings as young adults. (It was a longitudinal study, and these particular kids were followed up for 20 years.) At the time I expressed skepticism based on the usual reasons of the statistical significance filter, researcher degrees of freedom, and selection problems with the data.

A year later, Gertler et al. released an updated version of their paper, this time with the estimate downgraded to 25%. I never quite figured out how this happened, but I have to admit to being skeptical of the 25% number too.

One problem is that a lot of this research seems to be presented in propaganda form. For example:

From the published article: “A substantial literature shows that U.S. early childhood interventions have important long-term economic benefits.”

From the press release: “Results from the Jamaica study show substantially greater effects on earnings than similar programs in wealthier countries. Gertler said this suggests that early childhood interventions can create a substantial impact on a child’s future economic success in poor countries.”

These two quotes, taken together, imply that (a) these interventions have large and well-documented effects in the U.S., but (b) these effects are not as large as the 25% reported for the Jamaica study.

But how does that work? How large, exactly, were the “important long-term economic benefits”? An increase of 10% in earnings, perhaps? 15%? If so, then do they really have evidence that the Jamaica program had effects that were not only clearly greater than zero, but clearly greater than 10% or 15%?

I doubt it.

Rather, I suspect they’re trying to have it both ways, to simultaneously claim that their results are consistent with the literature and that they’re new and exciting.

I’m perfectly willing to believe that early childhood intervention can have large and beneficial effects, and that these effects could be even larger in Jamaica than in the United States. What I’m not convinced of is that this particular study offers the evidence that is claimed. I’m worried that the researchers are chasing noise. That is, it’s not clear to me how much they learned from this new experiment, beyond what they already knew (or thought they knew) from the literature.

This was the first of a series of two posts. Tune in tomorrow for part 2.

Election forecasting updating error: We ignored correlations in some of our data, thus producing illusory precision in our inferences

The election outcome is a surprise in that it contradicts two pieces of information: Pre-election polls and early-voting tallies. We knew that each of these indicators could be flawed (polls because of differential nonresponse; early-voting tallies because of extrapolation errors), but when the two pieces of evidence came to the same conclusion, they gave us a false feeling of near-certainty.

In retrospect, a key mistake in the forecast updating that Kremp and I did was that we ignored the correlation in the partial information from early-voting tallies. Our model had correlations between state-level forecasting errors (but maybe the corrs we used were still too low, hence giving us illusory precision in our national estimates), but we did not include any correlations at all in the errors from the early-voting estimates. That’s why our probability forecasts were, wrongly, so close to 100% (as here).
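Here is a minimal sketch in R of how ignoring correlation produces illusory precision. It is not the model Kremp and I used; it just averages two noisy estimates of the same vote share, and all the inputs (the standard deviations, the point estimate of 52%) are hypothetical. The function names are made up for this illustration.

```r
# Two noisy estimates of the same quantity (say, Clinton's two-party share).
# All inputs are hypothetical and chosen only to show the mechanism.
sd_polls <- 2    # assumed error sd of the poll-based estimate, percentage points
sd_early <- 2    # assumed error sd of the early-vote-based estimate
estimate <- 52   # assumed point estimate given by both sources

# Standard deviation of the simple average of the two estimates,
# as a function of the correlation between their errors.
sd_of_average <- function(rho) {
  sqrt((sd_polls^2 + sd_early^2 + 2 * rho * sd_polls * sd_early) / 4)
}

# Implied probability that the true share exceeds 50%, under a normal approximation.
win_prob <- function(rho) pnorm((estimate - 50) / sd_of_average(rho))

for (rho in c(0, 0.5, 0.9)) {
  cat(sprintf("rho = %.1f: sd = %.2f, Pr(share > 50) = %.2f\n",
              rho, sd_of_average(rho), win_prob(rho)))
}
```

Treating the errors as independent (rho = 0) gives the smallest standard deviation and the most confident probability. As the assumed correlation rises, the second source adds less and less information.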

What if NC is a tie and FL is a close win for Clinton?

On the TV they said that they were guessing that Clinton would win Florida in a close race and that North Carolina was too close to call.

Let’s run the numbers, Kremp:

> update_prob2(clinton_normal=list("NC"=c(50,2), "FL"=c(52,2)))
Pr(Clinton wins the electoral college) = 95%

That’s good news for Clinton.

What if both states are tied?

> update_prob2(clinton_normal=list("NC"=c(50,2), "FL"=c(50,2)))
Pr(Clinton wins the electoral college) = 90%

P.S. To be complete I should include all the states that were already called (KY, MA, etc.) but this would add essentially no information so I won’t bother.

OK, OK, just to illustrate:

> update_prob2(trump_states=c("KY","IN"), clinton_states=c("IL","MA"), clinton_normal=list("NC"=c(50,2), "FL"=c(50,2)))
Pr(Clinton wins the electoral college) = 90%

You see, no change.

P.P.S. What if Florida is close but Clinton loses there?

> update_prob2(trump_states=c("FL"), clinton_normal=list("NC"=c(50,2), "FL"=c(50,1)))
Pr(Clinton wins the electoral college) = 75%
[nsim = 37716; se = 0.2%]

Her chance goes down to 75%. Still better than Trump’s 25%.

P.P.P.S. And what if NC and FL are both close but Trump wins both?

> update_prob2(trump_states=c("NC","FL"), clinton_normal=list("NC"=c(50,1), "FL"=c(50,1)))
Pr(Clinton wins the electoral college) = 65%

Election updating software update

When going through Pierre-Antoine Kremp’s election forecasting updater program, we saw that it ran into difficulties when we started to supply information from lots of states. It was a problem with the program’s rejection sampling algorithm.

Kremp updated the program to allow an option where you could specify the winner in each state, and also give an estimate and standard deviation when you have some idea of the vote share.
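One way to use an estimate and a standard deviation without rejecting simulations is to weight each simulation by how well it agrees with the estimate. The sketch below illustrates that idea on a made-up two-state example. It is an assumption about how such an option could work, not a description of Kremp's implementation, and the state names, means, correlation, and observed estimate are all hypothetical.

```r
set.seed(4)
# Hypothetical prior simulations of Clinton's two-party share (%) in two states,
# drawn from a correlated bivariate normal. None of these numbers are from the real model.
nsim <- 50000
mu <- c(A = 51, B = 49)
sd_state <- 3
rho <- 0.7
Sigma <- sd_state^2 * matrix(c(1, rho, rho, 1), 2, 2)
sims <- matrix(rnorm(nsim * 2), nsim, 2) %*% chol(Sigma)  # rows have covariance Sigma
sims <- sweep(sims, 2, mu, "+")                           # shift each column to its mean
colnames(sims) <- names(mu)

# Early estimate for state A: 53% with standard deviation 2 (both assumed).
obs <- 53
obs_sd <- 2

# Weight each simulation by the normal density of the estimate given that simulation,
# instead of accepting or rejecting it outright.
w <- dnorm(obs, mean = sims[, "A"], sd = obs_sd)
w <- w / sum(w)

prior_prob_B   <- mean(sims[, "B"] > 50)       # Pr(Clinton wins B) before the estimate
updated_prob_B <- sum(w * (sims[, "B"] > 50))  # same probability after weighting
effective_n    <- 1 / sum(w^2)                 # rough effective number of simulations

print(round(c(prior = prior_prob_B, updated = updated_prob_B, effective_n = effective_n), 2))
```

Because no simulation is thrown away, adding more states lowers the effective sample size gradually rather than leaving zero accepted draws.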

Here’s an example, based on some out-of-date (from a few hours ago) estimates of Clinton getting 51.5% of the vote in Colorado, 51.5% in Florida, 52.7% in Iowa, 50.7% in Nevada, 52.2% in Ohio, 46.2% in Pennsylvania, and 56.7% in Wisconsin, with standard deviations of 2% in each case:

> update_prob2(clinton_normal = list("CO" = c(51.5, 2), "FL" = c(51.5, 2), "IA" = c(52.7, 2), "NV" = c(50.7, 2), "OH" = c(52.2, 2), "PA" = c(46.2, 2), "WI" = c(56.7, 2)))
[nsim = 100000; se = 0%]

Again, I don’t particularly trust those numbers. But, again, you can now play along and throw in as many states as you want in this way without worrying about the simulations crashing.

P.S. Kremp updated again. Go to his site, refresh it, download the new files on Github, and do some R and Stan!

Now that 7pm has come, what do we know?

(followup to this post)

On TV they said that Trump won Kentucky and Indiana (no surprise), Clinton won Vermont (really no surprise), but South Carolina, Georgia, and Virginia were too close to call.

I’ll run Pierre-Antoine Kremp’s program conditioning on this information, coding states that are “too close to call” as being somewhere between 45% and 55% of the two-party vote for each candidate:

> update_prob(trump_states = c("KY","IN"), clinton_states = c("VT"), clinton_scores_list=list("SC"=c(45,55), "GA"=c(45,55), "VA"=c(45,55)))
Pr(Clinton wins the electoral college) = 95%
[nsim = 65433; se = 0.1%]

Just a rough guess, still; obv this all depends on the polls-based model which was giving Clinton a 90% chance of winning before any votes were counted.

What might we know at 7pm?

To update our effort from 2008, let’s see what we might know when the first polls close.

At 7pm, the polls will be closed in the following states: KY, GA, IN, NH, SC, VT, VA.

Let’s list these in order of projected Trump/Clinton vote share: KY, IN, SC, GA, NH, VA, VT.

I’ll use Kremp’s updating program to compute Trump and Clinton’s probabilities of winning, under his model, for several different scenarios.

First, with no information except the pre-election polls:

> update_prob()
Pr(Clinton wins the electoral college) = 90%
[nsim = 100000; se = 0.1%]

Clinton has a 90% chance of winning.

Now let’s consider the best possible scenario for Trump at 7pm, in which he wins Kentucky, Indiana, South Carolina, Georgia, New Hampshire, and Virginia (but not Vermont, cos let’s get serious):

> update_prob(trump_states = c("KY","IN","SC","GA","NH","VA"), clinton_states = c("VT"))
Pr(Clinton wins the electoral college) = 2%
[nsim = 1340; se = 0.4%]

Next-best option for Trump, he wins all the states except Virginia and Vermont:

> update_prob(trump_states = c("KY","IN","SC","GA","NH"), clinton_states = c("VA","VT"))
Pr(Clinton wins the electoral college) = 28%
[nsim = 3856; se = 0.7%]

Most likely scenario, Trump wins Kentucky, Indiana, South Carolina, and Georgia, but loses New Hampshire, Virginia, and Vermont:

> update_prob(trump_states = c("KY","IN","SC","GA"), clinton_states = c("NH","VA","VT"))
Pr(Clinton wins the electoral college) = 93%
[nsim = 88609; se = 0.1%]

Or Trump just wins Kentucky, Indiana, and South Carolina:

> update_prob(trump_states = c("KY","IN","SC"), clinton_states = c("GA","NH","VA","VT"))
Pr(Clinton wins the electoral college) = 100%
[nsim = 5240; se = 0%]

P.S. Kremp writes:

If you remember, I have a polling error term in my forecast, so all polls can be off in any given state by the same amount. And these polling errors are correlated across states. I picked a 0.7 correlation—which may be a bit high. It made the model more conservative about Clinton’s chances, but today, it’s going to make it jump to conclusions when the first results come in.

Interesting. I’m not sure if Kremp’s model really is overreacting: I’d guess that errors across states will have a very high correlation. I guess we’ll see once all the data come in.
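To get a feel for what a 0.7 correlation does, here is a small R sketch of bivariate-normal conditioning for the polling errors in two states. The 0.7 is the value Kremp mentions; the error standard deviation and the observed miss are assumptions picked for illustration.

```r
# Bivariate-normal conditioning for polling errors in two states.
# rho = 0.7 is the value Kremp mentions; the other inputs are assumed.
rho <- 0.7
sd_error <- 2.5         # assumed polling-error sd in each state, percentage points
observed_error_1 <- 3   # hypothetical: state 1 comes in 3 points off its polls

# With equal sds, the expected error in state 2 is rho times the observed error,
# and the remaining uncertainty shrinks by a factor sqrt(1 - rho^2).
cond_mean_2 <- rho * observed_error_1
cond_sd_2   <- sd_error * sqrt(1 - rho^2)

print(round(c(cond_mean_2 = cond_mean_2, cond_sd_2 = cond_sd_2), 2))
```

So under these assumptions a 3-point miss in one state moves the expected miss in another state by about 2 points, which is why the first few results can shift the forecast so quickly.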

P.P.S. I forgot Kentucky in my first version of this post. Kentucky was never going to be close so including it does not change the numbers at all. But for completeness I updated the code.

Blogging the election at Slate

Slate blog is here. Feel free to place any of your comments at this blog right here.

Updating the Forecast on Election Night with R

Pierre-Antoine Kremp made this cool widget that takes his open-source election forecaster (it aggregates state and national polls using a Stan program that runs from R) and computes conditional probabilities.

Here’s the starting point, based on the pre-election polls and forecast information:


These results come from the fitted Stan model which gives simulations representing a joint posterior predictive distribution for the two-party vote shares in the 50 states.

But what happens as the election returns come in? Kremp wrote an R program that works as follows: He takes the simulations and approximates them by a multivariate normal distribution (not a bad approximation, I think, given that we’re not going to be using this procedure to estimate extreme tail probabilities; also remember that the 50 state outcomes are correlated, and that correlation is implicitly included in our model). Then, when he wants to condition on any outcome (for example, Trump winning Florida, New Hampshire, and North Carolina), the program can just draw a bunch of simulations from the multivariate normal, keep only the simulations that satisfy the condition, and compute an electoral college distribution from there.
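Here is a minimal sketch of that approximate-then-reject idea, in base R. It is not Kremp's code: the three states, their electoral votes, and the fake "posterior simulations" are all made up for illustration, and I draw from the normal using a Cholesky factor rather than the mvtnorm package.

```r
# Toy illustration only: 3 hypothetical states with made-up numbers,
# not Kremp's data or his update_prob() code.
set.seed(1)
states <- c("A", "B", "C")
ev     <- c(A = 10, B = 6, C = 4)            # hypothetical electoral votes

# Stand-in for posterior simulations of Clinton's two-party share (rows = draws)
national_swing <- rnorm(4000, 0, 0.02)       # shared error makes the states correlated
sims <- sapply(c(A = 0.52, B = 0.50, C = 0.48),
               function(m) m + national_swing + rnorm(4000, 0, 0.02))

# Step 1: approximate the simulations by a multivariate normal
mu    <- colMeans(sims)
Sigma <- cov(sims)

# Step 2: draw from that normal (base R, via the Cholesky factor of Sigma)
n_draw <- 100000
z      <- matrix(rnorm(n_draw * length(mu)), n_draw, length(mu))
draws  <- sweep(z %*% chol(Sigma), 2, mu, "+")
colnames(draws) <- states

# Step 3: keep only the draws that satisfy the condition (here: Trump wins state C)
keep <- draws[, "C"] < 0.5
cond <- draws[keep, , drop = FALSE]

# Step 4: tally electoral votes in each draw, before and after conditioning
ev_total <- function(d) as.vector((1 * (d > 0.5)) %*% ev[states])
p_uncond <- mean(ev_total(draws) > sum(ev) / 2)
p_cond   <- mean(ev_total(cond)  > sum(ev) / 2)
c(unconditional = p_uncond, conditional = p_cond, kept = nrow(cond))
```

Because the toy states share a national swing, conditioning on a Trump win in state C also pulls Clinton's simulated shares down in A and B, which is the same mechanism as in the real forecast.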

Hey, let’s try it out! It’s all in R:

> source("update_prob.R")
> update_prob()
Pr(Clinton wins the electoral college) = 90%
[nsim = 100000; se = 0.1%]
> update_prob(trump_states = c("FL", "NH", "NC"))
Pr(Clinton wins the electoral college) = 4%
[nsim = 3284; se = 0.3%]

OK, if Trump wins Florida, New Hampshire, and North Carolina, then Clinton’s toast. This isn’t just from the “electoral college math”; it’s also cos the votes in different states are correlated in the predictive distribution; thus Trump winning these 3 states is not just a bunch of key electoral votes for him, it would also be an indicator that he’s doing much better than expected nationally.

What if Trump wins Florida and North Carolina but loses New Hampshire?

> update_prob(trump_states = c("FL", "NC"), clinton_states=c("NH"))
Pr(Clinton wins the electoral college) = 58%
[nsim = 15582; se = 0.4%]

Then it could go either way.
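A side note on the "se" in that output. I haven't checked how update_prob() computes it, so treat this as an assumption, but the printed values are consistent with the usual binomial standard error of a simulation-based proportion, sqrt(p(1-p)/nsim). The function name sim_se below is mine, not Kremp's:

```r
# Binomial standard error of a probability estimated from nsim accepted draws.
# Assumption: this matches what update_prob() reports; it is not taken from its source.
sim_se <- function(p, nsim) sqrt(p * (1 - p) / nsim)

se_pct <- c(
  unconditional = sim_se(0.90, 100000),  # no conditions
  trump_FL_NH_NC = sim_se(0.04, 3284),   # Trump wins FL, NH, NC
  trump_FL_NC    = sim_se(0.58, 15582)   # Trump wins FL, NC; Clinton wins NH
)
round(100 * se_pct, 1)
```

The more states you condition on, the fewer draws survive the rejection step, so nsim drops and the standard error goes up.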

Hmmm, let’s try some more things. Slate’s “Votecastr” project says here that “Based on the 1.66 million early votes VoteCastr has run through its model, Clinton leads Trump by 2.7 points, 46.3 percent to 43.6 percent.” That’s .463/(.463+.436) = 51.5% of the 2-party vote.

Let’s take this as a guess of the outcome in Colorado, with an uncertainty of +/- 2 percentage points, so that Clinton gets between 49.5% and 53.5% of the two-party vote:

> update_prob(clinton_scores_list = list("CO" = c(49.5, 53.5)))
Pr(Clinton wins the electoral college) = 88%
[nsim = 73403; se = 0.1%]

That barely moves the forecast: she’s expected to get around 53% of the vote in Colorado so a close vote in that state would, by itself, not convey much additional information. (The computations in Kremp’s program are not saved; that is, my inference just above does not condition on my earlier supposition of Trump winning Florida, New Hampshire, and North Carolina.)

Hey, here’s some news! It says that, based on the early vote in Florida, Clinton leads Trump 1,780,573 votes to 1,678,848. Early vote isn’t the same as total vote but let’s take it as a starting point, so that’s a 2-party vote share of 1780573/(1780573 + 1678848) = .515 for Clinton in Florida. Let’s put that in the program too, again assuming it could be off by 2 percentage points in either direction:

> update_prob(clinton_scores_list = list("CO" = c(49.5, 53.5), "FL" = c(49.5, 53.5)))
Pr(Clinton wins the electoral college) = 96%
[nsim = 57108; se = 0.1%]

Hey, this looks like good news for Hillary.

Now let’s feed in some more information from that same page:

Iowa: Clinton 273,188, Trump 244,739. Clinton share 273188/(273188 + 244739) = 52.7%

Nevada: Clinton 276,461, Trump 269,255. Clinton share 50.7%

Ohio: Clinton 632,433, Trump 579,916. Clinton share 52.2%

Pennsylvania: Clinton 85,367, Trump 99,286. Clinton share 46.2%

Wisconsin: Clinton 295,302, Trump 225,281. Clinton share 56.7%
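The two-party shares above are just Clinton / (Clinton + Trump). A quick way to compute them all at once from the early-vote counts quoted in this post (the vector names are mine):

```r
# Early-vote counts as quoted above
clinton <- c(FL = 1780573, IA = 273188, NV = 276461, OH = 632433, PA = 85367, WI = 295302)
trump   <- c(FL = 1678848, IA = 244739, NV = 269255, OH = 579916, PA = 99286, WI = 225281)

# Clinton's share of the two-party vote, in percent, rounded to one decimal
share <- round(100 * clinton / (clinton + trump), 1)
share
```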

Hmmm . . . I don’t really trust those Pennsylvania numbers, based on such a small fraction of the vote. Wisconsin also seems a bit extreme. But let’s just run it and see what we get:

> update_prob(clinton_scores_list = list("CO" = 51.5+c(-2,2), "FL" = 51.5+c(-2,2), "IA" = 52.7+c(-2,2), "NV" = 50.7+c(-2,2), "OH" = 52.2+c(-2,2), "PA" = 46.2+c(-2,2), "WI" = 56.7+c(-2,2)))
Error in draw_samples(clinton_states = clinton_states, trump_states = trump_states,  : 
  rmvnorm() is working hard... but more than 99.99% of the samples are rejected; you should relax some contraints.

That’s kind of annoying; I’d rather have a more graceful message. But the short version is that there are a lot of constraints here, and some of them are inconsistent with the model, so the rejection sampling algorithm is failing.

Let’s redo, making these +/- 4% rather than +/- 2% for each state. For convenience I’ll write a function:

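partial_outcomes <- function(states, clinton_guess, error_range){
  # Build the named list that update_prob() takes as clinton_scores_list:
  # one c(lower, upper) interval per state, centered on the guessed Clinton share.
  n <- length(states)
  output <- as.list(rep(NA, n))
  names(output) <- states
  for (i in seq_len(n)){
    output[[i]] <- clinton_guess[i] + error_range*c(-1,1)
  }
  return(output)
}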

Now I'll run it:

> update_prob(clinton_scores_list = partial_outcomes(c("CO", "FL", "IA", "NV", "OH", "PA", "WI"), c(51.5, 51.5, 52.7, 50.7, 52.2, 46.2, 56.7), 4))
Error in draw_samples(clinton_states = clinton_states, trump_states = trump_states,  : 
  rmvnorm() is working hard... but more than 99.99% of the samples are rejected; you should relax some contraints.

Still blows up. OK, let's remove Pennsylvania and Wisconsin:

> update_prob(clinton_scores_list = partial_outcomes(c("CO", "FL", "IA", "NV", "OH"), c(51.5, 51.5, 52.7, 50.7, 52.2), 4))
Pr(Clinton wins the electoral college) = 100%
[nsim = 16460; se = 0%]

And here are the projected results for all 50 states, under these assumptions:

> update_prob(clinton_scores_list = partial_outcomes(c("CO", "FL", "IA", "NV", "OH"), c(51.5, 51.5, 52.7, 50.7, 52.2), 4), show_all_states = TRUE)
Pr(Clinton wins) by state, in %:
[1,]  0  0  0 27 100 100 100 100 96 17 100 21  0 100  0  0  0  0 100 100 100 100 100  0  0  0 94  0  0
[1,] 100 100 100 93 100 66  0 100 100 100  4  0  0  0  0 100 100 100 100  0  0 100  95 100   0  19   0
Pr(Clinton wins the electoral college) = 100%
[nsim = 16523; se = 0%]

OK, you get the idea. If you think these early voting numbers are predictive, Clinton's in good shape.

How to run the program yourself

Just follow Kremp's instructions (I fixed one typo here):

To use update_prob(), you only need 2 files, available from my [Kremp's] GitHub repository:

- update_prob.R, which loads the data and defines the update_prob() function,

- last_sim.RData, which contains 4,000 simulations from the posterior distribution of the last model update.

Put the files into your working directory or use setwd().

If you don’t already have the mvtnorm package installed, you can do it by typing install.packages("mvtnorm") in the console.

To create the functions and start playing, type source("update_prob.R"), and the update_prob() function will be in your global environment.

The function accepts the following arguments:

- clinton_states: a character vector of states already called for Clinton;

- trump_states: a character vector of states already called for Trump;

- clinton_scores_list: a list of elements named with 2-letter state abbreviations; each element should be a numeric vector of length 2 containing the lower and upper bound of the interval in which Clinton’s share of the Clinton + Trump score is expected to fall;

- target_nsim: an integer indicating the minimum number of samples that should be drawn from the conditional distribution (set to 1000 by default).

- show_all_states: a logical value indicating whether to output the state by state expected win probabilities (set to FALSE by default).

It really works (as demonstrated by my examples above).

Running the program

Kremp just whipped up this program in a couple hours and it's pretty good. Had we had more time, we would've built some sort of online app for it. Also, on the technical level, the rejection sampling is crude, and as you start getting information from more and more states, the program breaks down.

The funny thing is, the whole thing would be trivial to implement in Stan! Had I thought of this yesterday, I could've already done it. Even now, maybe there's time.

But, for now, you can run the script as is, and it will work given data for a few states.

Recently in the sister blog and elsewhere

Why it can be rational to vote (see also this by Robert Wiblin, “Why the hour you spend voting is the most socially impactful of all”)

Be skeptical when polls show the presidential race swinging wildly

The polls of the future will be reproducible and open source

Testing the role of convergence in language acquisition, with implications for creole genesis

Also I’m supposed to be blogging on the election at Slate later today, but I’m not quite sure what the link is for that.

P.S. I’m live-blogging now at Slate; it’s here.

“Another terrible plot”

Till Hoffman sent me an email with the above subject line and the following content:

These plots from the Daily Mail in the UK probably belong in your hall of fame of terrible visualisations:

I was gonna click on this, but then I thought . . . the Daily Mail? Even I have limits on how far I will go to waste my time. So I did not click. I recommend you don’t either.

What is the chance that your vote will decide the election? Ask Stan!

I was impressed by Pierre-Antoine Kremp’s open-source poll aggregator and election forecaster (all in R and Stan with an automatic data feed!) so I wrote to Kremp:

I was thinking it could be fun to compute probability of decisive vote by state, as in this paper. This can be done with some not difficult but not trivial manipulations of your simulations. Attached is some code from several years ago. I’ll have to remember exactly what all the steps were but I don’t think it will be hard to figure this all out. Are you interested in doing this? It would be fun, and we could get it out there right away.

And he did it! We went back and forth a bit on the graphs and he ended up with this map:


Best places to be a voter, in terms of Pr(decisive), are New Hampshire and Colorado (in either state the probability that your vote determines the election is 1 in a million); Nevada, Wisconsin, or Pennsylvania (1 in 2 million); Michigan, New Mexico, North Carolina, or Florida (1 in 3 million); or Maine (1 in 5 million). At the bottom of the list are a bunch of states like Maryland, Vermont, Idaho, Wyoming, and Oklahoma where you can forget about your vote making any difference in the electoral college.

(That said, I’d recommend voting for president even in those non-swing states because your vote can still determine who will win in the popular vote, or it might be enough to cause a change in the rounded popular vote, for example changing the outcome from 50-50 (to the nearest percentage point) to 51-49. Or enough to make the vote margin in 2016 exceed Obama’s margin over Romney in 2012. Any of these can affect perceptions of legitimacy and mandates, which could be a big deal in the election’s aftermath.)

We also made a graph similar to the one from our paper from the 2008 election, decomposing the probability of decisive vote as:

Pr (your vote is decisive) = Pr (your state’s electoral votes are necessary for the winning candidate) * Pr (the vote in your state is tied | your state’s electoral votes are necessary)

We ignore the thing with Maine and Nebraska possibly splitting their electoral votes, and we assign DC’s 3 votes to the Democrats.
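To make the decomposition concrete, here is a toy sketch in R. Everything in it is hypothetical (three made-up states, made-up turnout, fake forecast simulations), and it is not the code we used for the map. The tie probability is approximated in the usual way, as the density of the state's vote share at 0.5 divided by the number of voters, with a normal fit to the simulations in which the state is necessary.

```r
# Toy sketch with made-up numbers: 3 hypothetical states, not the real forecast.
set.seed(2)
n_sims <- 20000
ev     <- c(A = 10, B = 6, C = 5)         # electoral votes (odd total, so no ties)
voters <- c(A = 5e6, B = 3e6, C = 2e6)    # assumed turnout in each state
need   <- floor(sum(ev) / 2) + 1          # electoral votes needed to win

# Stand-in for forecast simulations of the Democratic two-party share
swing <- rnorm(n_sims, 0, 0.02)           # shared national error
share <- sapply(c(A = 0.52, B = 0.50, C = 0.47),
                function(m) m + swing + rnorm(n_sims, 0, 0.02))

pr <- sapply(names(ev), function(s) {
  others    <- setdiff(names(ev), s)
  ev_others <- as.vector((1 * (share[, others, drop = FALSE] > 0.5)) %*% ev[others])
  # State s is necessary when the Democrat wins with it and loses without it
  necessary <- ev_others < need & ev_others + ev[[s]] >= need
  # Pr(exact tie | necessary): normal density of the share at 0.5, per voter
  x     <- share[necessary, s]
  p_tie <- dnorm(0.5, mean(x), sd(x)) / voters[[s]]
  c(necessary = mean(necessary), tied = p_tie, decisive = mean(necessary) * p_tie)
})
signif(pr, 2)
```

Each column of pr is a state; the bottom row is the product of the two rows above it, matching the formula.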

Here’s the graph showing Pr (your state’s electoral votes are necessary) vs. Pr (the vote in your state is tied | your state’s electoral votes are necessary):


The diagonals are iso-lines for constant Pr(decisive), so you see the swing states on the upper right of the graph and the less close states on the left, in the 1-in-a-billion range and below.

Despite these probabilities being low, I do think it can be rational to vote, for the reasons discussed in this paper (or more briefly here, or even more briefly in my article scheduled to appear today in Slate); see also my comment here for clarification of some common points of confusion on this issue.

P.S. By the way, and speaking of reproducible research, I’m really glad that I had the code that I used to do those calculations back in 2008, and also that I’d written it up as a paper. It would’ve been a bit of work to reconstruct the calculations with only the code or only the written description. But with both available, it was a piece of cake.

They say it because it’s true . . .

. . . We really do have the best comment section on the internet.

Different election forecasts not so different

Yeah, I know, I need to work some on the clickbait titles . . .

Anyway, people keep asking me why different election forecasts are so different. At the time of this writing, Nate Silver gives Clinton a 66.2% [ugh! See Pedants Corner below] chance of winning the election while Drew Linzer, for example, gives her an 87% chance.

So . . . whassup? In this post from last week, we discussed some of the incentives operating for Nate and other forecasters.

Here I want to talk briefly about the math. Or, I should say, the probability theory. The short story is that small differences in the forecast map to apparently large differences in probabilities. As a result, what look like big disagreements (66% compared to 87%!) don’t mean as much as you might think.

One way to see this is to look, not at the probabilities of each winning but at forecast vote share.

Here’s Nate:




And here’s Pierre-Antoine Kremp’s open-source version of Linzer’s model:


It’s hard to see at this level of resolution but Pierre’s forecast gives Clinton 52.5% of the two-party vote, which is not far from (.476/(.476+.423) = 52.9% of the two-party vote) and Nate Silver (.485/(.485+.454) = 51.7% of the two-party vote).

That’s right: Nate and the others differ by about 1% in their forecasts of Clinton’s vote share. 1% isn’t a lot, it’s well within any margin of error even after you’ve averaged tons of polls, because nonsampling error doesn’t average to zero.

So, argue about these different forecasts all you want, but from the standpoint of evidence they’re not nearly as different as they look on the probability scale.
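Here's a rough way to see this on the probability scale. Suppose, purely for illustration, that each forecast is a normal distribution for Clinton's two-party share with a standard deviation of 3 percentage points, and that winning means getting more than 50%. Both of those are simplifying assumptions of mine (the real forecasts go through the electoral college and have their own uncertainties), but they show how a difference of about one point in share can turn into a double-digit gap in win probability:

```r
# Forecast two-party vote shares quoted above
forecast_share <- c(0.517, 0.525, 0.529)

# Assumed (not taken from any of the forecasters): 3-point forecast sd
forecast_sd <- 0.03

# Pr(share > 0.5) under a normal forecast distribution
win_prob <- pnorm((forecast_share - 0.5) / forecast_sd)
round(100 * win_prob)
```

Change forecast_sd a bit and the probabilities move again, which is another reason the headline numbers can look more different than the underlying forecasts are.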

To put it another way: suppose the election happens and Hillary Clinton receives 52% of the two-party vote. Or 51%. Or 53%. It’s not like then we’ll be able to adjudicate between the different forecasts and say, Nate was right or Drew was right or whatever. And we can’t get much out of using the 50 state outcomes as a calibration exercise. They’re just too damn correlated.

P.S. All these poll aggregators have been jumping around because of differential nonresponse. If you use polls’ reported summaries as your input, as all these methods do, you can’t avoid this problem. The way to smooth out these jumps is to adjust for the partisan composition of the surveys.

Pedants Corner: As I discussed a few years ago, reporting probabilities such as “66.2%” is just ridiculous in this context. It’s innumerate, as if you went around saying that you weigh 143.217 pounds. When this point came up in the 2012 election, a commenter wrote, “Based on what Nate Silver said in his Daily Show appearance I surmise he’d be one of the first to agree that it’s completely silly to expect daily commentary on decimal-point shifts in model projections to be meaningful…. yet it seems to be that his deal with NYT requires producing such commentary.” But I guess that’s not the case because Nate now has his own site and he still reports his probabilities to this meaningless precision. No big deal but I’m 66.24326382364829790019279981238980080123% sure he’s doing the wrong thing. And I say this as an occasional collaborator of Nate who respects him a lot.

Why I prefer 50% rather than 95% intervals

I prefer 50% to 95% intervals for 3 reasons:

1. Computational stability,

2. More intuitive evaluation (half the 50% intervals should contain the true value),

3. A sense that in applications it’s best to get a sense of where the parameters and predicted values will be, not to attempt an unrealistic near-certainty.

This came up on the Stan list the other day, and Bob Carpenter added:

I used to try to validate with 95% intervals, but it was too hard because there weren’t enough cases that got excluded and you never knew what to do if 4 cases out of 30 were outside the 95% intervals.

(3) is a two-edged sword because I think people will be inclined to “read” the 50% intervals as 95% intervals out of habit, expecting higher coverage than they have. But I like the point about not trying to convey an unrealistic near-certainty (which is exactly how I think people look at 95% intervals because of the p-value convention at .05).
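Bob's validation point can be checked with a couple of binomial calculations. This is just a sketch under the assumption that the 30 cases are independent and the intervals are perfectly calibrated:

```r
n_cases <- 30

# Expected number of cases falling outside each interval if the model is calibrated
expected_outside <- c("95%" = 0.05, "50%" = 0.50) * n_cases

# Chance of seeing 4 or more misses out of 30 from well-calibrated 95% intervals
p_4_or_more <- 1 - pbinom(3, n_cases, 0.05)

# Central 95% range for the number of cases inside well-calibrated 50% intervals
range_inside_50 <- qbinom(c(0.025, 0.975), n_cases, 0.5)

expected_outside
round(p_4_or_more, 3)
range_inside_50
```

With 95% intervals you expect only 1 or 2 misses, so 4 misses is an ambiguous signal. With 50% intervals about half the cases should fall inside, and there's a reasonably tight range of counts that counts as "fine."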

And remember to call them uncertainty intervals.

Modeling statewide presidential election votes through 2028

David Leonhardt of the NYT asked a bunch of different people, including me, which of various Romney-won states in 2012 would be likely to be won by a Democrat in 2020, 2024, or 2028, and which of various Obama-won states would go for a Republican in any of those future years.

If I’m going to do this at all, I’ll do it by fitting a model. And this seemed like a good opportunity for me to learn to fit some time series in Stan.

I decided to fit a Gaussian process model (following the lead of Aki in the birthday problem) with separate time series for each state and each region, with the country partitioned into 10 regions that I set up to roughly cluster the states based on past voting trends. For the national level time trend I just assumed independent draws from a common distribution because the results jump around a lot from year to year. And I fit to data from 1976, just cos Yair happened to send me a state-by-state electoral dataset from 1976, but also because 1972 and before looked different enough that it didn’t really seem to make so much sense to include those early years in the analysis.

For Gaussian processes it can be tricky to estimate length-scale parameters without including some regularization. In this case I played around with a few options and ended up modeling each state and each region as the sum of two Gaussian processes, which meant I needed short and long length scales. After some fitting, I ended up just hard-coding these at 8 and 3, that is, 32 years and 12 years. The long scale is there to allow low-frequency trends which will stop the model from automatically regressing to the mean when extrapolated into the future; the short scale fits the curves you can see in the data.
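To show what a sum of two Gaussian processes with length scales 8 and 3 looks like, here is a small R sketch that builds the covariance matrix over the 14 elections from 1976 to 2028 and draws one curve from the prior. The squared-exponential form, the amplitudes (0.05 and 0.02), and the jitter are my assumptions for illustration; they are not copied from the Stan program, whose parameterization may differ.

```r
# Squared-exponential covariance on a vector of time points
sq_exp <- function(t, amplitude, length_scale) {
  d <- outer(t, t, "-")
  amplitude^2 * exp(-0.5 * (d / length_scale)^2)
}

years <- seq(1976, 2028, by = 4)
t     <- seq_along(years)             # time measured in 4-year election cycles

# Sum of a long-scale and a short-scale process (amplitudes are made up),
# plus a tiny diagonal jitter so the Cholesky factorization is stable
K <- sq_exp(t, amplitude = 0.05, length_scale = 8) +
     sq_exp(t, amplitude = 0.02, length_scale = 3) +
     diag(1e-8, length(t))

# One draw from the zero-mean GP prior: a state's deviation over time
set.seed(3)
draw <- as.vector(t(chol(K)) %*% rnorm(length(t)))
names(draw) <- years
round(draw, 3)
```

The long-scale term produces slow drifts that persist when you extrapolate; the short-scale term adds the wiggles from one election to the next.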

I’ll first graph for you the data (1976-2012) and future 80% predictive intervals. The intervals for 2016 come from Pierre-Antoine Kremp’s open-source Stan-based forecast from last week; the intervals for 2020-2028 come from simulations of our fitted model in the generated quantities block of our Stan program.

I have a bit more to say about the model. But, first, here are the results:


The forecast intervals are wide because just about anything could happen with national swings. Recall that our data includes the 1984 Reagan landslide, on one extreme, and some solid victories for Bill Clinton and Barack Obama on the other. That’s all fine—a forecast has its uncertainty, after all—but it makes it hard to see much from these forecasts.

So I made a new set of graphs showing each state relative to the national average popular vote. To calculate the popular vote for future elections I simply plugged in the state-by-state two-party vote from 2012, which isn’t perfect but will do the job for this purpose. And here’s what we got: