I’ve written a lot on polls and elections (“a poll is a snapshot, not a forecast,” etc., or see here for a more technical paper with Kari Lock) but had a few things to add in light of Sam Wang’s recent efforts. As a biologist with a physics degree, Wang brings an outsider’s perspective to political forecasting, which can be a good thing. (I’m a bit of an outsider to political science myself, as is my sometime collaborator Nate Silver, who’s done a lot of good work in the past few years.)
But there are two places where Wang misses the point, I think.
He refers to his method as a “transparent, low-assumption calculation” and compares it favorably to “fancy modeling” and “assumption-laden models.” Assumptions are a bad thing, right? Well, no, I don’t think so. Bad assumptions are a bad thing. Good assumptions are just fine. Similarly for fancy modeling. I don’t see why a model should get credit for not including a factor that might be important.
Let me clarify. If a simple model with only a couple of variables does as well, or almost as well, as a complicated effort that includes a lot more information, then, sure, that’s interesting and suggests that all that extra modeling isn’t getting you much. Fine. But I don’t see that there’s anything wrong with putting in that additional info. In the elections context, it might not change your national forecasts much but it might help in individual districts.
Or, as Radford Neal put it in one of my favorite statistics quotes of all time:
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.
Wang’s other mistake, I think, is his claim that “Pollsters sample voters with no average bias. Their errors are small enough that in large numbers, their accuracy approaches perfect sampling of real voting.” This is a bit simplistic, no? Nonresponse rates are huge and pollsters make all sorts of adjustments. In non-volatile settings such as a national general election, they can do a pretty good job with all these adjustments, but it’s hardly a simple case of unbiased sampling.
Finally, I was interested in Wang’s claim that Nate’s estimates overstated their uncertainty. This would be interesting, if true, given the huge literature on overconfidence in forecasting.
Yang Hu gathered Nate’s latest probability forecasts for me, along with the election outcomes, and I checked the calibration as follows. I divided the House elections into bins: those where Nate gave a 0-10% chance of the Republican winning, a 10-20% chance, etc., all the way up to 90-100%. For each of the ten categories, I counted the number of elections in that bin and the proportion in which the Republicans actually won.
If the forecasts are perfectly calibrated, we’d expect to see the empirical frequency of Republican wins go smoothly from 0 to 1 as the bins go from forecast probability 0 to 1. If the forecasts are overconfident, we’d expect empirical frequencies closer to 0.5 than the forecasts; if they are underconfident (as Wang alleged), we’d expect empirical frequencies more extreme than the forecasts, closer to 0 and 1.
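In case it’s useful, here’s a minimal sketch of that binning calculation in Python. This isn’t the code we actually used; the arrays probs and won are placeholders (random data, just so the snippet runs), to be replaced by the real forecast probabilities and 0/1 outcomes.

```python
import numpy as np

def calibration_table(probs, won, n_bins=10):
    """Bin forecast probabilities; report count and empirical win frequency per bin."""
    probs = np.asarray(probs, dtype=float)
    won = np.asarray(won, dtype=float)
    # Bin index 0..n_bins-1; a forecast of exactly 1 goes in the top bin
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        n = int(mask.sum())
        freq = won[mask].mean() if n else float("nan")
        rows.append((b / n_bins, (b + 1) / n_bins, n, freq))
    return rows

# Placeholder data so the sketch runs; substitute the real forecasts and outcomes
rng = np.random.default_rng(1)
probs = rng.uniform(0, 1, 435)                      # hypothetical forecast probabilities
won = (rng.uniform(0, 1, 435) < probs).astype(int)  # hypothetical outcomes

for lo, hi, n, freq in calibration_table(probs, won):
    print(f"{lo:.0%}-{hi:.0%}: {n:4d} races, empirical freq {freq:.2f}")
```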
Here’s what we actually found:
Forecast R win prob   #cases   Empirical R win freq
  0-10%                  165                   0.01
 10-20%                   11                   0.27
 20-30%                    5                   0.20
 30-40%                   11                   0.27
 40-50%                    6                   0.50
 50-60%                   12                   0.67
 60-70%                   10                   0.90
 70-80%                   10                   1.00
 80-90%                   19                   1.00
 90-100%                 186                   1.00
So, yes, Nate’s forecasts do seem underconfident! Out of the 39 races where he gave the Republican candidate between a 60% and 90% chance of winning, the Republicans won 38. Apparently he could’ve tightened up his predictions a lot. Wang appears to be correct.
(Since the original posting, I updated the above table to include all the election results.)
But . . . before jumping on Nate here, I’d suggest some caution. As we all know, district-by-district outcomes are highly correlated. And if you suppose that the national swing as forecast was off by 3 percentage points in either direction, you get something close to calibration.
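Here’s one way to see this, as a hedged sketch rather than a reconstruction of Nate’s actual model. Suppose each race forecast is (implicitly) a normal distribution over the Republican margin with some race-level standard deviation; the value of 6 points below is my own made-up number. Back out the implied mean margin from each win probability, shift it by a uniform national swing of plus or minus 3 points, and recompute the probabilities (reusing calibration_table from the sketch above):

```python
from scipy.stats import norm
import numpy as np

SIGMA = 6.0  # hypothetical race-level forecast sd, in percentage points of margin

def shifted_probs(probs, swing, sigma=SIGMA):
    """Shift each race's implied mean margin by a uniform national swing."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-4, 1 - 1e-4)
    margin = sigma * norm.ppf(p)        # implied mean R-minus-D margin
    return norm.cdf((margin + swing) / sigma)

# How would the same outcomes calibrate against swing-shifted forecasts?
for swing in (-3.0, +3.0):
    print(f"national swing {swing:+.0f} points:")
    for lo, hi, n, freq in calibration_table(shifted_probs(probs, swing), won):
        print(f"  {lo:.0%}-{hi:.0%}: {n:4d} races, empirical freq {freq:.2f}")
```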
To put it another way: In statistics, we’re always looking for 95% confidence intervals. In experimental physics, I wouldn’t be surprised if 99% is the standard. But what does a 95% interval mean in political terms? Midterm elections occur every four years, so if you want an interval that is correct 19 times out of 20, you have to account for 80 years of contingencies. And nobody would consider fitting their models to 80-year-old election data, at least not without a lot of adjustment.
So, to put it another way: if you really want 95% intervals and true calibration, you’ll need uncertainties that are wide enough so that, most of the time, you’re gonna look overconfident. I don’t see any easy answer here, but it’s an issue which, as a Bayesian election modeler, I’ve been aware of for a while. Usually I just take whatever model probabilities are given to me and go from there, without trying to think too hard about their calibration. That is, I’ll either take wide Silver-style intervals and treat them as all-encompassing forecasts, or I’ll take narrow Wang-style intervals and treat them as conditional on a model.
It’s ironic that Wang characterizes his method as less assumption-laden than Nate’s. It’s simpler and more transparent, and I agree that these are virtues, but ultimately I think it’s more model-based, in the sense that one has to rely strongly on a model to map Wang’s poll averages to election predictions. That’s fine (I love models); I just think there’s room for endless confusion when “assumption-laden” is used as a putdown.
Full disclosure: I (among others) gave Nate a few suggestions on combining information for his forecasting model. But the model and the effort behind it are Nate’s own.
P.S. Just for laffs, we also evaluated the calibration of another set of forecasts. The Huffington Post gave probability forecasts for only 118 battleground races, which I augmented by taking all of Nate’s essentially certain races (probabilities less than .02 or more than .98) and counting them as certain for Huffington Post as well. Here’s the calibration summary for Huffington:
Forecast R win prob   #cases   Empirical R win freq
  0-10%                  130                   0.01
 10-20%                   24                   0.29
 20-30%                    7                   0.43
 30-40%                    6                   0.17
 40-50%                    4                   0.75
 50-60%                   10                   0.60
 60-70%                    7                   0.86
 70-80%                   11                   0.90
 80-90%                   25                   1.00
 90-100%                 142                   1.00
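The augmentation above could be coded up along these lines; again a sketch with made-up placeholder data (names like nate_probs and huffpo_probs are hypothetical), reusing calibration_table from before:

```python
import numpy as np

# Placeholder data: Nate's probabilities and outcomes for all 435 races,
# plus Huffington Post forecasts for 118 battleground races (all hypothetical)
rng = np.random.default_rng(2)
nate_probs = rng.uniform(0, 1, 435)
won = (rng.uniform(0, 1, 435) < nate_probs).astype(int)
battleground = rng.choice(435, size=118, replace=False)
huffpo_probs = rng.uniform(0, 1, 118)

# Treat Nate's near-certain races (p < .02 or p > .98) as certain for HuffPo too
near_certain = (nate_probs < 0.02) | (nate_probs > 0.98)
huffpo_all_probs = np.concatenate([huffpo_probs, np.round(nate_probs[near_certain])])
huffpo_all_won = np.concatenate([won[battleground], won[near_certain]])

for lo, hi, n, freq in calibration_table(huffpo_all_probs, huffpo_all_won):
    print(f"{lo:.0%}-{hi:.0%}: {n:4d} races, empirical freq {freq:.2f}")
```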
P.P.S. Sorry about the ugly formatting. Serves me right for using tables, I suppose.