Commenter BorisG asked why I made my pretty maps using pre-election polls rather than exit polls. I responded in the comments but then I did a search and noticed that Kos had a longer discussion of this point on his blog. So I thought maybe it would help to discuss this all further here.
To start with, I appreciate the careful scrutiny. One of my pet peeves is people assuming a number or graph is correct, just because it has been asserted. BorisG and Kos and others are doing a useful service by subjecting my maps to criticism.
Several issues are involved:
– Data availability;
– Problems with pre-election polls;
– Problems with exit polls;
– Differences between raw data and my best model-assisted estimates;
– Thresholding at 50%;
– Small sample sizes in some states.
I’ll discuss each of these issues in turn.
Data availability. The exit polls became available on election night at CNN.com, and I used them extensively when I stayed up all night in Chicago crunching numbers. But I didn’t–and still don’t–have the raw exit poll data. I’ve asked and asked, but with no success yet.
Meanwhile, there were lots of pre-election polls, but only a few weeks ago was I able to get any raw data from them, when, at my request, the people at Pew graciously posted their data on their website. So I used what was available.
Kos wrote, “Pew didn’t break down those categories by race — Gelman filled in the blanks with statistical sleight of hand.” This isn’t correct. Pew actually gave me raw data, a total of approximately 30,000 respondents.
Problems with pre-election polls. As Kos and others have noted, pre-election polls have problems, most notably that they survey people months before the election and that these people did not necessarily vote. On the upside, as Nate and others have found, pre-election polls match actual elections very well on a state-by-state basis–Nate’s pre-election poll aggregations were within 1% of the election outcome in 22 states and within 3% in 39 out of 50 states–so they’re not as bad as you might think.
Problems with exit polls. Exit polls have the advantage of reaching people right after they voted, but they have lots of other problems. In 2008 and in earlier years as well, exit polls had huge nonresponse biases. I’m still willing to use exit polls–you gotta use what you got–but we’re deluding ourselves if we think they’re a gold standard.
Disagreements between pre-election and exit polls. In Red State, Blue State we replicated many of our 2000 and 2004 analyses with Annenberg pre-election polls and Voter News Service exit polls. We generally found similar but of course not identical results. When studying separate income categories in small states, you really do run into sample size issues.
Beyond this, there can be systematic differences. Consider this graph that I posted a while ago of the overall pattern of income and voting in 2008, looking separately at Pew and exit polls:
The two sources of data disagree on the very important question of whether the richest voters went for Obama or McCain. (The highest income category from the Pew surveys is “$150,000+”, so we can’t do a direct comparison at the top.) There’s no easy answer here and no easy resolution; there really just is a difference in what these polls say. And, no, I don’t necessarily believe the exit polls just because they are surveying actual voters. Again, exit polls have problems too.
Non-Hispanic whites. As I wrote in my explanation, in my maps of whites I included those survey respondents who described themselves as white and not Hispanic, because I think this most closely captures what commentators are talking about when they talk about “white voters.” I admit, though, that my graphs were simply labeled “Whites only,” which might have misled people into thinking that I was including Hispanic whites as well.
Differences between raw data and my best model-assisted estimates. Because of sample size, you can’t necessarily trust the raw data in any given state.
Consider Kos’s example of Colorado. Here are the raw data from the Pew pre-election poll for the five income categories:
Obama 26 28 67 51 15
McCain 2 36 45 56 16
This gives McCain 45% of the two-party vote in the state, which comes out about right. Things come out slightly differently when you include the survey weights. In any case, let’s now break it up by ethnicity:
Obama 9 25 51 41 12
McCain 2 32 43 50 15
Obama 4 0 5 2 0
McCain 0 0 0 0 0
Obama 12 3 2 3 3
McCain 0 0 2 5 0
Obama 0 0 0 2 0
McCain 0 0 0 0 1
Other/decline to state:
Obama 1 0 9 3 0
McCain 0 4 0 1 0
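To make the raw-count arithmetic above concrete, here is a minimal sketch of the two-party share computation, plus a weighted version. The weights here are made up purely for illustration–they are not Pew’s actual survey weights:

```python
# Raw Pew counts for Colorado, five income categories (lowest to highest)
obama = [26, 28, 67, 51, 15]
mccain = [2, 36, 45, 56, 16]

# Unweighted two-party share for McCain across the whole state
mccain_share = sum(mccain) / (sum(obama) + sum(mccain))
print(round(mccain_share, 3))  # about 0.45

# Survey weights shift this slightly. With a weight per respondent, the
# weighted share is (weighted McCain count) / (weighted total count).
# Hypothetical example: upweight the two lowest income categories by 1.2.
weights = [1.2, 1.2, 1.0, 1.0, 1.0]
w_mccain = sum(w * m for w, m in zip(weights, mccain))
w_total = sum(w * (o + m) for w, o, m in zip(weights, obama, mccain))
print(round(w_mccain / w_total, 3))
```

In a real analysis the weight varies respondent by respondent, not category by category, but the formula is the same.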
Here are my estimates of McCain’s share of the vote in each of the five income categories in Colorado:
0.35 0.43 0.48 0.53 0.57
And here are my estimates just for non-Hispanic whites in Colorado:
0.48 0.51 0.54 0.56 0.58
Because of the small sample size, I couldn’t just take the raw numbers. But I’m willing to believe there are problems with the model, and I’m amenable to working to improve it.
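The actual model is a multilevel regression, but the basic reason small-state raw numbers get adjusted can be sketched with a simple precision-weighted (partial pooling) estimate. The prior scale tau and all the numbers below are illustrative assumptions, not the fitted model:

```python
def pooled_estimate(p_state, n_state, p_national, tau=0.05):
    """Shrink a state's raw vote share toward the national level.

    Weights are inverse variances: a noisy small-sample estimate gets
    pulled strongly toward p_national; a precise one barely moves.
    """
    var_state = p_state * (1 - p_state) / n_state  # sampling variance
    w = (1 / var_state) / (1 / var_state + 1 / tau**2)
    return w * p_state + (1 - w) * p_national

# Same raw share of 0.60, different sample sizes (illustrative numbers)
small = pooled_estimate(0.60, 20, 0.45)    # pulled most of the way back
large = pooled_estimate(0.60, 2000, 0.45)  # stays close to the raw 0.60
print(round(small, 3), round(large, 3))
```

With 20 respondents the estimate lands much nearer the national 0.45 than the raw 0.60; with 2000 it barely moves. That is the behavior you want in general, but, as the New Hampshire example below shows, shrinkage toward the wrong target can go awry.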
Thresholding at 50%. In New Hampshire, for example, I estimate McCain as getting 50.2% of the two-party vote among whites in the lowest income category, 50.3% of the next income category, then 50.1%, 50.0%, and 49.8%. I agree that the sharp color change of the map can make things look more dramatic than they really are. I was giving my best estimates, recognizing that some states are essentially tied.
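The thresholding issue is easy to see in code: a hard 50% cutoff turns near-ties into solid colors. A sketch using the New Hampshire estimates above (the at-or-above-50% convention for red is an assumption for illustration):

```python
def state_color(mccain_share, threshold=0.50):
    # Hard threshold: anything at or above 50% is colored red
    return "red" if mccain_share >= threshold else "blue"

# Estimated McCain share among NH whites, lowest to highest income category
nh_estimates = [0.502, 0.503, 0.501, 0.500, 0.498]
print([state_color(s) for s in nh_estimates])
# All five categories are essentially tied, yet the map paints
# four of them red and one blue.
```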
Small sample sizes in some states. Kos is right–there’s something wrong with my New Hampshire numbers. McCain won only 45% of the vote in New Hampshire, and the state is nearly 100% white, so clearly I am mistaken in giving him 50% of the vote in each income category. My guess as to what is happening here is that, with such a small sample size in the state, the model shifted the estimates over toward what was happening in other states. This wasn’t a problem in my map of all voters, because I adjusted the total to the actual vote, but for the map of just whites it was a problem, because my model didn’t “know” that New Hampshire was nearly all white. In the fuller model currently being fit by Yair, this problem will be solved, because we’ll be averaging over population categories within each state.
In the meantime, though, yeah, I should’ve realized New Hampshire had a problem.
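Averaging over population categories within each state is poststratification, and it shows why an all-white state can’t drift far from its white-voter estimate. A minimal sketch, where the New Hampshire population shares and group estimates are rough illustrative numbers, not actual census or model output:

```python
def poststratify(cell_estimates, cell_shares):
    """State-level estimate: population-weighted average over demographic cells."""
    return sum(cell_estimates[c] * cell_shares[c] for c in cell_estimates)

# Illustrative (not actual census) population shares for New Hampshire
nh_shares = {"white": 0.95, "black": 0.01, "hispanic": 0.02, "other": 0.02}
# Hypothetical modeled McCain shares by group
nh_cells = {"white": 0.46, "black": 0.05, "hispanic": 0.25, "other": 0.35}

state_level = poststratify(nh_cells, nh_shares)
print(round(state_level, 3))
# With whites at 95% of the population, the state total necessarily
# tracks the white-voter estimate closely.
```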
More generally, in the smaller states which have smaller sample sizes (160 for New Hampshire), estimates from polls are always going to be more speculative. In most of my analyses I’m interested in national patterns and more aggregate comparisons among states, as, for example, here:
If you’re Kos and are interested in electoral strategies in individual states, then I’d say that my analyses and maps are a guide to national patterns and might be useful in suggesting more detailed state-by-state analyses using richer datasets. Exit polls are no panacea but are certainly a place to start here, and I plan to analyze them once I can get my hands on them.