Bad Numbers: Media-savvy Ivy League prof publishes textbook with a corrupted dataset

[cat picture]

I might not have noticed this one, except that it happened to involve Congressional elections, and this is an area I know something about.

The story goes like this. I’m working to finish up Regression and Other Stories, going through the examples. There’s one where we fit a model to predict the 1988 elections for the U.S. House of Representatives, district by district, given the results from the previous election and incumbency status. We fit a linear regression, then used the fitted model to predict 1990, then compared to the actual election results from 1990. A clean example with just a bit of realism—the model doesn’t fit perfectly, there’s some missing data, there are some choices in how to set up the model.

This example was in Data Analysis Using Regression and Multilevel/Hierarchical Models (Regression and Other Stories is the updated version of the first half of that book), and for this new book I just want to redo the predictions using stan_glm() and posterior_predict(), which is simpler and more direct than the hacky way we were doing predictions before.

So, no problem. In the new book chapter I adapt the code, cleaning it in various places, then I open an R window and an emacs window for my R script and check that everything works ok. Ummm, first I gotta find the directory with the old code and data, I do that, everything seems to work all right. . . .

I look over what I wrote one more time. It’s kinda complicated: I’d imputed winners of uncontested elections at 75% of the two-party vote—that’s a reasonable choice, it’s based on some analysis we did many years ago of the votes in districts the election before or after they became uncontested—but then there was a tricky thing where I excluded some of these when fitting the regression and put them back in the imputation. In rewriting the example, it seemed simpler to just impute all those uncontested elections once and for all and then do the modeling and fitting on all the districts. Not perfect—and I can explain that in the text—but less of a distraction from the main point in this section, which is the use of simulation for nonlinear predictors, in this case the number of seats predicted to be won by each party in the next election.
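To make the imputation and the seat-count simulation concrete, here is a minimal sketch in Python (not the R code from the book). It assumes a plain normal residual model with a made-up standard deviation and ignores uncertainty in the regression coefficients, which posterior_predict() would include. The function names and the example numbers are hypothetical.

```python
import random

# Imputed Democratic share when a Democrat runs unopposed (75%, per the text).
UNCONTESTED_DEM_SHARE = 0.75


def dem_share(dem_votes, rep_votes):
    """Democratic share of the two-party vote, imputing uncontested races."""
    if dem_votes > 0 and rep_votes == 0:
        return UNCONTESTED_DEM_SHARE        # Democrat ran unopposed
    if rep_votes > 0 and dem_votes == 0:
        return 1 - UNCONTESTED_DEM_SHARE    # Republican ran unopposed
    if dem_votes > 0 and rep_votes > 0:
        return dem_votes / (dem_votes + rep_votes)
    return None                             # no usable vote counts


def simulate_dem_seats(predicted_shares, resid_sd, n_sims=1000, seed=1):
    """Simulate vote shares around point predictions and count seats won.

    A seat counts as a Democratic win when the simulated share exceeds 0.5.
    The seat total is a nonlinear function of the shares, which is why it is
    summarized by simulation rather than by plugging in the point predictions.
    """
    rng = random.Random(seed)
    seat_counts = []
    for _ in range(n_sims):
        seats = sum(1 for mu in predicted_shares if rng.gauss(mu, resid_sd) > 0.5)
        seat_counts.append(seats)
    return seat_counts


# Hypothetical point predictions for five districts, with an assumed residual sd.
sims = simulate_dem_seats([0.62, 0.48, 0.75, 0.51, 0.30], resid_sd=0.07)
print(min(sims), max(sims))
```

The spread of the simulated seat counts is the quantity of interest; a single plug-in count would hide how many districts sit close to 50%.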

Here’s what I had in the text: “Many of the elections were uncontested in 1988, so that y_i = 0 or 1 exactly; for simplicity, we exclude these from our analysis. . . . We also exclude any elections that were won by third parties. This leaves us with n = 343 congressional elections for the analysis.” So I went back to the R script and put the (suitably imputed) uncontested elections back in. This left me with 411 elections in the dataset, out of 435. The rest were NA’s. And I rewrote the paragraph to simply say: “We exclude any elections that were won by third parties in 1986 or 1988. This leaves us with $n=411$ congressional elections for the analysis.”

But . . . wait a minute! Were there really 24 districts won by third parties in those years? That doesn’t sound right. I go to one of the relevant data files, “1986.asc,” and scan down until I find some of the districts in question:

The first column’s the state (we were using “ICPSR codes,” and states 44, 45, and 46 are Georgia, Louisiana, and Mississippi, respectively), the second is the congressional district, third is incumbency (+1 for Democrat running for reelection, -1 for Republican, 0 for an open seat), and the last two columns are the votes received by the Democratic and Republican candidates. If one of those last two columns is 0, that’s an uncontested election. If both are 0, I was calling it a third-party victory.
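The column layout just described can be turned into a small row classifier. This is only a sketch in Python: it assumes the rows are whitespace-separated integers in the order given above, and the sample rows are made up for illustration, not copied from 1986.asc. The status labels are hypothetical names.

```python
def classify_row(line):
    """Classify one row: state, district, incumbency, Dem votes, Rep votes."""
    state, district, incumbency, dem, rep = (int(x) for x in line.split())
    if dem == -9 or rep == -9:
        # Codebook: no election, third-party victory, or at-large election.
        # These rows deserve a manual check before being dropped.
        status = "coded_missing"
    elif dem == 0 and rep == 0:
        status = "undocumented"   # the codebook does not cover this case
    elif dem == 0 or rep == 0:
        status = "uncontested"    # one major party received no votes
    else:
        status = "contested"
    return {"state": state, "district": district,
            "incumbency": incumbency, "status": status}


# Made-up rows in the assumed layout (not real election returns).
sample_rows = [
    "44 1 1 70000 30000",
    "45 2 1 90000 0",
    "45 6 -1 -9 -9",
]
for row in sample_rows:
    print(classify_row(row))
```

Keeping "coded_missing" separate from "uncontested" makes it harder to fold both into a single NA bucket without noticing.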

But can this be right?

Here’s the relevant section from the codebook:

Nothing about what to do if both columns are 0.

Also this:

For those districts with both columns -9, it says the election didn’t take place, or there was a third party victory, or there was an at-large election.

Whassup? Let’s check Louisiana (state 45 in the above display). Google *Louisiana 1986 House of Representatives Elections* and it’s right there on Wikipedia. I have no idea who went to the trouble of entering all this information (or who went to the trouble of writing a computer program to enter all this information), but here it is:

So it looks like the data table I had was just incomplete. I have no idea how this happened, but it’s kinda embarrassing that I never noticed. What with all those uncontested elections, I didn’t really look carefully at the data with -9’s in both columns.

Also, the incumbency information isn’t all correct. Our file had LA-6 with a Republican incumbent running for reelection, but according to Wikipedia, the actual election was an open seat (but with the Republican running unopposed).

I’m not sure what’s the best way forward. Putting together a new dataset for all those decades of elections, that would be a lot of work. But maybe such a file now exists somewhere? The easiest solution would be to clean up the existing dataset just for the three elections I need for the example: 1986, 1988, 1990. On the other hand, if I’m going to do that anyway, maybe better to use some more recent data, such as 2006, 2008, 2010.

No big deal—it’s just one example in the book—but, still, it’s a mistake I should never have made.

This is all a good example of the benefits of a reproducible workflow. It was through my efforts to put together clean, reproducible code that I discovered the problem.
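One way to build that kind of check into a script is to have it fail loudly when more districts are dropped than expected. The sketch below is in Python; the function name and the tolerance values are hypothetical, while the counts 435 and 411 come from the text above.

```python
def audit_exclusions(n_total, n_kept, max_excluded_fraction):
    """Return the number of excluded cases, raising if the share is too large."""
    n_excluded = n_total - n_kept
    fraction = n_excluded / n_total
    if fraction > max_excluded_fraction:
        raise ValueError(
            f"{n_excluded} of {n_total} cases excluded "
            f"({fraction:.1%}); expected at most {max_excluded_fraction:.1%}"
        )
    return n_excluded


# 435 districts, 411 kept for the analysis: 24 are being dropped.
print(audit_exclusions(435, 411, max_excluded_fraction=0.10))

# With a stricter, equally hypothetical tolerance the same counts raise an error:
# audit_exclusions(435, 411, max_excluded_fraction=0.01)
```

The right tolerance depends on what is known about the data; for third-party House winners in a single year it would presumably be far below 24 seats.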

Also, errors in this dataset could have propagated into errors in these published articles:

[2008] Estimating incumbency advantage and its variation, as an example of a before/after study (with discussion). Journal of the American Statistical Association 103, 437–451. (Andrew Gelman and Zaiying Huang)

[1991] Systemic consequences of incumbency advantage in U.S. House elections. American Journal of Political Science 35, 110–138. (Gary King and Andrew Gelman)

[1990] Estimating incumbency advantage without bias. American Journal of Political Science 34, 1142–1164. (Andrew Gelman and Gary King)

I’m guessing that the main conclusions won’t change, as the total number of these excluded cases is small. Of course those papers were all written before the era of reproducible analyses, so it’s not like the data and code are all there for you to re-run.

1. Tyrtaeus says:

I dare anyone to find an academic as classy and forthright as Andrew Gelman! Thank you for the update, and showing what methodological honesty and professional transparency should be in an age of p-hacking and publication bias. Hats off to you, Sir!

• Andrew says:

Tyrtaeus:

Thanks—but I’d rather be right the first time, than wrong and classy in the correction! Anyway, it’s all a good lesson moving forward.

• Bob says:

I read this post and wanted to generate a one-word reply “classy” but Tyrtaeus beat me to it. Damn! -1

Bob

2. James says:

I am confused…in the snippet of data, I don’t see any where both of the last two columns are zero, just cases where the last two columns are -9.

• Andrew says:

James:

Ahh, yes, you’re right. That’s why I’d mistakenly assumed that these were elections won by third parties, because in the codebook it said it had to be a third-party winner or an at-large election. I guess there’s an outside possibility that the elections there were at-large, but I don’t think so, and these cases are not listed in the excepth.asc file. It seems that they were just missing data. Perhaps whoever was putting the file together couldn’t find those numbers, for whatever reasons, and just left the columns blank, and then later when we did the analysis we just never went back to check.

3. Alex Gamma says:

34?

4. “Of course those papers were all written before the era of reproducible analyses, so it’s not like the data and code are all there for you to re-run.”

That was certainly before the mainstream started being more careful, but long after reproducible computation. This is something computer scientists have known how to do for a long time. My first book (1992) had a software package (not statistical) that reproduced all the analyses and even let you create new ones; I was distributing it via FTP before there was a World Wide Web (of course, the internet existed long before the web). We were sharing reproducible analyses over the internet in the mid-80s. By the mid-to-late 90s, I was writing reproducible stats packages in a machine-learning style for search and speech recognition, but I couldn’t share them outside of Bell Labs. It’s just a natural thing for someone in software to want to have reproducible results. It’s just a basic sanity check that the program does what it says on the tin.

• Keith O'Rourke says:

I remember doing this in the mid-80s too – but I think it required some real support from management to spend the time and effort. That might have been what was rare at the time.

(I also remember programming groups at various companies strongly objecting to efforts to ensure their work was reproducible and managers backing down on it.)

• Andrew says:

Bob:

Fair enough. The ideas and techniques of reproducibility were out there but I was unfamiliar with their importance.