On the sister blog I report on a new paper, “Don’t Get Duped: Fraud through Duplication in Public Opinion Surveys,” by Noble Kuriakose, a researcher at SurveyMonkey, and Michael Robbins, a researcher at Princeton and the University of Michigan, who gathered data from “1,008 national surveys with more than 1.2 million observations, collected over a period of 35 years covering 154 countries, territories or subregions.”
They did some forensics, looking for duplicate or near-duplicate records as a sign of faked data, and they estimate that something like 20% of the surveys they studied had “substantial data falsification via duplication.”
These were serious surveys such as Afrobarometer, Arab Barometer, Americas Barometer, International Social Survey, Pew Global Attitudes, Pew Religion Project, Sadat Chair, and World Values Survey. To the extent that these surveys are faked in many countries, we should really be questioning what we think we know about public opinion in those countries.
The claims of Kuriakose and Robbins are controversial. Their method for detecting fraud has lots of degrees of freedom—they’re not looking for exact duplication, merely near-duplication, and their numbers are compared to a simplified baseline model. In a review performed by Pew Research, Katie Simmons, Andrew Mercer, Steve Schwarzer and Courtney Kennedy argue that “natural, benign survey features can explain high match rates.”
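To get a feel for the Pew objection, here is a rough simulation sketch. It is my own toy setup, not the baseline model from either paper: every respondent answers every question independently and uniformly at random, so there is no fraud by construction. The function names (`max_match_shares`, `simulate`) and all the sizes are made up for illustration. The point is only that a short questionnaire with few response options can produce very high match rates purely by chance.

```python
import random

def max_match_shares(rows):
    """For each row, the largest share of answers it shares with any other row."""
    shares = []
    for i, a in enumerate(rows):
        best = 0.0
        for j, b in enumerate(rows):
            if i != j:
                same = sum(x == y for x, y in zip(a, b))
                best = max(best, same / len(a))
        shares.append(best)
    return shares

def simulate(n_respondents, n_questions, n_options, seed=0):
    """Average maximum match share when all answers are independent and uniform.

    Assumption: real survey answers are neither independent nor uniform,
    so this understates how alike honest respondents can look.
    """
    rng = random.Random(seed)
    rows = [[rng.randrange(n_options) for _ in range(n_questions)]
            for _ in range(n_respondents)]
    shares = max_match_shares(rows)
    return sum(shares) / len(shares)

# Hypothetical designs: a short yes/no survey versus a long five-option survey.
short_survey = simulate(n_respondents=200, n_questions=10, n_options=2)
long_survey = simulate(n_respondents=200, n_questions=100, n_options=5)
print(f"short survey: {short_survey:.2f}, long survey: {long_survey:.2f}")
```

The short survey should come out with a much higher average maximum match than the long one, even though nothing was copied. So any cutoff for "suspicious" has to be judged against the design of the particular survey.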
So I’m not quite sure what to think. My guess right now is that the number of duplicate responses is high enough that, once this all shakes out, we will indeed take it as strong evidence of some duplication of the survey responses. But I haven’t looked at all the details. It seems worth looking at for anyone who has used these surveys or is planning to use these surveys to understand international public opinion.
Let’s get specific
One thing that bugs me is that Kuriakose and Robbins identified surveys where they think there’s fraud—but then they don’t tell us which surveys in which countries had the problems! If 20% of surveys have lots of duplicate records, does that mean 80% are basically ok? And what about those 20%? We should be told which surveys they are, and even which records are duplicated!
The data are all out there so I guess other researchers could download the surveys, one by one, and check for duplicates themselves. But since Kuriakose and Robbins already did the work, why not just share the information?
What we really need, I guess, is an R package called CheckIt which will automatically scan a survey for near-duplicate records and also report some sort of summary. Then someone can write a script and run CheckIt on thousands of surveys and post the results somewhere.
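Just to make the idea concrete, here is a minimal sketch of what such a checker might do, written in Python rather than R. Everything in it is hypothetical: `check_survey` and `check_many` are names I made up, the 0.85 default cutoff is an illustrative assumption and not something stated in this post, and a real tool would have to deal with missing values, skip patterns, and which variables to include. It compares every pair of records, so it is quadratic in the number of respondents.

```python
from itertools import combinations

def check_survey(records, threshold=0.85):
    """Scan one survey (a list of equal-length answer lists) for near-duplicates.

    threshold is an assumed cutoff on the share of identical answers.
    """
    n = len(records)
    best = [0.0] * n   # highest match share seen for each record
    pairs = []         # (i, j, share) for every pair at or above the cutoff
    for i, j in combinations(range(n), 2):
        a, b = records[i], records[j]
        share = sum(x == y for x, y in zip(a, b)) / len(a)
        best[i] = max(best[i], share)
        best[j] = max(best[j], share)
        if share >= threshold:
            pairs.append((i, j, share))
    flagged = sum(s >= threshold for s in best)
    return {
        "n_records": n,
        "n_flagged": flagged,
        "share_flagged": flagged / n if n else 0.0,
        "pairs": pairs,
    }

def check_many(surveys, threshold=0.85):
    """Run check_survey on a dict mapping survey name to its records."""
    return {name: check_survey(recs, threshold) for name, recs in surveys.items()}

# Toy data, invented for illustration only.
toy = {
    "survey_a": [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]],
    "survey_b": [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
}
report = check_many(toy)
for name, summary in report.items():
    print(name, summary["n_flagged"], "of", summary["n_records"], "flagged")
```

The summary reports both how many records are flagged and which pairs triggered the flag, which is exactly the information I'd like to see posted for each survey.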
P.S. Mike Spagat points to this page with Stata code by Kuriakose for the checking of duplicates. He also says that other people are coding it in R. I guess it would be easy enough to recode from scratch.