Can you trust international surveys?

On the sister blog I report on a new paper, “Don’t Get Duped: Fraud through Duplication in Public Opinion Surveys,” by Noble Kuriakose, a researcher at SurveyMonkey, and Michael Robbins, a researcher at Princeton and the University of Michigan, who gathered data from “1,008 national surveys with more than 1.2 million observations, collected over a period of 35 years covering 154 countries, territories or subregions.”

They did some forensics, looking for duplicate or near-duplicate records as a sign of faked data, and they estimate that something like 20% of the surveys they studied had “substantial data falsification via duplication.”

These were serious surveys such as Afrobarometer, Arab Barometer, Americas Barometer, International Social Survey, Pew Global Attitudes, Pew Religion Project, Sadat Chair, and World Values Survey. To the extent these surveys are faked in many countries, we should really be questioning what we think we know about public opinion in these many countries.

The claims of Kuriakose and Robbins are controversial. Their method for detecting fraud has lots of degrees of freedom—they’re not looking for exact duplication, merely near-duplication, and their numbers are compared to a simplified baseline model. In a review performed by Pew Research, Katie Simmons, Andrew Mercer, Steve Schwarzer and Courtney Kennedy argue that “natural, benign survey features can explain high match rates.”

So I’m not quite sure what to think. My guess right now is that the number of duplicate responses is high enough that, once this all shakes out, we will indeed take it as strong evidence of some duplication of the survey responses. But I haven’t looked at all the details. It seems worth looking at for anyone who has used these surveys or is planning to use these surveys to understand international public opinion.

Let’s get specific

One thing that bugs me is that Kuriakose and Robbins identified surveys where they think there’s fraud—but then they don’t tell us which surveys in which countries had the problems! If 20% of surveys have lots of duplicate records, does that mean 80% are basically ok? And what about those 20%? We should be told which surveys they are, and even which records are duplicated!

The data are all out there so I guess other researchers could download the surveys, one by one, and check for duplicates themselves. But since Kuriakose and Robbins already did the work, why not just share the information?

CheckIt

What we really need, I guess, is an R package called CheckIt which will automatically scan a survey for near-duplicate records and also report some sort of summary. Then someone can write a script and run CheckIt on thousands of surveys and post the results somewhere.
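Here is a rough sketch of what the core of such a tool might compute, written in Python rather than R just to keep it short. For each respondent it finds the largest share of answers that match any other respondent, then flags respondents at or above a cutoff. The function names are made up for illustration, the 0.85 default follows the 85% cutoff attributed to Kuriakose and Robbins, and the code assumes every record has the same nonzero number of answers. It is not their Stata code.

```python
from typing import Dict, List, Sequence


def max_percent_match(rows: Sequence[Sequence[object]]) -> List[float]:
    """For each respondent, the highest share of answers shared with any other respondent.

    Assumes every row has the same nonzero number of answers (hypothetical input format).
    """
    n = len(rows)
    best = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Count the questions on which respondents i and j gave the same answer.
            same = sum(a == b for a, b in zip(rows[i], rows[j]))
            share = same / len(rows[i])
            best[i] = max(best[i], share)
            best[j] = max(best[j], share)
    return best


def summarize(rows: Sequence[Sequence[object]], threshold: float = 0.85) -> Dict[str, object]:
    """Report which respondents have a near-duplicate at or above the threshold."""
    scores = max_percent_match(rows)
    flagged = [i for i, s in enumerate(scores) if s >= threshold]
    return {
        "n": len(rows),
        "flagged": flagged,
        "share_flagged": len(flagged) / len(rows),
    }


if __name__ == "__main__":
    toy_survey = [
        [1, 2, 3, 4],
        [1, 2, 3, 4],  # exact duplicate of the first record
        [4, 3, 2, 1],
        [1, 2, 3, 0],  # matches the first two on 3 of 4 answers
    ]
    print(summarize(toy_survey))
```

The pairwise loop is quadratic in the number of respondents, which is fine for a single national survey but would want something smarter before running it on thousands of them. And a high share of flagged records is only a reason to look closer, not proof of anything.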

P.S. Mike Spagat points to this page with Stata code by Kuriakose for the checking of duplicates. He also says that other people are coding it in R. I guess it would be easy enough to recode from scratch.

10 Comments

  1. EpiPete says:

    The Pew paper is quite thorough and well worth the read.

    For fellow MoTs, there’s a chuckle-inducing table towards the end of the paper where they stratify “high match” percentages in the 2014 Religious Landscape Study by religion.

  2. Curt B says:

    A very important topic… There is work being done, though, in the area of survey data quality for international nutrition and other health surveys. ENA software, for example, performs various data quality checks and gives a final quality score. These checks are not based on duplications per se, but on expected distributions of age, gender, standard deviation of WHZ, etc. for children under 5 years of age.

  3. Rahul says:

    The devil’s in the detail: How do they define a “near-duplicate”?

  4. lewis55 says:

    I get the overall idea and it makes sense, but I don’t see how there could be any consistent/objective cross-survey standard for “near-duplicate” from this method that you could pick up and apply to a specific survey you wanted to test for fraud without knowing the details of the survey. What looks fishy in one survey might be explainable in another. Depending on the qualitative content of the survey and the survey questions, duplicates should be more or less likely.

    Let’s say you’re a political scientist doing one of these various barometer surveys and you’re interviewing people in an ethnically homogeneous poor area, where nearly everybody supports political party A, and thus has similar views about the whole range of partisan political issues you’re going to ask them about. Responses to a lot of these questions should be correlated with each other if they’re all explained by underlying partisanship and demographic characteristics. So in that survey you’ll get relatively more “near duplicate” responses than in a survey on the same questions with a more ethnically, politically, and socio-economically diverse population, even assuming no fraud. I fear people who are not topical experts could misuse this tool and become overzealous, labeling the first survey as somehow “more” fraudulent.

    I guess it would only work if you set a very high level of duplication as the cutoff to label the survey fraudulent to estimate a lower bound on the number of surveys that might be fraudulent. But what should that bar be?

  5. Michael Spagat says:

    The Kuriakose and Robbins test could be used in two distinct ways.

    1. To get a sense of how widespread fabrication is for some class of surveys, e.g. international surveys.

    2. To trigger an investigation into a particular survey.

    For the first purpose we need to attempt some assessment of the prevalence of false positives and false negatives although, obviously, this is hard to do. But we should remember that K and R consider only one form of fabrication which suggests to me that the false negative rate could be pretty high.

    For the second purpose any test can only trigger further investigation except possibly in an extreme case where you have a lot of exact duplicates.

    The Pew rebuttal is interesting but in the end its main point is that the 85% cut-off is not magic. This has to be true. In statistics the only magical cut-off that separates fact from fiction is the p = 0.05 one (that was a joke).

    By the way, Kuriakose and Robbins recommend also looking at the distribution of max duplicates but the Pew analysis ignores this suggestion and looks only at the threshold.

    I think that the Pew chunk of the data used by K and R can be found by going here:

    http://www.pewglobal.org/category/datasets/

    and downloading “Spring 2013 data”, “Spring 2012 data” etc.

    Maybe someone has time to have a look?

  6. Jake Bowers says:

    Hi All,

    I asked myself what distribution the Kuriakose and Robbins fraud test statistic would take if there were “no fraud” and played around a bit with that idea here https://github.com/jwbowers/kuriakoserobins . If you mess around with that code you’ll see that I put “no fraud” in quotes because, in the end, I wasn’t sure how to represent that case. I came up with a couple of options: either the “no fraud” world is the world in which, in essence, you can swap the values of variables across columns within person, or the “no fraud” world is the one in which the responses of one person were not related to the responses of another person. In the end, I wanted to preserve what one might call the natural correlation between variables and the natural dependence among people while breaking any fraud-based dependencies. I didn’t figure that out. Maybe some of you will know how to think about this.

    In the end, I thought that the idea of a test statistic for fraud is a good idea, but, since I wasn’t sure what the “non fraud” distribution ought to arise from, I didn’t have a good idea about whether or not their idea performed well. The results on that github site suggest that high numbers on their test statistic are common even when relationships within and across people are broken. But I’m not sure if that is the best standard of comparison. And, so far, I’m not sure what the standard of comparison ought to be for their application.

    Jake
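One crude version of the “no fraud” comparison raised in the last comment can be sketched as a permutation check: shuffle each question’s answers independently across respondents, recompute the share of respondents with a high maximum match, and repeat. This is only an illustration with made-up function names, not the code in the linked repository. As that comment warns, shuffling this way also destroys the natural correlation between questions and among people, so it is not obviously the right baseline. The sketch assumes every record has the same nonzero number of answers.

```python
import random
from typing import List, Sequence


def share_high_match(rows: Sequence[Sequence[object]], threshold: float = 0.85) -> float:
    """Share of respondents whose best match with any other respondent is >= threshold."""
    n = len(rows)
    best = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            share = sum(a == b for a, b in zip(rows[i], rows[j])) / len(rows[i])
            best[i] = max(best[i], share)
            best[j] = max(best[j], share)
    return sum(s >= threshold for s in best) / n


def shuffle_columns(rows: Sequence[Sequence[object]], rng: random.Random) -> List[List[object]]:
    """Permute each question's answers independently across respondents.

    Each column keeps its marginal distribution, but links between questions
    and between people are broken.
    """
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def permutation_baseline(
    rows: Sequence[Sequence[object]],
    n_sims: int = 200,
    threshold: float = 0.85,
    seed: int = 0,
) -> List[float]:
    """Statistic recomputed on n_sims independently shuffled copies of the data."""
    rng = random.Random(seed)
    return [share_high_match(shuffle_columns(rows, rng), threshold) for _ in range(n_sims)]


if __name__ == "__main__":
    toy_survey = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
    observed = share_high_match(toy_survey)
    baseline = permutation_baseline(toy_survey)
    # Fraction of shuffled data sets that look at least as duplicated as the real one.
    print(observed, sum(b >= observed for b in baseline) / len(baseline))
```

If shuffled data often scores as high as the real data, the statistic is not separating fabrication from ordinary coincidence for that questionnaire, which is roughly what the comment reports seeing.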
