## Is 8+4 less than 3? 11% of respondents say Yes!

Shane Frederick shares some observations regarding junk survey responses:

Obviously, some people respond randomly. For open-ended questions, it is pretty easy to determine the fraction who do so. In some research I did with online surveys, “asdf” was the most common response and “your mama” was 9th. This fraction is small (maybe 1-2%). But the fraction of random responses is harder to identify (and is likely higher) for items with binary and multichotomous response options, since many respondents must realize their random responding can go undetected. Hence, you can’t use the random response rate from open-ended questions to assess this. You can do other things to try to estimate it (like asking “Is 8+4 less than 3?” YES NO). But two problems remain: the fraction saying YES is a blend of random and perverse responding, and both of these vary across items. Dramatically.

I put up a few questions on Google Consumer Surveys with large samples. Random + perverse response rates differ dramatically:

Do you have a fraternal twin? YES NO
4% Yes. *Pretty close to truth*

Do you have an identical twin? YES NO
8% Yes. *Pretty far from truth, but funnier to lie about?*

Is 8+4 less than 3? YES NO
11% Yes. *Profound innumeracy, confusion, or just fucking with me?*

Were you born on the planet Neptune? YES NO
17% Yes. *Perhaps using it metaphorically, as in “My friends say I’m a weird guy”?*

In a recent published paper I [Frederick] averred that you could just multiply the number of people who endorse something crazy by the number of response options to estimate the fraction of random responders. But this is obviously wrong.

So, basically, I’m not sure what to do. You could look at response latencies or something, but then you end up imposing some arbitrary thresholds which are unsatisfying, much like removing outliers without any good justification that the responses are not sincere.

My reply: These responses are hilarious. I believe there is some literature on this sort of thing, but I’m not the expert on it. I’ve looked a bit into list experiments (you can search my blog; I have a post with a title like, A list of reasons not to trust list experiments), but there seems to be a lot of information in the actual responses here. Maybe you could learn something by regressing these on demographics, and also by seeing whether the same people who give wrong answers to some of these give wrong answers to others.
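To make Frederick’s point concrete, here is a small simulation (all rates are made up for illustration) of his back-of-the-envelope estimator: multiply the fraction endorsing an impossible item by the number of response options to estimate the fraction of random responders. When some responders are perverse rather than random, the estimate is biased upward:

```python
import random

random.seed(0)

# Hypothetical rates, chosen only for illustration.
N = 100_000
p_random = 0.05    # answer uniformly at random
p_perverse = 0.05  # deliberately pick the "crazy" answer
k = 2              # number of response options (YES/NO)

yes = 0
for _ in range(N):
    r = random.random()
    if r < p_random:
        yes += random.random() < 1 / k   # random responder: YES with prob 1/k
    elif r < p_random + p_perverse:
        yes += 1                          # perverse responder: always YES
    # sincere responders say NO to "Is 8+4 less than 3?"

# Frederick's estimator: k times the observed endorsement rate.
naive_estimate = k * yes / N
print(f"true random fraction: {p_random}")
print(f"naive estimate:       {naive_estimate:.3f}")
```

With these numbers the observed YES rate is about p_random/k + p_perverse = 0.075, so the naive estimate lands near 0.15, three times the true random fraction of 0.05. Since the random/perverse mix varies by item, the bias varies by item too, which is his point.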

1. Dale says:

This seems quite important to me – and I am not familiar with what people have done, so references would be most appreciated. Given the emphasis you have placed on measurement in the past, this is a fundamental measurement issue. Surveys are fraught with problems – in my mind, none more important than the fact that there is little incentive for respondents to tell the truth or think carefully about their responses. Given this, it would seem that all surveys share a common measurement problem: are the responses accurate reflections of what the respondents believe? Evidence such as this shows that we must question the responses we get. Hopefully, there is some way to measure the potential bias or uncertainty due to inaccurate responses (whether they are due to intentional misrepresentation or just lack of interest or ability to respond).

Surely this is an issue for all political polling? It is also an issue in teaching evaluations. I have long wondered about some of the numerical responses in the latter – to the point that when someone checks strongly agree (or strongly disagree) for every question, I don’t really trust that they didn’t mean the opposite. The only part of the evaluations I trust at all are the written comments. So, I think it would be quite valuable if there were ways to estimate the degree to which survey responses are untrustworthy.

• There are a lot of different strategies. You can design in checks like infrequently-endorsed items, inconsistency scales, or directed questions. Or you can look for patterns in the data, like long strings of the same responses, multivariate outliers, etc. Here are a couple of references to articles that have looked at many different indices in the same dataset to estimate frequencies of careless responders, identify different patterns of careless responding (there’s more than one kind), and see which indices work the best at catching them:

Maniaci, M. R., & Rogge, R. D. (2014). Caring about carelessness: Participant inattention and its effects on research. Journal of Research in Personality, 48, 61-83.
http://www.sciencedirect.com/science/article/pii/S009265661300127X

Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437.

• James C. Whanger says:

You have to be careful about excluding these types of data when the overlap between inattention and the construct at hand may itself be meaningful. For example, you might exclude respondents with ADD or ADHD whose variation on other variables may be important to the study. Or you might exclude depressed respondents when you would not want to, etc.

For population level estimates, I believe it is better to model the problematic patterns and in applied settings to use them as flags. Throwing data out to improve the strength of a correlation seems a slippery slope, though I have certainly done similar things using similar rationale.

• James C. Whanger says:

I have to say that my thinking has changed drastically in this area and I would argue for the modeling of the response patterns and statistical corrections rather than data deletion.

2. Petter says:

Here is the post “Thinking of doing a list experiment? Here’s a list of reasons why you should think again”: http://andrewgelman.com/2014/04/23/thinking-list-experiment-heres-list-reasons-think/

3. Rahul says:

How about embedding a “test” question where it isn’t obvious that it is a test question?

Maybe something like: “Which one of these sites did you visit in the last 24 hours?” where the options are themselves created from Google’s knowledge of the person’s search history or something? Not sure about the privacy policy. Or ask them their age / sex etc. and compare it with the predicted age profile that Google offers?

4. James C. Whanger says:

If we assume a 12 hr. time format, then 89% got the question wrong.

5. Lord says:

I have seen a variety of things: raw checks like month of birth; increasing the number of wrong answers (what year is it?); asking the same thing in two ways to see if they match (age, year of birth); captchas; attention and numeracy checks (8+4 is…); checking literacy and reading by saying to ignore the question and check this or that; providing don’t-know options. Design questions matter too: if there is no correct or good option, can the question be skipped, or is a choice enforced? If the options are “last 3 months” or “never,” does never begin at 4 months or 6? Is the question factual or felt (my evil twin), general or personal (my inflation rate)? Are there ambiguities like brands vs. makers (products or companies), absolute vs. relative (trust, trust within purview, trust to do, trust not to do), or options distinguishable on some basis vs. similar to others? Speed can result in errors, but those errors aren’t necessarily random.

• Rahul says:

Do people model for the sequence number of the question concerned?

i.e., intuitively I’d trust respondents’ answers on early questions way more than on later ones.

• James C. Whanger says:

It is fairly common to use sequencing as a presentation condition to check for sequence effects.

• James C. Whanger says:

Another way is to construct an item both positively and negatively and check for a match.

“I will vote for Hillary Clinton.”
“I will not vote for Hillary Clinton”

Some use terms like ‘always’ and ‘never’ as checks under the assumption that neither is likely true.

• James C. Whanger says:

Though, I think the always and never checks can misdiagnose a thinking style as deception.

• Bill Jefferys says:

In the general election? In the primary? Could have “inconsistent” answers here, from someone who plans to vote for Bernie in the primary but doesn’t expect him to be nominated…

6. Rahul says:

Isn’t there a pretty big conflict of interest here? The survey designer / firm must hate to reveal that a large proportion of their respondents were junk.

• James C. Whanger says:

It just means you have to increase your sampling based on the estimated amount of junk.

7. ASV says:

Is the goal here to identify the percentage of bogus responders, or to identify them all individually for elimination from the dataset? Or to figure out whether their distribution is biased in some meaningful way? There seem to be different paths to each of those outcomes, and different levels of difficulty.

• James C. Whanger says:

ASV:

That depends on the setting. For a political poll, the correction is the way to go, as individual responses matter only as a reflection of the population of interest. In assessment, identifying problematic response patterns is useful because individual-level decisions will be made with the data, though a population-level correction could be useful in these situations as well, for regression-equation flags.

8. This is all going to look like missing data, only harder. Now there’s a latent “silliness” to the response, and the question is whether respondents are silly at random or silly with some pattern (sensitive questions, hard questions, late in the survey, etc.). And as @James C. Whanger says, how you should go about modeling this depends on what inferences you want to make.

• Martha says:

“This is all going to look like missing data, only harder” brings to mind something I recently saw pointed out:

One big problem in trying to use medical databases is this problem of “measurement uncertainty,” in one form or another. For example, the database does not have “date when disease developed,” but does have “date when disease was diagnosed.” So the latter must be used as a proxy for the former. But this becomes problematic when the question is “what contributes to developing the disease” (or “what can contribute to preventing the disease”), since symptoms arise from the disease but are used to diagnose it. The “real time” progression is causes –> disease –> symptoms –> diagnosis. In particular, both causes and symptoms precede diagnosis, which is used as a proxy for disease, but only causes precede onset of disease, so the big question is how to tease out which recorded events are causes and which are symptoms.

9. mark says:

When people answer randomly to a series of questions that are designed to measure the same thing, and you average or sum across all those items, the central limit theorem suggests that you will end up with a whole lot of responses in the middle of the response-option continuum. This becomes especially problematic when you assess something with a very low or very high base rate. Take the example of a psychopathy self-report inventory. Most people (who are paying attention) will respond with the “never” or “strongly disagree” option for questions like “How often do you torture animals?” or “I like to hurt people,” so the random responders end up looking like psychopaths. If you have a similar phenomenon happening for another variable, you can end up with some completely artifactual correlations driven by a small subset of random responders.
A lot of survey tools let you look at response times for individual questions, which allows at least some of these data points to be identified.
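Mark’s artifactual-correlation mechanism is easy to demonstrate. In this sketch (all parameters are invented for illustration), sincere respondents endorse almost nothing on two unrelated low-base-rate scales, while a small fraction of random responders answer uniformly on both, manufacturing a large correlation between scales that share nothing:

```python
import random

random.seed(1)

N = 5000
n_items = 10     # items per scale, each scored 0-4 ("never" .. "always")
p_random = 0.05  # hypothetical fraction who answer uniformly at random

def scale_score(is_random):
    """Sum of item scores for one respondent on one 10-item scale."""
    if is_random:
        return sum(random.randint(0, 4) for _ in range(n_items))
    # Sincere respondents rarely endorse anything on a low-base-rate scale.
    return sum(1 if random.random() < 0.02 else 0 for _ in range(n_items))

flags = [random.random() < p_random for _ in range(N)]
x = [scale_score(f) for f in flags]  # e.g., a psychopathy inventory
y = [scale_score(f) for f in flags]  # a completely unrelated inventory

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

print(f"correlation between unrelated scales: {pearson(x, y):.2f}")
```

Random responders score around 20 on each scale while sincere respondents score near 0, so the between-group separation dominates the variance and the two unrelated scales correlate strongly, purely as an artifact of the shared careless subset.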

• John L says:

Based on your examples, I’m not convinced you’ve ever even seen a self-report psychopathy scale. While I agree that responses from some individuals can be problematic in survey-type measures like self-report scales, there’s pretty good evidence in the case of psychopathy measured using such scales in university samples that people who score high and people who score low respond in different and systematic ways when asked to do more “objective” tasks (e.g., identify the emotion being expressed in a face).

• mark says:

Weirdly I am pretty familiar with self-report measures of psychopathy. Items very much like that form part of various self-report inventories of dark triad traits like the Comprehensive Assessment of Sadistic Tendencies by Buckels and Paulhus. Enjoyment from the pain of others is almost the definition of the hostile form of psychopathy and cruelty to animals is a well-known behavioral manifestation of psychopathy and sociopathy.

10. Kaiser says:

Use the randomized response technique? Just instruct the responder to pick a specific choice. Can even randomize that specific choice.
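Kaiser’s suggestion can be operationalized as an instructed-response item. Here is a minimal sketch with made-up response data: include an item whose only correct answer is the instructed one, and flag anyone who does not comply:

```python
# Hypothetical instructed-response check: the item text tells the
# respondent exactly which option to pick ("Please select 'Disagree'").
INSTRUCTED_ANSWER = "Disagree"

# Invented example responses; "check_item" holds each respondent's
# answer to the instructed-response item.
responses = [
    {"id": 1, "check_item": "Disagree", "q1": "Agree"},
    {"id": 2, "check_item": "Strongly agree", "q1": "Agree"},   # fails check
    {"id": 3, "check_item": "Disagree", "q1": "Neutral"},
]

flagged = [r["id"] for r in responses if r["check_item"] != INSTRUCTED_ANSWER]
print("flagged respondents:", flagged)
```

As the surrounding thread notes, failing one such check is evidence of inattention, not proof; in practice these flags are usually combined with other indices rather than used alone to drop respondents.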

• Phil says:

I don’t like that sort of check because it is a little insulting to the more serious respondents.

11. Zathras says:

With “Is 8+4 less than 3?”, I wonder if there were people skimming the question who saw 13 instead of 3. My first millisecond glance had my brain register it as 13. I don’t know why – maybe comparing 8+4 to 13 makes more intuitive sense?

• Lord says:

It is bound to be true under some modulus but they never ask me which.

12. Elin says:

Diane Ravitch sometimes claims that on the NAEP, high school seniors just pick all the same answers because they know it doesn’t mean anything. http://dianeravitch.net/2014/02/13/kids-whats-the-matter-with-kids-today/.

Once I was asked to analyze some surveys where the forms had been dropped off at people’s homes and picked up the next day. On some of the grid questions, people had just drawn a straight line down the “Strongly agree” column.

• zbicyclist says:

I admit it. If the survey states “this will take less than 5 minutes of your time,” I give them about 8 minutes and then figure that if they’re @#$%ing with me, I’ll @#$% with them.

I am not ashamed of this. Poorly designed surveys are a plague on the nation.

13. Phil says:

There are some great suggestions to ensure quality responses in this webinar: https://www.youtube.com/watch?v=PVWJg2vVwpU

14. dmk38 says:

Isn’t the first thing to ask whether this is a noise problem or a bias problem? People who answer randomly are noise. People who answer “perversely” could create bias. But that’s a problem closely related to “demand effects” & the like; a well-designed study should be able to rule out that the effect was a consequence of subjects deliberately trying to generate a particular outcome – whether to “please” the researcher, or to shock him or her (or, more likely, to aid & abet the researcher in generating a shocking bogus result). The bottom line is to design the study in a manner that gives you & other reasonable people confidence that disruptive subjects could only have created noise & not bias.

And then live w/ it; if you try to remove “noisy” subjects to improve the strength or “significance” of your result, there is too big a risk that you’ll throw out subjects whose only offense is that they didn’t behave consistently with your hypothesis.

15. Jay Verkuilen says:

There is a literature on misclassification. A good book on measurement error will discuss it, for instance John Buonaccorsi’s Measurement Error: Models, Methods, and Applications (CRC Press, 2010). I forget exactly what he says about it, but he put a lower bound of about 1% on such categorical responses being misclassified just due to slips, even when respondents are being careful. I would suspect that really casual and unimportant questions such as the ones listed here would be an order of magnitude higher, and indeed they are.

16. Stephanie says:

Can we judge anything by responses to the 4 questions in the article? Perhaps the questions about having a twin made sense in the context of the questionnaire – but maybe not. The last 2 questions are nonsensical, and respondents are answering in the spirit of the question. Ask a silly question…

Perhaps it’s important that researchers ask questions that make sense to survey takers.