Don’t trust the Turk

Dan Kahan gives a bunch of reasons not to trust Mechanical Turk in psychology experiments, in particular when studying “hypotheses about cognition and political conflict over societal risks and other policy-relevant facts.”


  1. Jay Verkuilen says:

    A lot of people have been warning about the overuse of Shane Frederick’s three item quiz for a LONG time.

  2. Anonymous says:

    Seeing this kind of post makes me feel that Judea has a point. We need a formal syntax for assumptions underlying causal inference and generalizability.

    Pointing out these problems via longform blogpost exposition is always helpful, but it’s also like playing whack-a-mole. For every study that gets called out there’s another ten researchers who don’t have a clue what they’re doing.

  3. Rahul says:

    Is there a good reason Mechanical Turk was accepted as a legit way to run psychology experiments? It does sound like a stupid and silly idea.

    Is this only a backdoor for researchers to churn out sensationalist, fast-track papers at minimal cost? What’s a good reason to not outright stop publishing such studies?

    • Mark says:

      Mturk studies replicate studies that use other samples as well. For example, a paper by Chambers et al in Psych Science uses student samples, mturk samples, and the ANES and finds very similar results across studies. Interestingly, this study shows similarities between libs and cons, something Kahan seems to suggest is difficult to do on Mturk.

      (Some of) the problems with Mturk seem to arise when well-known and oft-used paradigms that rely on naive participants are used in a sample with individuals familiar with the paradigm. As mentioned above, Shane Frederick’s three-item quiz is problematic (Mturkers are either smarter than the general population or they know the answers from rote memorization), as are common moral dilemmas adopted in moral psychology (e.g., the so-called trolley problems) or other common experimental manipulations (e.g., some easy-to-administer power recall manipulations).

      Although selection issues can arise as Kahan discusses (same is true for student samples, but which sample is worse, I don’t know), I don’t see such a problem with participants answering similar questions about policy etc because these are issues that people (at least politically interested people) think about and assess often with or without surveys…Plus longitudinal panel studies ask participants similar policy relevant questions all the time (e.g. ANES, GSS) and I don’t think we want to throw those away.

  4. Hal Pashler says:

    People sliming Mechanical Turk as a method of behavioral data collection should study the figures in this recent paper, which show exquisite replications of many detailed result patterns in basic cognitive psychology:

    These are studies using within-subject designs and repeated measures, so they have an excellent level of precision. This is a field where a high degree of replicability has been the rule, not the exception, over the years. People doing this kind of research often DO take the time to replicate each other (because it is cheap to do so) and they generally have no trouble doing so.

    So this is exactly the right place to look if you want a real test of the potential quality of a new data source like MT, and Gureckis data are very reassuring. Many other cognitive labs (including mine) have also made parallel tests of lab and web data and been very reassured by the results.

    I am not surprised at all to hear that people who do between-subject designs where they have very low power in the lab and on the web find that their results bop around all over the place. Check out Dance of the P-Values if you want to understand why that is happening to you–but don’t blame web data collection, which–used properly–can help advance psychological research a lot.

    (Of course, particular problems like Ss lying about country of origin, repeat subjects, etc. are real challenges that need to be met.)

    • Rahul says:

      This is the crux not a parenthetical footnote:

      Of course, particular problems like Ss lying about country of origin, repeat subjects, etc. are real challenges that need to be met.

    • dmk38 says:

      Hi, Hal.

      I think this might be one of those situations in which the discussion is getting a bit unmoored from the matter that was the subject of the post—in this case, what I wrote about my concern with MTurk samples. I’m pretty sure I agree with you, and there’s nothing in what you’ve said that is at odds with what I wrote.

      I was careful to point out that the question of sample validity depends on the psychological dynamic being investigated. Indeed, I broke my discussion into two parts in order to make this super clear.

      In the first, I stressed that the sample validity issues posed by studies that examine psychological dynamics understood to be invariant across people are very different from ones that turn on individual differences. For the former, any group of people — so long as they haven’t experienced some sort of mental trauma etc. — should be fine. That’s why studies that use college students to examine dynamics like “perceptual continuity” etc. are just fine.

      If one is testing hypotheses about individual differences, however, then things are slightly different. The sample doesn’t have to be “representative” of the general population, necessarily (indeed, that won’t even be sufficient in all cases, given power concerns). However, the researcher must be sure (a) that the sample contains a sufficient number of the members of the groups in which the relevant individual differences in cognition are thought to exist (the power issue); and (b) that the members of the sample weren’t recruited or selected in a manner that could underrepresent typical members of the group or overrepresent atypical ones (a selection bias issue).

      For reasons I explain in the second post, I think there is good reason to believe that MT samples don’t satisfy (a) & (b) *if* what one is interested in studying is the relationship between ideology & motivated reasoning with regard to evidence bearing on climate change and other issues that turn on disputed empirical facts.

      First, the composition of MT samples is skewed toward liberals in a manner that suggests that “typical” conservatives are not participating in the studies.

      Second, there is evidence, reported in Shapiro, D. N., Chandler, J., & Mueller, P. A. Using Mechanical Turk to Study Clinical Populations. Clinical Psychological Science (advance on-line 2013), that many MT workers participating in “US only” studies are using foreign IP addresses, a phenomenon that other researchers have reported and that suggests that many of the subjects are in fact non US residents (the IP addresses are predominantly Indian ones).

      Third, Chandler, Mueller & Paolacci, Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers, Behavior Research Methods (advance on-line 2013), present evidence that M Turk study subjects have been repeatedly exposed to performance-based measures of cognition the validity of which depends on nonexposure, and that as a result they score higher than one would expect among naïve subjects. This problem has also been observed by plenty of researchers.

      Under these circumstances, one has to wonder whether the “individual differences” being reported in studies that use M Turk samples generalize to the members of the U.S. public who are politically polarized on climate change, gun control, etc.

      In contrast, the sorts of cognitive processes examined in the paper you cite (reaction time, the Stroop effect, & other types of experiment that differ from simple surveys in that, in the authors’ words, they demand “sustained attention from participants, comprehension of complex instructions, and millisecond accuracy for response”) are in the first category. I accept that M Turk might be fine for that (although the authors of the paper you cite did note that “while most of replications were qualitatively successful and validated the approach of collecting data anonymously online using a web-browser, others revealed disparity between laboratory results and online results”).

      Indeed, what bothers me is the mechanical “one for all” mentality in which papers like the one you cite are held up as establishing in a categorical fashion that “M Turk samples are valid”—thus preempting critical attention to sample validity issues in review.

      Of course, it would bother me too if the same sort of mechanical approach were applied to what I said – so that people understood me to be saying that b/c M Turk isn’t valid (in my view) for studying one sort of dynamic, M Turk samples aren’t valid for studying anything!

      I’m curious to know whether you disagree with the points that *I* actually made in my post. Indeed, I’d be delighted to publish a guest post by you discussing M Turk sample validity issues!


      • sd says:

        to the point about non-naive m-turk subjects: I find it hard to believe that psych undergrads participating in dozens of studies for credit (and studying psychology!) are somehow even more naive than that stay at home dad filling in mech-turk surveys in the afternoon. that would be very sad.

        • Jay Verkuilen says:

          Subject pool subjects usually aren’t, but recruitment is often done to get people who are naive. The main responsible use of a subject pool is to beta-test your experiment. Thus the data analysis isn’t taken all that seriously, but you’re looking for places where the subjects get tripped up. And of course for more basic cognitive processes they’re reasonable, but of course very constant on age and education levels.

          What’s more troublesome IMO is that there’s a lot of volunteering bias going on among what we at Illinois called “elevator subjects”, because experiments for pay were usually advertised on the Psych Building’s elevator. Many of them have been in numerous psych experiments, often of the same sort in the same lab. I haven’t a clue who they are but they go over and over.

    • Jay Verkuilen says:

      In a psychophysics-type study it’s not unusual for there to be a few subjects who are the authors and their grad students. Such basic psychological effects are usually pretty well universal. Questions that involve things like policy preferences, sexuality, or anything that’s heavily freighted with culture are going to run into some serious issues. Then there are context effects: People who go to the wall in the dictator game in one context may well not in others. That doesn’t mean that MTurk is useless, but that it’s limited. Most of our samples are.

      A Slate article citing the fact that college students in the West are WEIRD:

      And a one-pager in Science:

      And the BBS article:

  5. […] regression adjustment) to correct for differences between sample and population. If the data are crap, it’ll be hard to trust anything that comes out of your analysis, but multilevel modeling […]