Dan Kahan doesn’t trust the Turk

Dan Kahan writes:

I [Kahan] think serious journals should adopt policies announcing that they won’t accept studies that use M Turk samples for types of studies they are not suited for. . . . Here is my proposal:

Pending a journal’s adoption of a uniform policy on M Turk samples, the journal should oblige authors who use M Turk samples to give a full account of why the authors believe it is appropriate to use M Turk workers to model the reasoning process of ordinary members of the U.S. public. The explanation should consist of a full accounting of the authors’ own assessment of why they are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn’t be allowed to dodge the issue by boilerplate citations to studies that purport to “validate” such samples for all purposes, forever & ever. Such an account helps readers to adjust the weight that they afford study findings that use M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors’ own commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.

There are a variety of key points that authors should be obliged to address.

First, M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of US society on the reasoning of members of the U.S. general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler, and Mueller 2013). Some scholars have recommended that researchers remove from their “US only” M Turk samples those subjects who have non-US IP addresses. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US IP addresses for use on “US worker” only projects (Chandler, Mueller & Paolacci 2014). Why do the authors not view this risk as one that makes using M Turk workers inappropriate in a study like this one?

Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey 2012). One can easily “oversample” conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that such a small minority of real U.S. conservatives are willing to do. It’s easy to imagine that the M Turk “US” conservatives lack sensibilities that ordinary US conservatives normally have—such as the sort of disgust sensibilities that are integral to their political outlooks (Haidt & Hersch 2001), and that would likely deter them from participating in a “work force” a major business activity of which is “tagging” the content of on-line porn. These unrepresentative “US” conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. Is this not a concern for the authors? It is for me, and I’m sure would be for many readers trying to assess what to make of a study like this.

Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014). Readers will care about this—and about whether the authors care.

Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither warranted nor designed to be calibrated to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them and has on multiple occasions, and is completely oblivious to what researchers care about. A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.

Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning proficiency measures, which M Turk workers have been massively exposed to. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects from; they are evincing a propensity to behave in a manner that makes all of their responses highly suspect. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it w/ those who just had completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that’s information that many critical readers will find helpful in evaluating their work.

I [Kahan] feel pretty confident M Turk samples are not long for this world.

OK, so far so good. But I’d bet the other direction on whether M Turk samples (or something similar) are long for this world. Remember Gresham’s Law?

Kahan does also give a positive argument, that there is a better alternative:

Google Consumer Surveys now enables researchers to field a limited number of questions for between $1.10 & $3.50 per complete, a fraction of the cost charged by on-line firms that use valid & validated recruitment and stratification methods.

Google Consumer Surveys has proven its validity in the only way that any survey mode (random-digit dial, face-to-face, on-line) can: by predicting how individuals will actually evince their opinions or attitudes in real-world settings of consequence, such as elections. Moreover, if Google Surveys goes into the business of supplying high-quality scholarly samples, they will be obliged to be transparent about their sampling and stratification methods and to maintain them (or update them for the purposes of making them even more suited for research) over time. . . .

The problem right now w/ Google Consumer Surveys is that the number of questions is limited and so, as far as I can tell, is the complexity of the instrument that one is able to use to collect the data, making experiments infeasible.

But I predict that will change.

OK, maybe so. But it does seem to me that M Turk’s combination of low cost and low validity will make it an attractive option for many researchers.

Some background:

Don’t trust the Turk (also see discussion in comments, back from the days when the sister blog had a useful comments section)

Researchers are rushing to Amazon’s Mechanical Turk. Should they?

That latter post, by Kathleen Searles and John Barry Ryan, concludes that “platitudes such as ‘Don’t trust the Turk’ are nice, but, as is often the case in life, they are too simple to be followed.”

I actually think “Don’t trust the Turk” is a slogan, not a platitude, but I take their point, and indeed even though Searles and Ryan are broadly pro-Turk while Kahan is anti-Turk, these researchers all offer the common perspective that when evaluating a data source you need to consider the purpose for which it will be used.

P.S. Some good discussion in comments. “Don’t trust the Turk” doesn’t mean “Never use the Turk.” It means: Be aware of the Turk’s limitations. Don’t exhibit the sort of blind faith associated with the buggy-whip lobby and their purported “grounding in theory.”

18 thoughts on “Dan Kahan doesn’t trust the Turk”

  1. “But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that such a small minority of real U.S. conservatives are willing to do.”

    That’s true of anyone participating in any survey or research study, right? I don’t see how being a conservative who works on mTurk is any weirder than being a conservative who participates in a national survey panel, or a conservative who’s an undergrad at a top university and participates in research studies.

    • Nick:

      For studies of college students, sure, it’s the same issue: you’d want to be careful about generalizing to the larger population. But national surveys are different—they’re supposed to be representative. They’re far from perfect, but a representative sample is the goal. The claim of Kahan and others, I believe, is that a sample of MTurk participants is more like a sample of college students than like a random sample of Americans.

      There are also additional concerns: (1) As a researcher, you don’t even know who the MTurk participants are. (2) The MTurk participants discuss studies with each other and are generally active and might “behave in a manner that makes all of their responses highly suspect.”

      • Andrew,

        I agree wholeheartedly that mTurk samples are much closer to samples of college students than nationally representative samples from Google Consumer Surveys or another panel research company. I think the arguments about opacity in recruitment and participants influencing each other’s responses through discussing the study are especially important to consider, but as Dan points out, researchers are going to continue using mTurk because the range of experimental designs that can be implemented through one of these panel companies is severely restricted. If you want participants to interact with one another in a specific order or social network structure (much of my research), you’re stuck with using undergrads or mTurk. Not that necessity is an argument for validity – perhaps you and/or Dan would argue that we shouldn’t conduct or publish any studies using undergrad or mTurk samples whatsoever. When the alternative to using mTurk (or not doing the research at all) is to use a population of about 500 undergrads that are very similar to one another in not only their demographics but also their political beliefs, moral values, intelligence, cognitive style, etc., and almost all of them have participated in every study posted to the subject pool in the last year (and probably discussed many of them with other participants), mTurk starts to look like a much better alternative.

        I agree with Dan that anyone trying to publish results from an mTurk sample needs to address these criticisms. If you’ve tried to publish anything from an mTurk sample in the past few years, you’ve likely encountered all of these arguments before and (hopefully) have thought through them enough to justify using mTurk. No one should be able to drum up a priming study with n = 20 and slap on a boilerplate citation to a defense of mTurk’s validity. I’m not in favor of banning the publication of mTurk results, but journals should require an appendix addressing each of these points, and encourage practices like:

        * explicitly modeling interactions with other variables and estimating the effect at the level of the population
        * detecting and excluding careless responses (see the sketch after this list)
        * openly sharing raw data and analysis code
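
        To make the careless-response bullet concrete, here is a minimal sketch of the kind of screening I have in mind, assuming a hypothetical survey file with an attention-check item, a completion-time field, and a block of Likert items (the file name, column names, and cutoffs are illustrative, not from any particular study):

```python
# Sketch: flag likely careless responses using common heuristics:
# a failed attention check, implausibly fast completion, and straight-lining
# (non-differentiation) across a block of Likert items.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("survey.csv")
likert_items = ["q1", "q2", "q3", "q4", "q5"]

failed_check = df["attention_check"] != "correct"
too_fast = df["duration_seconds"] < 120          # cutoff is a judgment call
straight_lined = df[likert_items].nunique(axis=1) == 1

df["careless"] = failed_check | too_fast | straight_lined
print(f"Flagged {df['careless'].sum()} of {len(df)} respondents as careless")

analysis_sample = df[~df["careless"]]
```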

        I’ll admit I don’t know much about the sampling procedures of these panel companies. Are response rates to recruitment really better than what’s found in election polling? You’ve made me skeptical that we can ever assume the representativeness of a sample, and we should use other methods to quantify it and estimate population parameters.

  2. The problem is bigger than Mechanical Turk. The majority of psychology studies use students from the subject pool made up of the intro psych classes (I’m in the college of education, so I get preservice teachers in my subject pool). Psychologist Steven J. Heine has noted that all of these subjects are WEIRD (Western, Educated, Industrialized, Rich, Democratic), i.e., we are only really looking at a sample of less than 15% of the world’s population, and a highly non-representative one at that.

    The latest edition of the You Are Not So Smart podcast covers it:
    http://boingboing.net/2015/08/06/psychologys-unhealthy-obses.html

    Here is the link to the key paper:
    http://www.ncbi.nlm.nih.gov/pubmed/20550733

  3. M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of US society on the reasoning of members of the U.S. general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler, and Mueller 2013).

    I suspect that this problem is overstated.

    The actual finding in Shapiro, Chandler, and Mueller (2013, 214) is that 33 of 530 participants (6%) had non-U.S. IP addresses. The corresponding finding in Chandler, Mueller, and Paolacci (2014, 116) is 11 of 300 (4%). And in one of my own studies, I found that 40 of 1506 participants (3%) were from non-U.S. addresses. These numbers shouldn’t be ignored, but they also don’t rise to the level of a major concern.
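
    For concreteness, here is a minimal sketch of the IP screening at issue, assuming a hypothetical responses.csv with a geolocated ip_country column (the file and column names are illustrative, not from any of the cited studies):

```python
# Sketch: report the share of respondents whose geolocated IP address is
# outside the US, then drop them. Assumes a hypothetical responses.csv
# with an "ip_country" column (e.g., "US", "IN").
import pandas as pd

responses = pd.read_csv("responses.csv")

non_us = responses["ip_country"] != "US"
print(f"{non_us.sum()} of {len(responses)} respondents "
      f"({non_us.mean():.1%}) have non-US IP addresses")

# Keep only apparent US respondents; note this does nothing about workers
# who route through US IP addresses, which is the point raised next.
us_only = responses[~non_us]
```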

    However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US-IP addresses for use on “US worker” only projects (Chandler, Mueller & Paolacci 2014).

    It’s true that workers outside the U.S. can obtain U.S. IP addresses. But I cannot find any mention of this issue in Chandler, Mueller, and Paolacci (2014).

  4. Andrew,

    I absolutely love your blog. And of all the blogs out there, I would expect this one to give greater weight to evidence than to someone’s opinion. Kahan’s opinions about MTurk are biased and contrast with most of the relevant empirical evidence from studies conducted by people like Jeremy Freese. Kahan cites exactly one empirical study comparing results from MTurk and other samples, and showing a lack of correspondence. He ignores the overwhelming number of studies, which have been conducted in many different disciplines and focused on dissimilar topic areas, finding that you more often than not get similar results from MTurk and other types of samples (e.g., Knowledge Networks, college students) (Behrend et al., 2011; Berinsky et al., 2012; Buhrmester et al., 2011; Byrun and Szeredi, 2015; Crump et al., 2013; Enochson and Culbertson, 2015; Sprouse, 2011; Weinberg, Freese, and McElhattan, 2014).

    Kahan cites other studies talking about the characteristics of MTurkers, which are not necessarily problematic if taken into account, and the possible data quality problems with MTurk (e.g., they may lie about US residence). Of course, even outstanding data sets like the NLSY and the NCVS have issues, such as non-trivial proportions of respondents changing race or going from being very old to young over the course of the longitudinal survey. Anyway, it seems like a slight to all the scholars seriously trying to assess the costs and benefits of using MTurk to just ignore the bulk of the literature in favor of one person’s biased account.

    Justin

    • “He ignores the overwhelming number of studies, which have been conducted in many different disciplines and focused on dissimilar topic areas, finding that you more often than not get similar results from MTurk and other types of samples”

      “more often than not get similar results” doesn’t sound good enough to justify using MTurk samples — “more often than not” includes “slightly more than half the time”, and “similar results” is pretty vague.

      • And here is the conundrum. I can post no fewer than eight references (a partial list) to relevant empirical studies, and people set against MTurk will focus only on my terminology.

        Just a heads up, but you will rarely get identical results if you analyze data from two different probability surveys (e.g., Gallup, Pew, GSS, ANES). Shocking, I know. Random sampling error + question wording effects + question order effects + mode effects + interviewer effects + else = avoid survey research, unless “similar” findings are acceptable.

        • Terminology is important. I am well aware that you cannot expect identical results from different surveys, even if all are done well, but one needs some examination of whether or not results are *similar enough* to give evidence that MTurk is good enough. There was nothing in your comments mentioning such an examination. Your comments were like an executive summary; they did not include any details that would be needed to convince typical readers of this blog.

        • Typical readers of this blog would probably need to read the cited papers, and some counterpoints, in order to be convinced. I wouldn’t expect anyone to be convinced by a couple of paragraphs in a blog comment section, whether yours (Martha) or Justin’s. But it’s very helpful that Justin’s giving citations.

    • Justin:

      I pointed Kahan to your comment so we’ll see what he says. In the meantime, let me emphasize what I wrote in my above post, that even though Searles and Ryan are broadly pro-Turk while Kahan is anti-Turk, these researchers all offer the common perspective that when evaluating a data source you need to consider the purpose for which it will be used. A bit of Turk caution is appropriate given the prevalence in many research areas of the Freshman Fallacy.

      There really are papers published in top journals making quite general claims based on MTurk samples, without any recognition of any problems, I think from a naive view that sampling doesn’t matter for correlational or causal inference.

  5. Also want to add that this controversy (or whatever you want to call it) over mTurk seems to facilitate motivated reasoning on the part of researchers and especially reviewers (not trying to point the finger at Dan or anyone in particular here). If you find a result that jibes with and builds on the established literature, psychology journals are happy to publish without concerning themselves with these criticisms. But if you fail to replicate a major finding that has been shown in a sample of ~100 undergrads, then it’s “Sorry, mTurk isn’t reliable,” even if you have 10x the sample size and more valid measures.

  6. Martha,

    I worry that the theory that MTurk can’t be trusted is one of those vampirical theories that can’t be killed with evidence.

    Regardless, here are some specifics from one study by Weinberg, Freese, and McElhattan (2014), who compared MTurk (MT) to Knowledge Networks (KN):

    “In terms of the univariate distributions of these measures, no significant differences were found between the KN and MT samples on any of the items for either the racial discrimination or the reasonable accommodation vignettes (Table 2). In contrast, all the outcome measures were significantly different between the two samples for the sexual harassment vignette, with MT respondents more congenial to interpreting the scenario as harassment. If we use an OLS regression model controlling for age, the difference between the unweighted KN and MT samples is reduced from .34 points to .15, or by about 57 percent (see Table 3). Adding further controls for gender, education, race/ethnicity, marital status, and (logged) income only reduces the difference from .15 to .12, although the difference does remain statistically significant even with these controls.”

    “We find that MT experiments produce potentially better data than do the KN experiments in the narrow sense of having fewer problem respondents” [as indicated by passing comprehension checks, not speeding through surveys in less than 6 minutes, having low item-nonresponse, and not engaging in non-differentiation].

    “Our three vignette experiments encompassed ten different experimental conditions. In only three conditions were there significantly different effect sizes between the MT and KN platforms. Two of these were largely accounted for by age difference in respondents, and it is plausible the remaining difference might be explained by readily measurable differences in political ideology. For the most part, then, we would have observed substantively the same results in our experiments had we used MT instead of KN, and most of the remaining differences could have been addressed by reweighting the samples to match the known population age distribution”
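
    To illustrate what “reweighting the samples to match the known population age distribution” would involve, here is a minimal post-stratification sketch; the file name, age brackets, and population shares are made-up placeholders, not figures from Weinberg, Freese, and McElhattan (2014):

```python
# Sketch: post-stratification weights so the sample's age distribution
# matches a target population distribution, then a weighted mean outcome.
# File name, brackets, and "census" shares below are illustrative placeholders.
import pandas as pd

df = pd.read_csv("mturk_sample.csv")   # hypothetical file with "age" and "outcome" columns

bins = [18, 30, 45, 60, 120]
labels = ["18-29", "30-44", "45-59", "60+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False).astype(str)

population_share = pd.Series({"18-29": 0.21, "30-44": 0.25, "45-59": 0.26, "60+": 0.28})
sample_share = df["age_group"].value_counts(normalize=True)

# Weight each respondent by (population share) / (sample share) of their age group.
weight_by_group = population_share / sample_share
df["weight"] = df["age_group"].map(weight_by_group)

weighted_mean = (df["outcome"] * df["weight"]).sum() / df["weight"].sum()
print(f"Unweighted mean: {df['outcome'].mean():.3f}   Weighted mean: {weighted_mean:.3f}")
```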

    Of course, this is just one study. The other studies I cited were conducted at different times and focused on different variables and research questions. So there is no guarantee that MTurk will be useful today, tomorrow, or for other research questions.

    But if you had to have a prior belief that governed the peer review process, which of the following currently has the most empirical support? 1) Journals should not accept MTurk samples because they can’t be trusted; or 2) Journals should generally be open to MTurk samples, assuming authors note the limitations of the samples and there is no evidence of bias in the case of a specific study.

    • In my field (experimental psychology) one generally encounters two sorts of empirical studies: (a) low-power studies with first-year psychology undergrads; (b) high-power studies with M-Turkers. As Justin points out, a lot of evidence suggests that M-Turk studies are OK: one finds the same results, and of the same magnitude. The benefits of having more power in psychological experiments are evident. Of course researchers can do both: a lab study *and* a high-power M-Turk replication.
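
      For a rough sense of the power difference at stake, here is a quick sketch using statsmodels; the effect size (d = 0.3), alpha, and the 30-per-group comparison are assumptions for illustration, not numbers from any particular study:

```python
# Sketch: participants per group for 80% power to detect a smallish standardized
# effect (d = 0.3) at alpha = .05 in a two-group comparison, versus the power
# achieved by a typical 30-per-group lab study. Assumed numbers, for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

n_per_group = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"~{n_per_group:.0f} per group needed")   # roughly 175 per group

print(f"Power with 30 per group: "
      f"{analysis.power(effect_size=0.3, nobs1=30, alpha=0.05):.2f}")
```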

      • So, if a high powered study has the same results as a low powered study, then both studies are considered acceptable and are accepted into scientific discourse. However, if a high powered study has different results than a low powered study, then both are rejected from acceptance into the scientific discourse? I’m not sure I am following the logic here.

