Dan Kahan writes:
I [Kahan] think serious journals should adopt policies announcing that they won’t accept studies that use M Turk samples for types of studies they are not suited for. . . . Here is my proposal:
Pending a journal’s adoption of a uniform policy on M Turk samples, the journal should oblige authors who use M Turk samples to give a full account of why the authors believe it is appropriate to use M Turk workers to model the reasoning process of ordinary members of the U.S. public. The explanation should consist of a full accounting of the authors’ own assessment of why they are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn’t be allowed to dodge the issue by boilerplate citations to studies that purport to “validate” such samples for all purposes, forever & ever. Such an account helps readers to adjust the weight that they afford study findings that use M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors’ own commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.
There are a variety of key points that authors should be obliged to address.
First, M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of US society on the reasoning of members of the U.S. general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler & Mueller 2013). Some scholars have recommended that researchers remove from their “US only” M Turk samples those subjects who have non-US IP addresses [see the sketch after the quote]. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US IP addresses for use on “US worker only” projects (Chandler, Mueller & Paolacci 2014). Why do the authors not view this risk as one that makes using M Turk workers inappropriate in a study like this one?
Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey 2012). One can easily “oversample” conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that only a small minority of real U.S. conservatives are willing to do. It’s easy to imagine that the M Turk “US” conservatives lack sensibilities that ordinary US conservatives normally have—such as the sort of disgust sensibilities that are integral to their political outlooks (Haidt & Hersh 2001), and that would likely deter them from participating in a “work force,” a major business activity of which is “tagging” the content of on-line porn. These unrepresentative “US” conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. Is this not a concern for the authors? It is for me, and I’m sure would be for many readers trying to assess what to make of a study like this.
Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014). Readers will care about this—and about whether the authors care.
Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither warranted nor designed to be calibrated to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them (and has on multiple occasions), and is completely oblivious to what researchers care about. A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.
Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning proficiency measures, which M Turk workers have been massively exposed to. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects from; they are evincing a propensity to behave in a manner that makes all of their responses highly suspect. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it w/ those who had just completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that’s information that many critical readers will find helpful in evaluating their work.
I [Kahan] feel pretty confident M Turk samples are not long for this world.
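A side note on Kahan’s first point: the IP-address screen he mentions is, in practice, nothing more than a filter on a geolocated country code. Here is a minimal sketch in Python of what such a screen might look like (the file name, the ip_country column, and the assumption that the IP-to-country lookup has already been run are all my illustrative assumptions, not Kahan’s or any particular platform’s):

    # Sketch of the "drop non-US IP addresses" screen described in Kahan's first point.
    # Assumes a CSV of completed responses with a precomputed "ip_country" column;
    # the file and column names here are hypothetical.
    import pandas as pd

    responses = pd.read_csv("mturk_responses.csv")

    us_only = responses[responses["ip_country"] == "US"]
    n_dropped = len(responses) - len(us_only)
    print(f"kept {len(us_only)} responses, dropped {n_dropped} with non-US IP addresses")

    # The catch, per Kahan: a worker routing traffic through a US proxy or VPN shows up
    # with ip_country == "US", so passing this screen says little about where the
    # respondent actually is.

The screen is only as good as the IP address it sees, which is exactly the worry.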
OK, so far so good. But I’d bet the other direction on whether M Turk samples (or something similar) are long for this world. Remember Gresham’s Law?
Kahan does also give a positive argument, that there is a better alternative:
Google Consumer Surveys now enables researchers to field a limited number of questions for between $1.10 & $3.50 per complete – a fraction of the cost charged by on-line firms that use valid & validated recruitment and stratification methods.
Google Consumer Surveys has proven its validity in the only way that a survey mode – random-digit dial, face-to-face, on-line – can: by predicting how individuals will actually evince their opinions or attitudes in real-world settings of consequence, such as elections. Moreover, if Google Surveys goes into the business of supplying high-quality scholarly samples, they will be obliged to be transparent about their sampling and stratification methods and to maintain them (or update them for the purposes of making them even more suited for research) over time. . . .
The problem right now w/ Google Consumer Surveys is that the number of questions is limited and so, as far as I can tell, is the complexity of the instrument that one is able to use to collect the data, making experiments infeasible.
But I predict that will change.
OK, maybe so. But it does seem to me that M Turk’s combination of low cost and low validity will make it an attractive option for many researchers.
Don’t trust the Turk (also see discussion in comments, back from the days when the sister blog had a useful comments section)
That latter post, by Kathleen Searles and John Barry Ryan, concludes that “platitudes such as ‘Don’t trust the Turk’ are nice, but, as is often the case in life, they are too simple to be followed.”
I actually think “Don’t trust the Turk” is a slogan, not a platitude, but I take their point. And indeed, even though Searles and Ryan are broadly pro-Turk while Kahan is anti-Turk, all these researchers share the perspective that, when evaluating a data source, you need to consider the purpose for which it will be used.
P.S. Some good discussion in comments. “Don’t trust the Turk” doesn’t mean “Never use the Turk.” It means: Be aware of the Turk’s limitations. Don’t exhibit the sort of blind faith associated with the buggy-whip lobby and their purported “grounding in theory.”