I don’t know what ATR is but I’m glad somebody is on the job of prohibiting replication catastrophe:
Seriously, though, I’m on a list regarding a reproducibility project, and someone forwarded along this blog by psychology researcher Simone Schnall, whose attitudes we discussed several months ago in the context of some controversies about attempted replications of some of her work in social psychology.
I’ll return at the end to my remarks from July, but first I’d like to address Schnall’s recent blog, which I found unsettling. There are some technical issues that I can discuss:
1. Schnall writes: “Although it [a direct replication] can help establish whether a method is reliable, it cannot say much about the existence of a given phenomenon, especially when a repetition is only done once.” I think she misses the point that, if a replication reveals that a method is not reliable (I assume she’s using the word “reliability” in the sense that it’s used in psychological measurement, so that “not reliable” would imply high variance), then it can also reveal that an original study, which at first glance seemed to provide strong evidence in favor of a phenomenon, really doesn’t. The Nosek et al. “50 shades of gray” paper is an excellent example.
2. Her discussion of replication of the Stroop effect also seems to miss the point, or at least so it seems to me. To me, it makes sense to replicate effects that everyone believes, as a sort of “active control” on the whole replication process. Just as it also makes sense to do “passive controls” and try to replicate effects that nobody thinks can occur. Schnall writes that in the choice of topics to replicate, “it is irrelevant if an extensive literature has already confirmed the existence of a phenomenon.” But that doesn’t seem quite right. I assume that the extensive literature on Stroop is one reason it’s been chosen to be included in the study.
The problem, perhaps, is that she seems to see the goal of replication as a goal to shoot things down. From that standpoint, sure, it seems almost iconoclastic to try to replicate (and, by implication, shoot down) Stroop, a bit disrespectful of this line of research. But I don’t see any reason why replication should be taken in that way. Replication can, and should, be a way to confirm a finding. I have no doubt that Stroop will be replicated: I’ve tried the Stroop test myself (before knowing what it was about) and the effect was huge, and others confirm this experience. This is a large effect in the context of small variation. I guess that, with some great effort, it would be possible to design a low-power replication of Stroop (maybe use a monochrome image, embed it in a between-person design, and run it on Mechanical Turk with a tiny sample size?), but I’d think any reasonable replication couldn’t fail to succeed. Indeed, if Stroop weren’t replicated, this would imply a big problem with the replication process (or, at least, with that particular experiment). But that’s the point; that’s one reason for doing this sort of active control. The extensive earlier literature is not irrelevant at all!
3. Also I think her statement, “To establish the absence of an effect is much more difficult than the presence of an effect,” misses the point. The argument is not that certain claimed effects are zero but rather that there is no strong evidence that they represent general aspects of human nature (as is typically claimed in the published articles). If an “elderly words” stimulus makes people walk more slowly one day in one lab, and more quickly another day in another lab, that could be interesting but it’s not the same as the original claim. And, in the meantime, critics are not claiming (or should not be claiming) an absence of any effect but rather they (we) are claiming to see no evidence of a consistent effect.
In her post, Schnall writes, “it is not about determining whether an effect is “real” and exists for all eternity; the evaluation instead answers a simply question: Does a conclusion follow from the evidence in a specific paper?”—so maybe we’re in agreement here. The point of criticism of all sorts (including analysis of replication) can be to address the question, “Does a conclusion follow from the evidence in a specific paper?” Lots of statistical research (as well as compelling examples such as that of Nosek et al.) has demonstrated that simple p-values are not always good summaries of evidence. So we should all be on the same side here: we all agree that effects vary, none of us is trying to demonstrate that an effect exists for all eternity, none of us is trying to establish the absence of an effect. It’s all about the size and consistency of effects, and critics (including me) argue that effects are typically a lot smaller and a lot less consistent than are claimed in papers published by researchers who are devoted to these topics. It’s not that people are “cheating” or “fishing for significance” or whatever, it’s just that there’s statistical evidence that the magnitude and stability of effects are overestimated.
4. Finally, here’s a statement of Schnall that really bothers me: “There is a long tradition in science to withhold judgment on findings until they have survived expert peer review.” Actually, that statement is fine with me. But I’m bothered by what I see as an implied converse, that, once a finding has survived expert peer review, it should be trusted. Ok, don’t get me wrong, Schnall doesn’t say that second part in this most recent post of hers, and if she agrees with me (that is, if she does not think that peer-reviewed publication implies that a study should be trusted), that’s great. But her earlier writings on this topic give me the sense that she believes that published studies, at least in certain fields of psychology, should get the benefit of the doubt: that, once they’ve been published in a peer-reviewed publication, they should stand on a plateau and require some special effort to be dislodged. So when Study 1 says one thing and pre-registered Study 2 says another, she seems to want to give the benefit of the doubt to Study 1. But I don’t see that.
Different fields, different perspectives
A lot of this discussion seems somehow “off” to me. Perhaps this is because I do a lot of work in political science. And almost every claim in political science is contested. That’s the nature of claims about politics. As a result, political scientists do not expect deference to published claims. We have disputes, sometimes studies fail to replicate, and that’s ok. Research psychology is perhaps different in that there’s traditionally been a “we’re all in this together” feeling, and I can see how Schnall and others can be distressed that this traditional collegiality has disappeared. From my perspective, the collegiality could be restored by the simple expedient of researchers such as Schnall recognizing that the patterns they saw in particular datasets might not generalize to larger populations of interest. But I can see how some scholars are so invested in their claims and in their research methods that they don’t want to take that step.
I’m not saying that political science is perfect, but I do think there are some differences in that poli sci has more of a norm of conflict whereas it’s my impression that research psychology has more of the norms of a lab science where repeated experiments are supposed to give identical results. And that’s one of the difficulties.
If scientist B fails to replicate the claims of scientist A who did a low-power study, my first reaction is: hey, no big deal, data are noisy, the patterns in the sample do not generally match the patterns in the population, certainly not if you condition on “p less than .05.” But a psychology researcher trained in this lab tradition might not be looking at sampling variability as an explanation (nowhere in Schnall’s blogs did I see this suggested as a possible source of the differences between original reports and replications) and, as a result, they can perceive a failure to replicate as an attack on the original study, to which it’s natural for them to respond by attacking the replication. But once you become more attuned to sampling and measurement variation, failed replications are to be expected all the time; that’s what it means to do a low-power study.
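To make the low-power point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and not taken from any of the studies under discussion: a true effect of 0.2 standard deviations, 20 people per group, a known-variance z-test, and the hypothetical function names `simulate_study` and `replication_summary`. It runs many "original" studies, keeps only those that reach p less than .05, and then runs one same-sized replication of each.

```python
import random
import statistics
from math import sqrt


def simulate_study(true_effect, n, rng):
    """Simulate one two-group study with unit-variance outcomes.

    Returns (estimate, z), where z uses the known-variance standard error.
    """
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    se = sqrt(2.0 / n)  # simplifying assumption: outcome variance is known to be 1
    return estimate, estimate / se


def replication_summary(true_effect=0.2, n=20, n_sims=5000, seed=1):
    """Condition on a significant original, then attempt one replication."""
    rng = random.Random(seed)
    significant = 0
    replicated = 0
    exaggeration = []
    for _ in range(n_sims):
        est, z = simulate_study(true_effect, n, rng)
        if abs(z) <= 1.96:
            continue  # original study not "significant"; it is never published
        significant += 1
        # How much does the published estimate overstate the true effect?
        exaggeration.append(abs(est) / true_effect)
        # Same-sized replication of the same true effect.
        _, z_rep = simulate_study(true_effect, n, rng)
        if abs(z_rep) > 1.96 and (z_rep > 0) == (z > 0):
            replicated += 1
    return {
        "power": significant / n_sims,
        "replication_rate": replicated / significant,
        "mean_exaggeration": statistics.fmean(exaggeration),
    }


if __name__ == "__main__":
    for name, value in replication_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the standard error is sqrt(2/20), about 0.32, so any estimate that clears the 1.96 threshold must exceed roughly 0.62 in absolute value, more than three times the assumed true effect. The significant originals are exaggerated by construction, and most same-sized replications will fail to reach significance even though the effect is real and nobody did anything wrong. Changing `true_effect` or `n` shows how quickly the picture improves as power goes up.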