Anne Pier Salverda writes:
I came across this blog entry, “An Experience with a Registered Replication Project,” and thought that you would find this interesting.
It’s written by Simone Schnall, a social psychologist who is the first author of an oft-cited Psych Science(!) paper (“Cleanliness reduces the severity of moral judgments”) that a group of researchers from Michigan State failed to replicate as part of a big replication project.
Schnall writes about her experience as the subject of a "failed replication." I like the tone of her blog entry. She discusses some issues she has with how the replication of her work, and its publication in a special issue of a journal, were handled. A lot of what she writes is very reasonable.
Schnall believes that her finding did not replicate because there were ceiling effects in a lot of the items in the (direct) replication of her study. So far so good; people can make mistakes in analyzing data from replication studies too. But then she writes the following:
It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.
She may be talking about the peer-review process here, it’s not entirely clear to me from the context. But what am I to think of her own data-analysis skills and strategies if it takes her “considerable time and effort” to discover the ceiling effects in the data? Looking at the distribution of variables is the very first thing that any researcher should do when they start analyzing their data. Step 1: exploratory data analysis! If she’s right about how this aspect of the data accounts for the failure to replicate her original finding, of course that makes for an interesting story—we need to be critical of failed replications too. But the thought that there may be a substantial number of scientists who don’t look in detail at basic aspects of their data before they start working with aggregated data makes me shake my head in disbelief.
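To make the point concrete, here is a minimal sketch of the kind of item-level check that would flag a ceiling effect before any aggregation. The column names (item_1, item_2, ...) and the response scale are hypothetical stand-ins, not the actual variables from either study:

```python
# Minimal sketch of item-level exploratory checks for ceiling effects.
# "replication_data.csv" and the item_* column names are hypothetical.
import pandas as pd

df = pd.read_csv("replication_data.csv")
items = [c for c in df.columns if c.startswith("item_")]

for item in items:
    share_at_max = (df[item] == df[item].max()).mean()
    print(f"{item}: mean={df[item].mean():.2f}, "
          f"share at scale maximum={share_at_max:.0%}")
    # A large share of responses piled up at the maximum suggests a ceiling
    # effect that averaging across items would hide.
```

A histogram of each item would do the same job; the point is simply that this check operates on the item-level data, before averaging.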
On the details of the ceiling effect, all I can say is that I’ve made a lot of mistakes in data analysis myself, so if it really took Schnall longer than it should’ve to discover this aspect of her data, I wouldn’t be so hard on her. Exploratory analysis is always a good idea but it’s still easy to miss things.
But speaking more generally about the issues of scientific communication, I disagree with Schnall’s implication that authors of published papers should have some special privileges regarding the discussion of their work. What about all the researchers who did studies where they made no dramatic claims, found no statistical significance, and then didn’t get published? Why don’t they get special treatment too? I think a big problem with the current system is that it puts published work on a plateau where it is difficult to dislodge. For example, “That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to.” My response to this is that, in many areas of research, peer reviewers do not seem to be deserving of that trust. Peer review often seems to be a matter of researchers approving other papers in their subfield, and accepting claims of statistical significance despite all the problems of multiple comparisons and implausible effect size estimates.
Schnall quotes Daniel Kahneman who writes that if replicators make no attempts to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I have mixed feelings about this. Maybe Kahneman is correct about replication per se, where there is a well-defined goal of reproducing the exact conditions of the original study, but I don’t think it should apply more generally to criticism.
Regarding the specifics of her blog post, she has a lot of discussion about how the replication is different from her study. This is all fine, but the flip side is: why should her original study be considered representative of the general population? Once you accept that effect sizes vary, this calls into question the paradigm of making quite general claims based on a study of a particular sample. Finally, I was disappointed that nowhere in her blog post does she consider the possibility that maybe, just maybe, her findings were spurious. She writes, “I have worked in this area for almost 20 years and am confident that my results are robust.” The problem is that such confidence can be self-reinforcing: once you think you’ve found something, you can keep finding it. The rules of statistical significance give a researcher enough “outs” that he or she can persist in a dead-end research paradigm for a long time. Just to be clear, I’m not saying that’s what’s happening here—I know nothing about Schnall’s work—but it can happen. Consider, for example, the work of Satoshi Kanazawa.
That said, I understand and sympathize with Schnall’s annoyance at some of the criticisms she’s received. I’ve had similar reactions to criticisms of my own work: sometimes people point out actual and serious errors, sometimes mistakes of over-generalization, and other times ways in which I’ve misconstrued the literature. Just recently my colleagues and I made some changes in a paper because someone pointed out some small ways in which it was misleading. Luckily he told us while the paper was still being revised for publication. Other times I learn about problems later and need to issue a correction. But sometimes I do get misinformed criticisms, even published papers where someone misunderstands my work and slams it. And I don’t like it. So I see where she’s coming from.
Again, from a statistical perspective, my biggest problem with what Schnall writes is that she doesn’t seem to consider, at all, the possibility that her original results could be spurious and arise from some combination of measurement error and capitalization on noise which, looked at a certain way, might include some statistically significant comparisons. I worry that she is thinking in a binary way: her earlier paper was a success, the replication was a failure, so she seeks to explain the differences. There are certainly differences between any two experiments on the same general phenomenon—that’s why we do meta-analysis—but, as Kahneman and Tversky demonstrated four decades ago, researchers tend to systematically underestimate the importance of variation in interpreting experimental results.
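To illustrate the general point (this is a toy example, not a claim about Schnall’s data), here is a small simulation of capitalization on noise: when the true effect is zero, running enough comparisons on noisy measurements will routinely turn up a few “statistically significant” results:

```python
# Toy simulation: with no true effects at all, many comparisons on noisy
# measurements still yield "significant" p-values by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_comparisons, n_per_group = 20, 40  # arbitrary illustrative numbers

significant = 0
for _ in range(n_comparisons):
    a = rng.normal(0, 1, n_per_group)  # both groups drawn from the same
    b = rng.normal(0, 1, n_per_group)  # distribution: the true effect is zero
    if stats.ttest_ind(a, b).pvalue < 0.05:
        significant += 1

print(f"{significant} of {n_comparisons} comparisons 'significant' at p < .05")
# On average about 1 in 20 comparisons clears the threshold even though
# nothing real is going on.
```

None of this proves the original finding was spurious; it only shows why the possibility deserves to be on the table alongside explanations for why the replication differed.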
Following up, Salverda writes:
In the meantime, I came across two papers. One is by Schnall (in press, I believe), a response to the paper that failed to replicate her work.
And the people who failed to replicate her work [David Johnson, Felix Cheung, and Brent Donnellan] wrote a paper that addresses her criticism.