The Open Science Collaboration, a team led by psychology researcher Brian Nosek, organized the replication of 100 published psychology experiments. They report:
A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.
“Despite” is a funny way to put it. Given the statistical significance filter, we’d expect published estimates to be overestimates. And then there’s the garden of forking paths, which only makes things worse. It would be meaningless to try to obtain a general value for the “Edlin factor,” but it’s gotta be less than 1, so of course exact replications should produce weaker evidence than was claimed in the original studies.
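Just to make the significance filter concrete, here’s a minimal simulation sketch. The numbers are toy values I made up for illustration, nothing to do with the actual studies in the project: each “study” gets an unbiased but noisy estimate of a small true effect, and only the estimates that reach p < 0.05 get “published.”

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.2      # hypothetical true effect (standardized units)
se = 0.15              # hypothetical standard error of each study's estimate
n_studies = 100_000

# Each "study" reports an unbiased but noisy estimate of the true effect.
estimates = rng.normal(true_effect, se, size=n_studies)

# The significance filter: only estimates with |z| > 1.96 get "published."
published = estimates[np.abs(estimates / se) > 1.96]

print(f"true effect:             {true_effect:.2f}")
print(f"mean of all estimates:   {estimates.mean():.2f}")   # close to 0.20, unbiased
print(f"mean of published ones:  {published.mean():.2f}")   # substantially larger
print(f"share reaching p < .05:  {len(published) / n_studies:.0%}")
```

An exact replication of one of the “published” studies is, on average, estimating the true effect, not the inflated published number, so weaker evidence on replication is exactly what the filter predicts.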
Things may change if and when it becomes standard to report Bayesian inferences with informative priors, but as long as researchers are reporting selected statistically-significant comparisons—and, no, I don’t think that’s about to change, even with the publication and publicity attached to this new paper—we can expect published estimates to be overestimates.
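For a sense of what the informative-prior version could look like, here’s a rough sketch of a standard normal-normal Bayesian update. The numbers are invented for illustration and don’t correspond to any particular study: a just-significant published estimate gets shrunk most of the way toward zero by a prior that says effects in this literature tend to be small.

```python
import numpy as np

# Hypothetical published result: estimate 0.50 with standard error 0.24,
# i.e. just barely statistically significant (z a bit above 2).
y, se = 0.50, 0.24

# Informative normal prior saying effects in this area are typically small:
# mean 0, sd 0.15.  These values are made up for the sketch.
prior_mean, prior_sd = 0.0, 0.15

# Conjugate normal-normal update: the posterior mean is a precision-weighted
# average of the prior mean and the data estimate.
post_precision = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + y / se**2) / post_precision
post_sd = np.sqrt(1 / post_precision)

print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}")
# The reported 0.50 shrinks to about 0.14 +/- 0.13: the prior pulls the noisy,
# barely-significant estimate most of the way back toward zero.
```

The point isn’t the particular numbers; it’s that an informative prior automatically does the kind of discounting that the significance-filter argument says we should be doing anyway.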
That said, even though these results are no surprise, I still think they’re valuable.
As I told Monya Baker in an interview for a news article, “this new work is different from many previous papers on replication (including my own) because the team actually replicated such a large swathe of experiments. In the past, some researchers dismissed indications of widespread problems because they involved small replication efforts or were based on statistical simulations. But they will have a harder time shrugging off the latest study; the value of this project is that hopefully people will be less confident about their claims.”
Nosek et al. provide some details in their abstract:
The mean effect size of the replication effects was half the magnitude of the mean effect size of the original effects, representing a substantial decline. Ninety-seven percent of original studies had significant results. Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.
This is all fine; again, the general results are no surprise, but it’s good to see some hard numbers from real experiments. The only thing that bothers me in the quoted passage is the phrase, “if no bias in original results is assumed . . .” Of course there is bias in the original results (see discussion above), so this just seems like a silly assumption to make. I think I know where the authors are coming from—they’re saying that even if there were no bias, there’d be problems—but really the no-bias assumption makes no sense given the statistical significance filter, so this step seems unnecessary.
Anyway, great job! This was a big effort and it deserves all the publicity it’s getting.
Disclaimer: I am affiliated with the Open Science Collaboration. I’m on the email list, and at one point I was one of the zillion authors of the article. At some point I asked to be removed from the author list, as I felt I hadn’t done enough—I didn’t do any replication, nor did I do any data analysis; all I did was participate in some of the online discussions. But I do feel generally supportive of the project and am happy to be associated with it in whatever way I still am.