[First, we show that] despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We [Simmons, Nelson, and Simonsohn] present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
Whatever you think about these recommendations, I strongly recommend you read the article. I love its central example:
To help illustrate the problem, we [Simmons et al.] conducted two experiments designed to demonstrate something false: that certain songs can change listeners’ age. Everything reported here actually happened.
They go on to present some impressive-looking statistical results, and then they go behind the curtain to show the fairly innocuous manipulations they performed to attain statistical significance.
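Here's a quick sketch of how one such move, optional stopping, plays out: run the study, peek at the p-value, and if it isn't significant, collect a few more subjects and test again. This is my own toy setup (a two-sample t-test on pure noise, with made-up sample sizes), not the authors' simulation code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_null_experiment(n_start=20, n_add=10, n_max=50, alpha=0.05):
    """Simulate one study with optional stopping under a true null:
    test after n_start subjects per group; if not significant,
    add n_add more per group and test again, up to n_max."""
    a = list(rng.normal(size=n_start))  # condition A: pure noise
    b = list(rng.normal(size=n_start))  # condition B: same distribution
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p <= alpha or len(a) >= n_max:
            return p <= alpha
        a.extend(rng.normal(size=n_add))
        b.extend(rng.normal(size=n_add))

n_sims = 10_000
rate = sum(one_null_experiment() for _ in range(n_sims)) / n_sims
print(f"false-positive rate with peeking: {rate:.3f}")
```

Both groups are drawn from the same distribution, so every rejection is a false positive, and the rate comes out well above the nominal .05 from this one bit of flexibility alone.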
A key part of the story is that, although such manipulations could be performed by a cheater, they could also seem like reasonable steps to a sincere researcher who thinks there’s an effect and wants to analyze the data a bit to understand it further.
We’ve all known for a long time that a p-value of .05 doesn’t really mean .05. Maybe it really means .1 or .2. But, as this paper demonstrates, a nominal p = .05 can often mean nothing at all. This is a big problem for studies in psychology and other fields where many different data stories are vaguely consistent with theory. The problems aren’t new, but it’s only recently that we’ve become aware of how serious they are and how little we should trust a bunch of statistically significant results.
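To put a rough number on that, here's another toy simulation, again with my own made-up setup rather than the paper's exact scenarios: the null is true, but the analyst gets to report whichever of two correlated outcome measures, or their average, happens to clear .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def flexible_null_experiment(n=20, r=0.5, alpha=0.05):
    """True null with two correlated DVs per subject; count the
    study as 'significant' if either DV, or their average,
    clears alpha in a two-sample t-test."""
    cov = [[1.0, r], [r, 1.0]]
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # condition A
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # condition B
    pvals = [
        stats.ttest_ind(a[:, 0], b[:, 0]).pvalue,      # DV 1 alone
        stats.ttest_ind(a[:, 1], b[:, 1]).pvalue,      # DV 2 alone
        stats.ttest_ind(a.mean(1), b.mean(1)).pvalue,  # their average
    ]
    return min(pvals) <= alpha

n_sims = 10_000
rate = sum(flexible_null_experiment() for _ in range(n_sims)) / n_sims
print(f"false-positive rate with a free choice of DV: {rate:.3f}")
```

And Simmons et al.'s point is that these sources of flexibility compound: stack a free choice of dependent variable on top of data peeking, optional covariates, and dropped conditions, and the nominal .05 stops meaning much of anything.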
Sanjay Srivastava has some comments here. My main comment on Simmons et al. is that I’m not so happy with the framing in terms of “false positives”; to me, the problem is not so much with null effects but with uncertainty and variation.