Several people pointed me to this paper by Anders Eklund, Thomas Nichols, and Hans Knutsson, which begins:
Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.
I’m not a big fan of the whole false-positive, false-negative thing. In this particular case it makes sense because they’re actually working with null data, but ultimately what you’ll want to know is what’s happening to the estimates in the more realistic case that there are nonzero differences amidst the noise. The general message is clear, though: don’t trust fMRI p-values. And let me also point out that this is yet another case of a classical (non-Bayesian) method that is fatally assumption-based.
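To make the null-data logic concrete, here is a minimal sketch in Python. It is not the analysis from the paper, which concerns the actual fMRI software packages run on real resting-state scans. It is a toy under assumptions of my own choosing: pure-noise series of 200 observations, an AR(1) correlation of 0.5, and a naive test that treats the observations as independent. The function name and all settings are hypothetical. The point is only that when a test's assumptions fail, the false-positive rate you measure on null data can sit far above the nominal 5%.

```python
import numpy as np

def false_positive_rate(rho, n_obs=200, n_sims=5000, seed=0):
    """Share of pure-noise datasets that a naive test calls significant.

    Toy illustration only (all settings are assumptions): each dataset is
    a zero-mean AR(1) series with correlation `rho`, and the test wrongly
    treats the observations as independent.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_sims, n_obs))

    # Build stationary AR(1) series with unit variance and true mean zero.
    x = np.empty_like(noise)
    x[:, 0] = noise[:, 0]
    for t in range(1, n_obs):
        x[:, t] = rho * x[:, t - 1] + np.sqrt(1 - rho**2) * noise[:, t]

    # Naive z-statistic for "mean differs from zero", assuming independence.
    z = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n_obs))

    # Nominal two-sided 5% threshold (normal approximation).
    return float(np.mean(np.abs(z) > 1.96))

independent = false_positive_rate(rho=0.0)
correlated = false_positive_rate(rho=0.5)
print(f"assumptions hold:     {independent:.3f}")
print(f"assumptions violated: {correlated:.3f}")
```

With independent noise the rate should land near the nominal 5%; with the correlated noise it should come out several times higher, even though there is no effect in either case.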
Perhaps the most disturbing thing about this study is how unsurprising it all is. In one sense, it’s big big news: fMRI is a big part of science nowadays, and if it’s all being done wrong, that’s a problem. But, from another perspective, it’s no surprise at all: we’ve been hearing about “voodoo correlations” in fMRI for nearly a decade now, and I didn’t get much sense that the practitioners of this sort of study were doing much of anything to clean up their act. I pretty much don’t believe fMRI studies on the first try, any more than I believe “gay gene” studies or various other headline-of-the-week auto-science results.
What to do? Short-term, one can handle the problem of bad statistics by insisting on preregistered replication, thus treating traditional p-value-based studies as screening exercises. But that’s a seriously inefficient way to go: if you don’t watch out, your screening exercises are mostly noise, and then you’re wasting your effort with the first study, then again with the replication.
On the other hand, if preregistered replication becomes a requirement for an fMRI study to be taken seriously (I’m looking at you, PPNAS; I’m looking at you, Science and Nature and Cell; I’m looking at you, TED and NIH and NPR), then it won’t take long before researchers themselves realize they’re wasting their time.
The next step, once researchers learn to stop bashing their heads against the wall, will be better data collection and statistical analysis. When the motivation for spurious statistical significance goes away, there will be more motivation for serious science.
Something needs to be done, though. Right now the incentives are all wrong. Why not do a big-budget fMRI study? In many fields, this is necessary for you to be taken seriously. And it’s not like you’re spending your own money. Actually, it’s the opposite: at least within the university, when you raise money for a big-budget experiment, you’re loved, because the university makes money on the overhead. And as long as you close your eyes to the statistical problems and move so fast that you never have to see the failed replications, you can feel like a successful scientist.
The other thing that’s interesting is how this paper reflects divisions within PPNAS. On one hand you have editors such as Susan Fiske or Richard Nisbett who are deeply invested in the science-as-routine-discovery-through-p-values paradigm; on the other, you have editors such as Emery Brown (editor of this particular paper; full disclosure, I know Emery from grad school) who as a statistician has a more skeptical take and who has nothing to lose by pulling the house down.
Those guys at Harvard (but not in the statistics department!) will say, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” But they’re innumerate, and they’re wrong. Time for us to move on, time for the scientists to do more science and for the careerists to find new ways to play the game.
P.S. An economist writes in:
I wanted to provide a bit more context/background for your recent fMRI post. It went from a short comment to something much longer. Unfortunately, this is another time that a sensational headline misrepresents the actual content of the paper. I recently left academia and started a blog (among other things) but still have a few things far enough along that they might be published one day.