
“A bug in fMRI software could invalidate 15 years of brain research”


About 50 people pointed me to this press release or the underlying PPNAS research article, “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates,” by Anders Eklund, Thomas Nichols, and Hans Knutsson, who write:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

This is all fine (I got various emails with lines such as, “Finally, a PPNAS paper you’ll appreciate”), and I’m guessing it won’t surprise Vul, Harris, Winkielman, and Pashler one bit.

I continue to think that the false-positive, false-negative thing is a horrible way to look at something like brain activity, which is happening all over the place all the time. The paper discussed above looks like a valuable contribution, and I hope people follow up by studying the consequences of these fMRI issues using continuous models.


  1. Garrett M says:

    But without a framework for identifying false-positives, how could I be 95% confident that your brain is actually capable of thought? ;p

  2. Alex says:

    If you want a slightly more accurate cat-fMRI picture, go through the slide show at until you find the one labeled ‘Tony Tiger’.

  3. Xi'an says:

    Andrew, please correct the spelling: Thomas Nichols (University of Warwick).

    Tom gave a talk about this issue and the subsequent reactions at the CRiSM workshop in which you virtually took part.

  4. Garnett says:

    I believe that this statement “These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results”

    has been edited in the original paper (see the correction) to read

    “These results question the validity of a number of fMRI studies and may have a large impact on the interpretation of weakly significant neuroimaging results.”

    I believe the reason is that relatively few of the 40,000 studies used the problematic methodology.

  5. Eytan Adar says:

    Worth reading about how the authors tried to correct the article and how PPNAS initially rejected the errata:

    As well as the added bibliographic analysis they did:

  6. Mike Lawrence says:

    As I understand it, the problem was that they were using a spatial smoother with a Gaussian tail when the data were in fact heavy-tailed. This could be handled in Stan by implementing a Gaussian process with a covariance function that includes a “tail-heaviness” parameter (like the df parameter in the Student-t) and then letting the model do inference on this parameter.

  7. I think the description “A bug in fMRI software” is unfortunate.

    As I understand it, the problem isn’t that the software had errors in its implementation of the algorithms or that the algorithms didn’t produce the intended estimator. The problem is that the model was wrong.

    • Anoneuoid says:

      > “The problem is that the model was wrong.”

      Yep, this issue has been very mischaracterized, even by those who discovered it. It isn’t a bug, it isn’t a “false positive” problem. It is a “true positive” problem. The stats machinery correctly identified their null model as a poor fit for the data.

  8. At the risk of self-promotion, along with a few others I put out a preprint arguing that, while in many cases family-wise error (FWE) control apparently wasn’t achieved using parametric random field theory (RFT), FWE is perhaps an unnecessarily high bar to clear. We propose that control of the false discovery rate is a more natural target, and we introduce a nonparametric method to accomplish this. When we look at the same task data considered in this paper, we see that things are not quite as bleak as they appear (similar to what others, including the authors, have since suggested).
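
To make the FWE versus false discovery rate distinction in the last comment concrete, here is a minimal sketch in Python. It is not the nonparametric method from the preprint, and it has nothing to do with random field theory: it only contrasts the textbook Bonferroni correction (which controls FWE) with the classic Benjamini-Hochberg step-up procedure (which controls the false discovery rate). The function names and the list of p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Classic Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, invented purely for illustration.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

n_fwe = sum(bonferroni(example_p))
n_fdr = sum(benjamini_hochberg(example_p))
print("Bonferroni rejections:", n_fwe)
print("Benjamini-Hochberg rejections:", n_fdr)
```

On these made-up numbers the FDR procedure rejects at least as many hypotheses as Bonferroni, which is the sense in which FWE is the higher bar. Real fMRI analyses work with spatially correlated voxels and clusters, so this independent-tests sketch shows only the logic of the two targets, not a usable neuroimaging pipeline.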
