
You’ll get a high Type S error rate if you use classical statistical methods to analyze data from underpowered studies

Brendan Nyhan sends me this article from the research-methods all-star team of Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, and Marcus Munafo:

A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.

I agree completely. In my terminology, with small sample size, the classical approach of looking for statistical significance leads to a high rate of Type S error. Indeed, this is a theme of my paper with Weakliem (along with much earlier literature in psychology research methods). I’d love this stuff even more if they stopped using the word “power,” which unfortunately is strongly tied to the not-so-useful notion of statistical significance. Also, I didn’t notice whether they mentioned the statistical significance filter—the problem that statistically significant results tend to have high Type M errors. In any case, it’s good to see this stuff getting further attention. I also think it would be useful for them to go further and provide guidance on how to better analyze data from small samples. Saying not to design low-power studies is fine, but once you have the data there’s no point in ignoring what you have.
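The Type S / Type M point can be illustrated with a quick simulation (a minimal sketch with made-up numbers: a small true effect of 0.1 on a scale where the standard error is 0.5, so the study is badly underpowered):

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # hypothetical small true effect
se = 0.5            # hypothetical standard error -> power is very low
n_sims = 100_000

est = rng.normal(true_effect, se, n_sims)   # classical point estimates
z = est / se
significant = np.abs(z) > 1.96              # the "p < 0.05" filter

# Among the significant results: how often is the sign wrong (Type S),
# and by how much is the magnitude exaggerated (Type M)?
type_s_rate = np.mean(est[significant] < 0)
exaggeration = np.mean(np.abs(est[significant])) / true_effect

print(f"share significant:  {significant.mean():.3f}")
print(f"Type S error rate:  {type_s_rate:.3f}")
print(f"exaggeration ratio: {exaggeration:.1f}x")
```

With these (invented) numbers, a result has to be wildly large to clear the significance hurdle, so the significant estimates are both badly inflated and, a substantial fraction of the time, of the wrong sign.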


  1. Anonymous says:

    isn’t “the statistical significance filter” and “the winner’s curse” basically the same thing (fig 5)?

  2. Mayo says:

    The real problem is taking the observed difference as a good indication of a population discrepancy: the inference has terrible severity. I’ve commented on this issue in papers (under fallacy of rejection), and in a recent comment on my blog:

  3. John says:

    I’m particularly concerned about the last point raised by Andrew. If we cannot find a statistically significant result to a null hypothesis at the 0.05 level, are we to simply dismiss all the data as noise with no further analysis required? In my experience, there is a tendency to do this, especially from “old-school” statisticians. However, it would seem that even statistically “insignificant” data provide some information that can be used to help understand the problem at hand, perhaps simply in terms of prediction. I think we have to differentiate statistically insignificant from worthless.

  4. yop says:

    About power, what would be a good alternative term/concept?

  5. Stuart Buck says:

    I wonder if James Heckman believes the first line of the abstract there. See, for example, what Heckman says about very small studies on the value of preschool.

  6. ezra abrams says:

    And whose fault is it that the neuroscience people do underpowered experiments?
    Could the poor instruction from the stat dept have anything to do with it?

    I assume most readers here are familiar with a similar phenomenon in the mainstream media, where the media will be aghast at something the media itself created, e.g., the return of E. Spitzer and Weiner to NY politics; is there any doubt the hyper-coverage boosted their name recognition, which led to high poll numbers, which caused the media to decry the high poll numbers of said sleazes?

    • rvman says:

      Probably has more to do with the fact that neuro studies are bloody expensive on a per-subject basis. Also, a lot of neuro papers are based on subjects with rare and weird brain issues (lesions on the hippocampus and such) compared with ‘normal’ subjects. These folks are hard to come by ‘naturally’, and experiments where the researcher inflicts a brain injury, then studies the effect, rarely pass IRB review.

  7. Aaron says:

    I will save you a Google search with this link, where the author explains what Type S and Type M errors are:

    • Eric Rasmusen says:

      Thank you!

      The ghastly terms Type I and Type II need reforming, like “power” and “significance”, but *please* come up with informative names like “publication-hurdle error” or even “false negatives”.

  8. Anonymous says:

    The problem here is trumpeting findings from small samples, not so much the small samples themselves. After all, they can all be combined in a meta-analysis, averaging out the idiosyncratic errors of each.
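A toy sketch of the pooling this comment alludes to (fixed-effect, inverse-variance meta-analysis; all numbers hypothetical): each small study is hopelessly noisy on its own, but combining 25 of them shrinks the standard error by a factor of sqrt(25) = 5.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: 25 small studies, each a noisy estimate of the
# same small true effect (numbers are made up for illustration).
true_effect = 0.1
se_small = 0.5
n_studies = 25

est = rng.normal(true_effect, se_small, n_studies)

# Fixed-effect (inverse-variance) pooling; with equal standard errors
# this reduces to the simple mean, with SE shrunk by sqrt(n_studies).
weights = np.full(n_studies, 1 / se_small**2)
pooled = np.sum(weights * est) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"pooled estimate: {pooled:.3f}  (se {pooled_se:.3f})")
```

Of course this only works if the underpowered studies actually make it into the literature unfiltered; if only the significant ones are published, the pooled estimate inherits their exaggeration.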

    • Joel Chan says:

      I was wondering about this very issue: in my field, there are a lot of underpowered studies (e.g., due to the difficulty in getting suitable participants, as we are not interested in typical WEIRDo undergrads), and after reading this, I wondered if it would be better to reject underpowered studies from getting into the literature (even if they were otherwise methodologically sound), or if it would be better to have them accumulate in the literature (properly caveated of course), and then later employ meta-analysis (or some other method) to try to converge on a more reliable inference. Has this been successfully done elsewhere? Or have there been published statistical studies that prove or address this point?

      • Andrew says:


        My idea is to put these sorts of studies into journals with names like Speculations in Psychological Science. If such studies are presented as what they are, without all the hype, then maybe they can be useful.

        • Joel Chan says:

          Nice. Unfortunately, the current incentive structures are not in favor of either (a) such journals existing and attracting such submissions, or (b) authors losing the hype and getting real about the speculative nature of their results. I am reviewing a manuscript now that has an otherwise sound design and interesting (and theoretically plausible) findings in an ecologically valid context, but with a rather low N (15), and am thinking of recommending (b).

          • Andrew says:


            I agree. But I think there should be space in the ecosystem for such papers. Perhaps I say this because I’ve done a fair amount of speculative research, and I’d love to have a place to park such results. I guess I could/should just post a bunch on arXiv or just submit to low-ranking journals, but I just don’t get around to doing so.

  9. Eli says:

    I second the call for a more constructive contribution. What constitutes a “small” sample is ambiguous, and doubtless the problems that plague very low-power studies still cause your average social science dissertation to sniffle.

  10. […] like this paper a lot (possibly because I already had a high opinion of an earlier paper by Katherine Button). There’s been a lot of discussion in the last couple […]