You’ll get a high Type S error rate if you use classical statistical methods to analyze data from underpowered studies

Brendan Nyhan sends me this article from the research-methods all-star team of Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, and Marcus Munafò:

A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.

I agree completely. In my terminology, with small sample sizes, the classical approach of looking for statistical significance leads to a high rate of Type S error. Indeed, this is a theme of my paper with Weakliem (along with much earlier literature on psychology research methods). I’d love this stuff even more if they stopped using the word “power,” which unfortunately is strongly tied to the not-so-useful notion of statistical significance. Also, I didn’t notice whether they mentioned the statistical significance filter, that is, the problem that statistically significant results tend to have high Type M errors (they overestimate the magnitude of the true effect). In any case, it’s good to see this stuff getting further attention. I also think it would be useful for them to go further and provide guidance on how to better analyze data from small samples. Saying not to design low-power studies is fine, but once you have the data, there’s no point in ignoring what you have.
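To make the Type S and Type M ideas concrete, here is a minimal simulation sketch (not from the paper; the true effect of 0.1 and the standard error of 0.5 are made-up numbers standing in for an underpowered design). It keeps only the estimates that reach p < 0.05 and then checks how often they have the wrong sign (Type S) and by how much they exaggerate the true effect (Type M):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers for illustration only: a small true effect measured
# with a large standard error, i.e., an underpowered design.
true_effect = 0.1
se = 0.5
n_sims = 1_000_000

# Unbiased estimates from many hypothetical replications of the same study.
estimates = rng.normal(true_effect, se, n_sims)

# Keep only the replications that reach statistical significance at p < 0.05.
significant = np.abs(estimates / se) > 1.96

power = significant.mean()
type_s_rate = (np.sign(estimates[significant]) != np.sign(true_effect)).mean()
type_m_ratio = np.abs(estimates[significant]).mean() / true_effect

print(f"power               ~ {power:.3f}")
print(f"Type S error rate   ~ {type_s_rate:.2f}  (significant, wrong sign)")
print(f"Type M exaggeration ~ {type_m_ratio:.1f}x")
```

With these invented numbers the power is only about 5%, roughly a quarter of the statistically significant estimates have the wrong sign, and on average the significant estimates overstate the true effect by around an order of magnitude; as the standard error shrinks, all three problems fade.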

16 thoughts on “You’ll get a high Type S error rate if you use classical statistical methods to analyze data from underpowered studies”

  1. Aren’t “the statistical significance filter” and “the winner’s curse” basically the same thing (fig. 5)?

  2. I’m particularly concerned about the last point raised by Andrew. If we cannot find a statistically significant result against a null hypothesis at the 0.05 level, are we to simply dismiss all the data as noise, with no further analysis required? In my experience, there is a tendency to do this, especially among “old-school” statisticians. However, it would seem that even statistically “insignificant” data provide some information that can help us understand the problem at hand, perhaps simply in terms of prediction. I think we have to differentiate statistically insignificant from worthless.

  3. And whose fault is it that the neurosci people do underpowered experiments?
    Could the poor instruction from the stat dept have anything to do with it?

    I assume most readers here are familiar with a similar phenomenon in the mainstream media, where the media will be aghast at something the media itself created, e.g., the return of E. Spitzer and Weiner to NY politics; is there any doubt that the hyper coverage boosted their name recognition, which led to high poll numbers, which then caused the media to decry the high poll numbers of said sleazes?

    • Probably has more to do with the fact that neuro studies are bloody expensive on a per-subject basis. Also, a lot of neuro papers are based on subjects with rare and weird brain issues (lesions on the hippocampus and such) compared with ‘normal’ subjects. These folks are hard to come by ‘naturally’, and experiments where the researcher inflicts a brain injury, then studies the effect, rarely pass IRB review.

    • Thank you!

      The ghastly terms Type I and Type II need reforming, like “power” and “significance”, but *please* come up with informative names like “publication-hurdle error” or even “false negatives”.

  4. The problem here is trumpeting findings from small samples, not so much the small samples themselves. After all, they can all be combined in a meta-analysis, which averages out the idiosyncratic errors of each (a rough sketch of this kind of pooling appears after this thread).

    • I was wondering about this very issue: in my field, there are a lot of underpowered studies (e.g., due to the difficulty in getting suitable participants, as we are not interested in typical WEIRDo undergrads), and after reading this, I wondered if it would be better to reject underpowered studies from getting into the literature (even if they were otherwise methodologically sound), or if it would be better to have them accumulate in the literature (properly caveated of course), and then later employ meta-analysis (or some other method) to try to converge on a more reliable inference. Has this been successfully done elsewhere? Or have there been published statistical studies that prove or address this point?

      • Joel:

        My idea is to put these sorts of studies into journals with names like Speculations in Psychological Science. If such studies are presented as what they are, without all the hype, then maybe they can be useful.

        • Nice. Unfortunately, the current incentive structures do not favor either (a) such journals existing and attracting such submissions, or (b) authors dropping the hype and getting real about the speculative nature of their results. I am reviewing a manuscript now that has an otherwise sound design and interesting (and theoretically plausible) findings in an ecologically valid context, but a rather low N (15), and I am thinking of recommending (b).

        • Joel:

          I agree. But I think there should be space in the ecosystem for such papers. Perhaps I say this because I’ve done a fair amount of speculative research, and I’d love to have a place to park such results. I guess I could/should just post a bunch on arXiv or submit to low-ranking journals, but I just don’t get around to doing so.
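As a rough illustration of the pooling idea raised in this thread (the study count, effect size, and standard error below are invented), a fixed-effect inverse-variance meta-analysis combines many noisy but unbiased estimates into one much more precise estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented numbers: 20 small studies, each an unbiased but very noisy
# estimate of the same underlying effect.
true_effect = 0.1
se_per_study = 0.5
k = 20

estimates = rng.normal(true_effect, se_per_study, k)
ses = np.full(k, se_per_study)

# Fixed-effect inverse-variance pooling: weight each study by 1 / se^2.
weights = 1.0 / ses**2
pooled_estimate = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"single-study SE ~ {se_per_study:.2f}, pooled SE ~ {pooled_se:.2f}")
print(f"pooled estimate ~ {pooled_estimate:.2f}  (true value {true_effect})")
```

The catch, tying back to the statistical significance filter mentioned in the post, is that this only helps if the small studies enter the meta-analysis whether or not they reached significance; if only the significant ones get published, the pooled estimate inherits their Type M exaggeration.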

  5. I second the call for a more constructive contribution. What constitutes a “small” sample is ambiguous, and doubtless the problems that plague very low-power studies still cause your average social science dissertation to sniffle.

  6. Pingback: Impact factor 911 is a joke
