Low power and the replication crisis: What have we learned since 2004 (or 1984, or 1964)?

I happened to run across this article from 2004, “The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies,” by Scott Maxwell and published in the journal Psychological Methods.

In this article, Maxwell covers a lot of the material later discussed in the paper Power Failure by Button et al. (2013), and the 2014 paper on Type M and Type S errors by John Carlin and myself. Maxwell also points out that these alarms were raised repeatedly by earlier writers such as Cohen, Meehl, and Rozeboom, from the 1960s onwards.

In this post, I’ll first pull out some quotes from that 2004 paper that presage many of the issues of non-replication that we are still wrestling with today. Then I’ll discuss what’s been happening since 2004: what’s new in our thinking over the past fifteen years.

I’ll argue that, yes, everyone should’ve been listening to Cohen, Meehl, Rozeboom, Maxwell, etc., all along; and also that we have been making some progress, that we have some new ideas that might help us move forward.

Part 1: They said it all before

Here’s a key quote from Maxwell (2004):

When power is low for any specific hypothesis but high for the collection of tests, researchers will usually be able to obtain statistically significant results, but which specific effects are statistically significant will tend to vary greatly from one sample to another, producing a pattern of apparent contradictions in the published literature.

I like this quote, as it goes beyond the usual framing in terms of “false positives” etc., to address the larger goals of a scientific research program.

Maxwell continues:

A researcher adopting such a strategy [focusing on statistically significant patterns in observed data] may have a reasonable probability of discovering apparent justification for recentering his or her article around a new finding. Unfortunately, however, this recentering may simply reflect sampling error . . . this strategy will inevitably produce positively biased estimates of effect sizes, accompanied by apparent 95% confidence intervals whose lower limit may fail to contain the value of the true population parameter 10% to 20% of the time.

He also slams deterministic reasoning:

The presence or absence of asterisks [indicating p-value thresholds] tends to convey an air of finality that an effect exists or does not exist . . .

And he mentions the “decline effect”:

Even a literal replication in a situation such as this would be expected to reveal smaller effect sizes than those originally reported. . . . the magnitude of effect sizes found in attempts to replicate can be much smaller than those originally reported, especially when the original research is based on small samples. . . . these smaller effect sizes might not even appear in the literature because attempts to replicate may result in nonsignificant results.

Classical multiple comparisons corrections won’t save you:

Some traditionalists might suggest that part of the problem . . . reflects capitalization on chance that could be reduced or even eliminated by requiring a statistically significant multivariate test. Figure 3 shows the result of adding this requirement. Although fewer studies will meet this additional criterion, the smaller subset of studies that would now presumably appear in the literature are even more biased . . .

This was a point raised a few years later by Vul et al. in their classic voodoo correlations paper.

Maxwell points out that meta-analysis of published summaries won’t solve the problem:

Including underpowered studies in meta-analyses leads to biased estimates of effect size whenever accessibility of studies depends at least in part on the presence of statistically significant results.

And this:

Unless psychologists begin to incorporate methods for increasing the power of their studies, the published literature is likely to contain a mixture of apparent results buzzing with confusion.

And the incentives:

Not only do underpowered studies lead to a confusing literature but they also create a literature that contains biased estimates of effect sizes. Furthermore . . . researchers may have felt little pressure to increase the power of their studies, because by testing multiple hypotheses, they often assured themselves of a reasonable probability of achieving a goal of obtaining at least one statistically significant result.

And he makes a point that I echoed many years later, regarding the importance of measurement and the naivety of researchers who think that the answer to all problems is to crank up the sample size:

Fortunately, an assumption that the only way to increase power is to increase sample size is almost always wrong. Psychologists are encouraged to familiarize themselves with additional methods for increasing power.

Part 2: (Some of) what’s new

So, Maxwell covered most of the ground in 2004. Here are a few things that I would add, from my standpoint nearly fifteen years later:

1. I think the concept of “statistical power” itself is a problem in that it implicitly treats the attainment of statistical significance as a goal. As Button et al. and others have discussed, low-power studies have a winner’s curse aspect, in that if you do a “power = 0.06” study and get lucky and find a statistically significant result, your estimate will be horribly exaggerated and quite possibly in the wrong direction.

To put it another way, I fear that a typical well-intentioned researcher will want to avoid low-power studies—and, indeed, it’s trivial to talk yourself into thinking your study has high power, by just performing the power analysis using an overestimated effect size from the published literature—but will also think that a low-power study is essentially a roll of the dice. The implicit attitude is that in a study with, say, 10% power, you have a 10% chance of winning. But in such cases, a win is really a loss. (The sketch below makes this concrete.)
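Here is a minimal sketch of that calculation in R, with illustrative numbers only. The function design_check is a hypothetical helper written for this post, in the spirit of the Gelman and Carlin (2014) design analysis, not code from that paper: given an assumed true effect and the standard error of the estimate, it returns the power of the two-sided test, the probability that a significant estimate has the wrong sign (Type S), and the average factor by which significant estimates overstate the true effect (Type M).

# Retrospective design analysis in the spirit of Gelman and Carlin (2014).
# design_check is a hypothetical helper, not code from that paper.
design_check <- function(true_effect, se, alpha = 0.05, n_sims = 1e5) {
  d      <- abs(true_effect)
  z_crit <- qnorm(1 - alpha / 2)
  p_hi   <- pnorm( d / se - z_crit)            # P(significant, correct sign)
  p_lo   <- pnorm(-d / se - z_crit)            # P(significant, wrong sign)
  power  <- p_hi + p_lo
  type_s <- p_lo / power
  est    <- rnorm(n_sims, mean = d, sd = se)   # simulate replications of the estimate
  sig    <- abs(est) > z_crit * se             # keep only the "significant" ones
  type_m <- mean(abs(est[sig])) / d            # average exaggeration factor
  c(power = power, type_s = type_s, type_m = type_m)
}

# Example: assumed true effect 1 with standard error 4, so power is about 0.06.
# Result: Type S is about 0.24 and the exaggeration factor is about 9.
design_check(true_effect = 1, se = 4)

So even when the “power = 0.06” study gets lucky, the significant estimate is, on average, close to an order of magnitude too large, and it has roughly a one-in-four chance of pointing the wrong way.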

2. Variation in effects and context dependence. It’s not about identifying whether an effect is “true” or a “false positive.” Rather, let’s accept that in the human sciences there are no true zeroes, and relevant questions include the magnitude of effects, and how and where they vary. What I’m saying is: less “discovery,” more exploration and measurement.

3. Forking paths. If I were to rewrite Maxwell’s article today, I’d emphasize that the concern is not just multiple comparisons that have been performed, but also multiple potential comparisons. A researcher can walk through his or her data and only perform one or two analyses, but these analyses will be contingent on data, so that had the data been different, they would’ve been summarized differently. This allows the probability of finding statistical significance to approach 1, given just about any data (see, most notoriously, this story). In addition, I would emphasize that “researcher degrees of freedom” (in the words of Simmons, Nelson, and Simonsohn, 2011) arise not just in the choice of which of multiple coefficients to test in a regression, but also in which variables and interactions to include in the model, how to code data, and which data to exclude (see my above-linked paper with Loken for several examples).

4. Related to point 2 above is that some effects are really really small. We all know about ESP, but there are also other tiny effects being studied. An extreme example is the literature on sex ratios. At one point in his article, referring to a proposal that psychologists gather data on a sample of a million people, Maxwell writes, “Thankfully, samples this large are unnecessary even to detect minuscule effect sizes.” Actually, if you’re studying variation in the human sex ratio, that’s about the size of sample you’d actually need! For the calculation, see pages 645-646 of this paper.
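To make the sex-ratio arithmetic concrete, here is a back-of-the-envelope calculation with illustrative numbers (not taken from the linked paper): about 48.8% of births are girls, and plausible differences between groups are on the order of a few tenths of a percentage point.

# Rough sample size for comparing Pr(girl birth) between two groups,
# assuming we want 80% power at alpha = .05 (two-sided) to detect a
# difference of 0.3 percentage points. Illustrative numbers only.
p <- 0.488                         # approximate proportion of girl births
d <- 0.003                         # difference in proportions to detect
n_per_group <- 2 * (qnorm(0.975) + qnorm(0.80))^2 * p * (1 - p) / d^2
round(n_per_group)                 # roughly 436,000 per group
round(2 * n_per_group)             # i.e., getting close to 900,000 births in total

# Base R gives essentially the same answer:
power.prop.test(p1 = 0.488, p2 = 0.491, power = 0.80)

Which is indeed in the neighborhood of the million-person sample that Maxwell was dismissing as unnecessary.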

5. Flexible theories: The “goal of obtaining at least one statistically significant result” is only relevant because theories are so flexible that just about any comparison can be taken to be consistent with theory. Remember sociologist Jeremy Freese’s characterization of some hypotheses as “more vampirical than empirical—unable to be killed by mere evidence.”

6. Maxwell writes, “it would seem advisable to require that a priori power calculations be performed and reported routinely in empirical research.” Fine, but we can also do design analysis (our preferred replacement term for “power calculations”) after the data have come in and the analysis has been published. The purpose of a design calculation is not just to decide whether to do a study or to choose a sample size. It’s also to aid in interpretation of published results.
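Continuing with the design_check sketch from point 1 above, the same calculation can be run on a result that is already in the literature. Suppose a published estimate is 20 with standard error 8 (so it cleared the significance bar), but external information suggests the true effect is more like 3; all numbers are illustrative.

# After-the-fact design analysis for a published result (illustrative numbers).
# The assumed true effect comes from external information, not from the
# published estimate itself.
design_check(true_effect = 3, se = 8)
# Power is about 0.07, Type S about 0.15, and the exaggeration factor about 6,
# so the published estimate of 20 is best read as a large overestimate.

The point is not to second-guess any particular study but to have a way of reading published estimates in light of what the design could plausibly have delivered.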

7. Measurement.

21 thoughts on “Low power and the replication crisis: What have we learned since 2004 (or 1984, or 1964)?”

  1. If the main criterion for publication is p < 0.05, then the experiment is actually an experiment to measure p, not an experiment to measure the supposed experimental variable. The statistics of p are abysmal, in the sense that the standard deviation is very large because of the nature of the distribution of p. With an s.d. of about 1/4, you can place no faith in a reported p of 0.05.

    Even several repetitions cannot reduce the s.d. of p to a usable value, because of the sqrt(N) behavior. So almost all experiments that in fact measure p aren't useful for that purpose, let alone for establishing an effect size for the purported focus of the investigation.

    My take on this is for everyone to stop running experiments on p, and return to investigating the variables that are actually of interest. Easier said than done, I know!
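    A quick illustration of how much p bounces around (a minimal simulation, assuming a one-sample z-test with known variance; under the null, p is uniform on (0, 1), so its standard deviation is 1/sqrt(12), about 0.29, close to the 1/4 cited above):

    # Replication-to-replication variability of p.
    # Minimal sketch assuming a one-sample z-test with known sd = 1.
    set.seed(1)
    n_reps <- 1e5

    # Under the null (true mean 0), p is uniform on (0, 1).
    z_null <- rnorm(n_reps)
    p_null <- 2 * pnorm(-abs(z_null))
    sd(p_null)                               # about 0.29

    # With a true mean of 0.5 and n = 25 (roughly 70% power),
    # p still swings wildly from one replication to the next.
    z_alt <- rnorm(n_reps, mean = 0.5 * sqrt(25))
    p_alt <- 2 * pnorm(-abs(z_alt))
    quantile(p_alt, c(.1, .5, .9))           # roughly .0002, .01, and .2

    Even at decent power, the middle 80% of replications spans p-values from far below .05 to well above it, which is the commenter’s point about how little a single p tells you.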

  2. Maxwell’s 2004 paper is really quite good and, like many in the genre, very much overlooked. Thanks for bringing it to the attention of your readers. Underlying point 4 is an issue you’ve touched on lightly that I think is very important and deserves a thorough treatment. It’s what could be called the low-hanging-fruit problem. At this point in the development of the field of psychology (epidemiology/economics/political science/etc.) all the ‘big’ effects have been discovered. What we’re left with is the scientific version of crumbs — the little effects that remain once the ‘big’ slices (factors) that explain large amounts of variation have been removed from the metaphorical cake. I think the question that supersedes variation and context dependence is, how much should we care about these little effects? The only paper I’ve ever seen that even begins to tackle this question in a thoughtful way is Prentice & Miller’s (1992), ‘When small effects are impressive’. (It’s one of my favorite papers and really should be required reading for grad students in social/behavioral sciences.) Putting aside that many of the effects they discuss might not be replicable, the basic idea of considering what makes for an important small effect seems more important now than ever. I’d argue that this question goes hand-in-hand with any discussion about improving the quality and replicability of social and behavioral science findings. Sadly, I’ve yet to see anyone (re)visit this topic. An updated discussion of these issues is sorely needed and long overdue.

  3. Yes, more power is always a good thing. Particularly if it comes for free. However, there is a straightforward modification of procedures that can yield a huge improvement in outcomes, and I don’t often see it given the appropriate prominence. Researchers should feel free to continue doing “fishing”, “exploratory”, “hypothesis generating” studies with lots of possible outputs, but they have to realise that you cannot test a hypothesis with the data that led to the hypothesis being considered in the first place. As soon as something turns up “significant”, “interesting”, “worthy of consideration” in the exploratory study, it becomes worthwhile and necessary to test that thing with another experiment (new data) and to check whether there are other corroborating data available.

    • Michael:

      Sure, but all the preregistration in the world won’t solve the problem if measurements are too noisy. That’s one reason why, when I come across “power = 0.06” studies, I typically don’t recommend a preregistered replication. It’s the kangaroo problem.

  4. Measurement:

    It seems like part of the problem with the measurement issue is that it doesn’t play well with classical statistics and SPSS-type analyses. E.g., how does one account for measurement error with an ANOVA? Happily, Stan and R packages like brms allow users to build measurement error and related difficulties like missing data right into the analyses. So hopefully they keep gaining steam [go Stan!]. But there’s also the issue that measurement error seems to be considered an advanced topic. E.g., McElreath’s exceptional “Statistical Rethinking” text addresses them in the second-to-last chapter. And I don’t want to point a finger, because it’s unclear when the best time to work those concepts in is. But man, if they’re routinely considered advanced or are pushed off till the end, of course they’re among the statistical issues we substantive researchers most often neglect.

    Andrew and others, have you found useful ways to work these issues into your lower-level courses?
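    For what it’s worth, here is a minimal sketch of the brms approach mentioned above (syntax from memory, so treat it as approximate and check the brms documentation): x_obs is a noisy measurement of a latent predictor, and its known measurement-error standard deviation is passed in through me().

    # Measurement error in a predictor with brms (a sketch; check the docs).
    library(brms)

    set.seed(42)
    n      <- 200
    x_true <- rnorm(n)                        # latent predictor
    dat <- data.frame(
      x_obs = x_true + rnorm(n, sd = 0.5),    # noisy measurement of x_true
      sd_x  = 0.5,                            # assumed known measurement-error sd
      y     = 2 * x_true + rnorm(n)
    )

    fit_naive <- lm(y ~ x_obs, data = dat)              # slope attenuated toward 0
    fit_me    <- brm(y ~ me(x_obs, sd_x), data = dat)   # models the noise in x_obs
    summary(fit_me)

    The naive regression understates the slope because of the noise in x_obs; the measurement-error model should recover something closer to the true coefficient of 2.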

    • Although what you bring up is important, I don’t think this is what Andrew is talking about — my reading is that he is talking about “noisy” measures (e.g., lots of variability between subjects) rather than measurement error.

      • You make a good point.

        Happily, in the SEM world the Mplus team has started to take time series models very seriously, the consequence of which is that they’re looking at single-subject analyses and building up from small-group to large-group time series designs—much of which they’re doing within a Bayesian framework, or with Bayesian estimation, at least. They’re calling this dynamic SEM. This is of interest because some of the research they’re appealing to (e.g., Peter Molenaar’s work) highlights the disparities between group-based inference (cross-sectional and longitudinal) and idiographic inference (longitudinal). Hence, “lots of variability between subjects”.

        I highlight Mplus not to step on other software developers’ toes, but because of their prominence in the SEM community. If they’re talking about “lots of variability between subjects,” other SEMers are too. And there’s plenty of overlap between the SEM and multilevel communities. [Also happily, the Mplus team has readily pointed their users to Gelman’s work since at least 2010.]

  5. I did not, and in general do not, think that preregistration is a way forward. Many important types of basic scientific study contain too many aspects of exploration for preregistration to be useful. The study types where preregistration can help are those where the conclusions rest mostly or entirely on a single statistical summary or comparison.

  6. Isn’t one reason that they don’t understand what they’re doing? I understand the need to publish and the tendency to see things that aren’t there, but, scientifically and logically, you can’t lie to yourself that well without actually being ignorant, without actually believing, to some rationalized degree, something that isn’t true, meaning that, statistically, the work is at best unfinished. They seem to think low power means the implied directionality is correct, and they don’t understand that they may be literally wrong: maybe the effect isn’t there, or the effect harms, or whatever it is they don’t want to believe. I guess I could put it this way: low-power studies must come from not wanting to believe that something isn’t true. We all do this, but science is supposed to be different from the regular idiocies.

  7. I certainly agree with Andrew’s points about the importance of selection effects, but there’s a confusion about using power in interpreting results that I’ve mentioned numerous times, including on this blog. So I’ll repost my comment from Aug 17, 2017 (with a slight correction of a typo at the end).

    August 17, 2017 at 4:08 pm
    http://statmodeling.stat.columbia.edu/2017/08/16/also-holding-back-progress-make-mistakes-label-correct-arguments-nonsensical/#comment-544697

    I find it perplexing that there seems to be disagreement on the one straightforward point about significance tests that critics and defenders of tests agree on. Maybe I can clarify. Critics have long berated tests because with large enough sample size, even trivial underlying discrepancies will probably produce significance. Some even say the test is only informing us of how large the sample was: Here’s Jay Kadane (2011, p. 438)*:

    “with a large sample size virtually every null hypothesis is rejected, while with a small sample size, virtually no null hypothesis is rejected. And we generally have very accurate estimates of the sample size available without having to use significance testing at all.”

    The tester agrees with this first group of critics: With large enough sample size, even trivial underlying discrepancies will probably produce significance. Her reply is either to adjust p-values down in dealing with high sample sizes (and there are formulas for this) or, as I prefer, to be clear that the discrepancy indicated by a just significant difference with larger sample size is less indicative of a discrepancy than with smaller. In other words, she says: don’t commit a mountains out of molehills fallacy.

    We are given that the assumptions hold approximately; otherwise we can’t even be talking about principles for interpreting test results.
    Now it seems there’s a second group of critics … who hold that a p-value rejection with a higher sample size is actually stronger evidence of a given discrepancy with the null than with a smaller size. So Kadane (and all who point to the Jeffreys-Lindley result, that a p-value with a large enough sample size is actually evidence for the null) are mistaken, according to this second group. Tests, confidence intervals and likelihood appraisals would also be wrong.

    Consider observing a sample mean M for estimating (or testing) a normal mean with known sigma = 1. With n = 100, 1 SE = .1; with n = 10,000, 1 SE = .01. We observe a 3 SE difference in both cases: M = .3 in the first case, and .03 in the second. 95% confidence intervals are around
    (.1, .5) for n = 100 and
    (.01, .05) for n = 10,000.
    Consider the inference: mu > .1. The second group of critics appears to be saying that this inference, mu > .1, is better indicated with M = .03 (n = 10,000) than it is with M = .3 (n = 100). But it would be crazy to infer mu > .1 with observation M = .03 (n = 10,000)**, while it’s sensible to infer mu > .1 with M = .3 (n = 100).

    That is why I think the only way for the position of the second group of critics to make sense is to see them as questioning the underlying assumptions needed to compute things like the SE, p-values and confidence coefficients. But then the question of which is the appropriate way to reason from test results can’t even get off the ground. n must be large enough to satisfy the assumptions.

    *Principles of Uncertainty.
    **Such a rule would be wrong w/probability ~1.

    A related blogpost:
    https://errorstatistics.com/2017/05/08/how-to-tell-whats-true-about-power-if-youre-practicing-within-the-error-statistical-tribe/
    https://errorstatistics.com/2014/03/12/get-empowered-to-detect-power-howlers/

    • The criticism from the “second group” is focused on the estimates we see after they’ve been passed through a significance filter. An unbiased estimator conditioned on significance gives us a biased estimator, with the magnitude of the bias inversely related to the power of the study, hence the criticism of significant results arising from underpowered studies.

      I’ve never heard an advocate from this “second group” make a claim along the lines of the one you present here: that a 3 SE difference from a large sample (small M) is more trustworthy than a 3 SE difference from a small sample (large M). The criticism is that for small samples, only a large M will do if you require significance to publish, and so published estimates will be biased upward. But for large samples, the significance filter is practically non-existent (everything’s significant!), and so published estimates won’t be subject to this form of bias.

  8. Amateur question.

    From above:
    “low-power studies have a winner’s curse aspect, in that if you do a “power = 0.06” study and get lucky and find a statistically significant result, your estimate will be horribly exaggerated and quite possibly in the wrong direction.”

    From: https://errorstatistics.com/2015/07/29/telling-whats-true-about-power-if-practicing-within-the-error-statistical-tribe/
    “If the test’s power to detect the alternative is very low, then the statistically significant x is good evidence of a discrepancy (from the null) … .”

    I don’t get it :(

    • Pren, I do believe the disagreement is due to some difference in the underlying assumptions between the two camps (Camp 1 = low *prior* power + statistical significance = good evidence against the null; Camp 2 = low *effective* power + statistical significance = exaggeration).

      In the platonic world where researchers do have a single hypothesis, a well-defined model, and assumption-adequate data, Camp 1 has good reason to believe its conclusion. After all, if power is low for an a priori very small effect, a statistically significant event is really rare, which points to the possibility that our prior estimate is much lower than the true value (which suggests a greater discrepancy from the nil null).

      In the real world (at least in the real world of psychological research), researchers only rarely have a single hypothesis, a well-defined model, and assumption-adequate data. Model flexibility alone is enough to always get statistical significance. Coupled with selection effects, estimates will exaggerate the true effect size.

      I have done some simulations in R to demonstrate the idea ( https://imgur.com/a/zA80d ).

      For the upper plot set, power is defined as ‘finding at least one statistically significant estimate’ (not a good definition from a statistical theory point of view, but one of the most common practices in research using p-values). Each smaller plot is based on the number of independent effects tested in a single study. For a single effect, the power behaves just as theoretically expected for a single test: smaller effect sizes (red) have very low effective power. But as the number of effects evaluated in a single study increases, the ‘power’ to detect at least one effect is very high even for very low effect sizes – so statistical significance in a single test (among many) is highly probable even under a small-effect world. (A short version of this calculation appears right after this comment.)

      For the lower plot set, we have the exaggeration ratio under the same scenarios. The number of effects evaluated in the study doesn’t seem to affect the exaggeration ratio much, as expected. Exaggeration is greater mainly in studies with low power, again as expected. The problem is, given that we are almost guaranteed to find at least one statistically significant estimate if we evaluate many effects in a single study, that estimate is also guaranteed to greatly exaggerate the true effect size if power is low and the effect is small. So, under a small-effect, low-power, multiple-comparisons, select-on-statistical-significance world, we are guaranteed to publish (because we do get statistical significance) highly exaggerated effect estimates.

      In the end, as Mayo has already pointed out above, it’s all about selection under statistical significance. Unfortunately, ‘reporting under selection’ is much closer to actual practice than the platonic single hypothesis per study. Even when researchers are honest and report the plethora of evaluated and non-significant effects, the discussion is usually focused on what was significant.
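      The ‘at least one significant result’ arithmetic from the comment above, in a couple of lines of R (the per-test power of 0.10 is just an illustrative value):

      # Chance of at least one p < .05 when a study tests k independent effects,
      # each with the same per-test power (illustrative value of 0.10).
      per_test_power <- 0.10
      k              <- c(1, 2, 5, 10, 20)
      round(1 - (1 - per_test_power)^k, 2)    # 0.10 0.19 0.41 0.65 0.88

      That is the mechanism behind the upper set of plots; the exaggeration conditional on significance depends on the per-test power, not on k, which is why the lower set barely moves.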

      • I see the two camps a bit differently:

        1) camp 1 thinks this study is low power because they believe the true effect is small (and highly conditional). So if p < alpha then the effect is likely inflated. But of course if n is big then the study is not really low power, despite the small true effect.

        2) camp 2 thinks this study is low power because n is small. Under this condition, the finding p < alpha will be much more common if the true effect is big than if the true effect is small. So, the most likely conclusion is that the true effect is big if p < alpha. But of course if the true effect is big, then it’s not really low power, despite the small n.
