The failure of null hypothesis significance testing when studying incremental changes, and what to do about it

A few months ago I wrote a post, “Cage match: Null-hypothesis-significance-testing meets incrementalism. Nobody comes out alive.” I soon after turned it into an article, published in Personality and Social Psychology Bulletin, with the title given above and the following abstract:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. However, in a world in which measurements are noisy and effects are small, this will not work: selection on statistical significance leads to effect sizes which are overestimated and often in the wrong direction. After a brief discussion of two examples, one in economics and one in social psychology, we consider the procedural solution of open post-publication review, the design solution of devoting more effort to accurate measurements and within-person comparisons, and the statistical analysis solution of multilevel modeling and reporting all results rather than selection on significance. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data. In such settings, apparent success comes easy but truly replicable results require a more serious connection between theory, measurement, and data.

The body of the article begins:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. A “stylized fact”—the term is not intended to be pejorative—is a statement, presumed to be generally true, about some aspect of the world. For example, the experiments of Stroop and of Kahneman and Tversky established stylized facts about color perception and judgment and decision making. A stylized fact is assumed to be replicable, and indeed those aforementioned classic experiments have been replicated many times. At the same time, social science cannot be as exact as physics or chemistry, and we recognize that even the most general social and behavioral rules will occasionally fail. Indeed, one way we learn is by exploring the scenarios in which the usual laws of psychology, politics, economics, etc., fail.

The recent much-discussed replication crisis in science is associated with many prominent stylized facts that have turned out not to be facts at all (Open Science Collaboration, 2015; Jarrett, 2016; Gelman, 2016b). Prominent examples in social psychology include embodied cognition, mindfulness, ego depletion, and power pose, as well as sillier examples such as the claim that beautiful parents are more likely to have daughters, or that women are three times more likely to wear red at a certain time of the month.

These external validity problems reflect internal problems with research methods and the larger system of scientific communication. . . .

At this point it is tempting to recommend that researchers just stop their p-hacking. But unfortunately this would not make the replication crisis go away! . . . eliminating p-hacking is not much of a solution if this is still happening in the context of noisy studies.

Null hypothesis significance testing (NHST) only works when you have enough accuracy that you can confidently reject the null hypothesis. You get this accuracy from a large sample of measurements with low bias and low variance. But you also need a large effect size. Or, at least, a large effect size, compared to the accuracy of your experiment.

But we’ve grabbed all the low-hanging fruit. In medicine, public health, social science, and policy analysis we are studying smaller and smaller effects. These effects can still be important in aggregate, but each individual effect is small. . . .
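
To make the quoted point concrete, here is a minimal simulation sketch in Python (it is not from the article; the true effect, standard error, and number of replications are invented for illustration). It mimics the significance filter: a small true effect, a noisy measurement, and attention paid only to the statistically significant estimates.

    import numpy as np

    rng = np.random.default_rng(1)

    true_effect = 0.1   # small true effect (invented for illustration)
    se = 0.4            # standard error of each study's estimate (noisy measurement)
    n_studies = 100_000

    # Each "study" yields one noisy estimate of the true effect.
    estimates = rng.normal(true_effect, se, size=n_studies)

    # Keep only the estimates that are statistically significant (two-sided z-test, p < 0.05).
    significant = np.abs(estimates / se) > 1.96
    sig_estimates = estimates[significant]

    print(f"share significant (power):      {significant.mean():.3f}")
    print(f"mean significant estimate:      {sig_estimates.mean():.3f}  (true effect = {true_effect})")
    print(f"exaggeration ratio (type M):    {np.abs(sig_estimates).mean() / true_effect:.1f}")
    print(f"share with wrong sign (type S): {(sig_estimates < 0).mean():.3f}")

Under these particular invented numbers, only about 6% of the studies come up significant, and the ones that do overstate the true effect by roughly an order of magnitude and have the wrong sign about a quarter of the time. That is the overestimated-and-often-in-the-wrong-direction problem described in the abstract.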

I then discuss two examples: the early-childhood intervention study of Gertler et al. which we’ve discussed many times, and a recent social-psychology paper by Burum, Gilbert, and Wilson that happened to come up on the blog around the time I decided to write this paper.

The article discusses various potential ways that science can do better, concluding:

These solutions are technical as much as they are moral: if data and analysis are not well suited for the questions being asked, then honesty and transparency will not translate into useful scientific results. In this sense, a focus on procedural innovations or the avoidance of p-hacking can be counterproductive in that it will lead to disappointment if not accompanied by improvements in data collection and data analysis that, in turn, require real investments in time and effort.
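
One of the analysis solutions the abstract names is multilevel modeling and reporting all results rather than selecting on significance. Here is a minimal Python sketch of that logic (all numbers are invented, and the between-study and within-study standard deviations are treated as known, which a real analysis would estimate instead): many individually underpowered studies, partially pooled, can pin down a small average effect that no single study resolves.

    import numpy as np

    rng = np.random.default_rng(2)

    # Invented setup: 100 small studies of a small, mildly varying effect.
    J, mu, tau, sigma = 100, 0.1, 0.05, 0.3
    theta = rng.normal(mu, tau, J)   # each study's own true effect
    y = rng.normal(theta, sigma)     # each study's noisy estimate (standard error sigma)

    # Individually, few studies clear p < 0.05.
    print("individually significant:", int((np.abs(y / sigma) > 1.96).sum()), "of", J)

    # Pooled estimate of the average effect across all studies.
    mu_hat = y.mean()
    mu_se = np.sqrt((sigma**2 + tau**2) / J)
    print(f"pooled average effect: {mu_hat:.3f} +/- {mu_se:.3f}  (true value {mu})")

    # Partial pooling: each study's estimate is shrunk toward the overall average,
    # with the amount of shrinkage set by the ratio of the two variances.
    shrink = tau**2 / (tau**2 + sigma**2)
    theta_hat = mu_hat + shrink * (y - mu_hat)
    print(f"sd of raw estimates: {y.std():.2f}, sd of partially pooled estimates: {theta_hat.std():.2f}")

The point is not this particular toy model but the general strategy it stands in for: model and report all the estimates together instead of selecting the ones that happen to cross a significance threshold.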

To me, the key point in the article is that certain classical statistical methods designed to study big effects will crash and burn when used to identify incremental changes of the sort that predominate in much of modern empirical science.

I think this point is important; in some sense it’s a key missing step in understanding why the statistical methods that worked so well for Fisher/Yates/Neyman etc. are giving us so many problems today.

P.S. There’s nothing explicitly Bayesian in my article at all, but arguably the whole thing is Bayesian in that my discussion is conditional on a distribution of underlying effect sizes: I’m arguing that we have to proceed differently given our current understanding of this distribution. In that way, this new article is similar to my 2014 article with Carlin where we made recommendations conditional on prior knowledge of effect sizes without getting formally Bayesian. I do think it would make sense to continue all this work in a more fully Bayesian framework.
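
In the spirit of that 2014 article with Carlin, here is a small Python sketch of the design-analysis idea (this is a paraphrase for illustration, not the published code, and the inputs are hypothetical): condition on a plausible true effect size and on the standard error your design can achieve, then compute the power, sign-error rate, and exaggeration you should expect from a significance-based analysis.

    import numpy as np
    from scipy.stats import norm

    def design_analysis(effect, se, alpha=0.05, n_sims=1_000_000, seed=0):
        """Expected power, type S (sign) error rate, and exaggeration ratio (type M)
        for a two-sided z-test, given an assumed true effect and standard error."""
        z_crit = norm.ppf(1 - alpha / 2)          # 1.96 for alpha = 0.05
        lam = effect / se                         # assumed effect in standard-error units
        power = 1 - norm.cdf(z_crit - lam) + norm.cdf(-z_crit - lam)
        type_s = norm.cdf(-z_crit - lam) / power  # P(wrong sign | statistically significant)

        # Exaggeration ratio by simulation: average |estimate| among significant
        # results, relative to the assumed true effect.
        est = np.random.default_rng(seed).normal(effect, se, n_sims)
        sig = np.abs(est) > z_crit * se
        exaggeration = np.abs(est[sig]).mean() / effect
        return power, type_s, exaggeration

    # Hypothetical inputs: a plausible true effect of 0.5 on a scale where the
    # study's estimate has standard error 1.
    power, type_s, exaggeration = design_analysis(effect=0.5, se=1.0)
    print(f"power = {power:.2f}, type S = {type_s:.2f}, exaggeration = {exaggeration:.1f}x")

With those hypothetical inputs the calculation returns power around 0.08, a sign-error rate near 0.09, and an exaggeration factor of roughly five: a statistically significant result from such a design would be close to worthless, which is the kind of conclusion this sort of calculation is meant to surface before (or, retrospectively, after) a study is run.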

8 thoughts on “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it”

  1. social science cannot be as exact as physics or chemistry

    Decades of NHST have made the author give up on their research topic. Either get out or overcome the obstacle; it is insanity to continue in the same ways. (I tried really hard to overcome, then got out, btw.)

    Why not go back and read the pre-1940ish literature on your topic? I have always found promising approaches that were really only limited by the computational ability of those years.

    Null hypothesis significance testing (NHST) only works when you have enough accuracy that you can confidently reject the null hypothesis. You get this accuracy from a large sample of measurements with low bias and low variance. But you also need a large effect size. Or, at least, a large effect size, compared to the accuracy of your experiment.

    But we’ve grabbed all the low-hanging fruit. In medicine, public health, social science, and policy analysis we are studying smaller and smaller effects. These effects can still be important in aggregate, but each individual effect is small.

    Actual science probably hasn’t been applied to your given med/social/psych problem for at least 1-3 generations at this point. I think it is a bit premature to declare that the low-hanging fruit is gone. Why not at least consider that this prediction has come true before busting out the “it is so complicated” excuses:

    “We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.”

    Fisher, R. A. (1958). “The Nature of Probability.” Centennial Review 2: 261–274.

  2. I suspect there’s lots of low-hanging fruit. What it takes, however, is some genius risk-taker to figure out that some of the things hanging low are in fact fruit.

      • And once we have established that they are fruit (e.g., they have been “successfully” replicated), perhaps we can then start to try to find out which fruit is best, and for which purpose (e.g., testing and evaluating competing theories).

  3. It is not that the data are noisy; it’s that the theory is weak to nonexistent. In such a condition, only the cases where the effect is large will give you useful information. Perhaps consilience offers the only way forward: not to test for small effects but to build structures where many studies with small but different effects and noisy data reinforce one another.

  4. I doubt that all the low-hanging, huge-effect-size biomedical fruit has been harvested. The story of Helicobacter pylori should remind us that NHST may not only lead to poor inferences but also obscure otherwise obvious truths. Peptic ulcer disease had all the hallmarks of an infectious process, yet a PubMed search for “peptic ulcer risk factors” prior to 1991 returns several hundred papers that serve as evidence for my claim that NHST is not only a defective tool but one as prone to hide the truth as to reveal it.

    After NHST studies from the 1950s pointed to stress and other aspects of the modern world as causes of peptic ulcer, researchers launched NHST studies of peptic ulcers in bus drivers, farmers, fishermen, and postal workers. Others sorted their subjects by “psychic temper, environmental, work conditions and eating habits,” while yet others grouped them by socioeconomic status, marital status, age, and gender. From there they moved on to testing hypothesized “ulcergenic” foods, lifestyles, jobs, and pollutants. Not (to us) surprisingly, NHST produced many a satisfying “Eureka!” moment for them, and an accompanying discharge of papers to pollute the literature.

    And it only took an N-of-1 study to refute them all (with the notable exception of the well-known propensity of aspirin and its cousins to cause ulcers).

    The good news (and I’m no researcher of H. pylori or anything else, so I disclaim all warranties, implied or otherwise) is that NHST is being used nowadays not to discover the causes of those cases of peptic ulcer that can’t be explained by H. pylori or NSAIDs but rather to suggest good candidates for the application of empirical methods.

    In looking up what’s new re: Alzheimer’s disease, I see the same questioning of dogma, i.e., that statistical correlation proves beta amyloid plaque causes Alzheimer’s. The fact that no drug capable of preventing plaque formation, or of dissolving it once formed, reverses, halts, or even markedly slows the advance of the disease suggests a confounder that would upend the plaque/Alzheimer’s paradigm. Maybe it’ll pan out and maybe not, but I’m willing to bet that ICD-10 codes K27.0-K27.9 (peptic ulcer) aren’t the only low-hanging fruit to be found among the ICD-10’s 68,000 diagnoses and symptoms.

    And maybe there’s something in this revolution for psychology too: https://www.sciencedirect.com/science/article/pii/S0889159115000884
