
“The Statistical Crisis in Science”: My talk in the psychology department Monday 17 Nov at noon

Monday 17 Nov at 12:10pm in Schermerhorn room 200B, Columbia University:

Top journals in psychology routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or embodied cognition, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

Here are the slides (which might be hard to follow without hearing the talk) and here is some suggested reading:


Too Good to Be True

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Slightly technical:

Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors

The Connection Between Varying Treatment Effects and the Crisis of Unreplicable Research: A Bayesian Perspective


  1. Jonathan (another one) says:

    Open to the public?

  2. question says:

    “In the sciences, an experimentum crucis (English: crucial experiment or critical experiment) is an experiment capable of decisively determining whether or not a particular hypothesis or theory is superior to all other hypotheses or theories whose acceptance is currently widespread in the scientific community.”

    In most of these “significant p-value -> my theory is true” cases there is no real “prediction” of the “theory” in the sense that the prediction may be distinct from those of many other theories. It is just that no one has bothered to come up with an alternative explanation for why the two means aren’t equal, etc. Further, all the averaging has hidden the information that may be inconsistent with the “theory”, creating an obstacle to others abducing alternative explanations.

  3. Jason Thomas says:

    I really wish I could attend this talk because it sounds amazing, but I’m not going to be in NYC. I also really wish U.S. universities made the effort that the British universities did to make sessions open (Google: LSE podcasts for the most gleaming example). Open access to ideas is just as powerful as open access to data, reproducibility and open science.

    • Rahul says:

      I’m skeptical that US universities are behind British ones about openness in any systematic sense.

      • Jason Thomas says:

        Consider your skepticism smoothed. Putting undergraduate courses that are reduced in content from the originals on Open Courseware, Coursera or EdX, or anchored short interview podcasts like HBS Ideacast, is not the same as actually making publicly available the talks, panels and debates (and data) where ideas are debated amongst experts and the boundaries of ideas probed. No U.S. university that I know of does this, other than Chicago Law. It’s hard to argue that there is no value to be had by someone plugging an $80 digital recorder into the microphone for this talk and making it available online.

  4. Rahul says:

    “If we can’t trust p-values, does experimental science involving human variation just have to start over?”

    Isn’t the qualifier about human variation redundant? If we cannot trust p-values we cannot trust p-values.

    • Andrew says:


      At a technical level, a lot of the problems arise when signal is low and noise is high. Various classical methods of statistical inference perform a lot better in settings with clean data. Recall that Fisher, Yates, etc., developed their p-value-based methods in the context of controlled experiments in agriculture.

      Statistics really is more difficult with humans: it’s harder to do experimentation, outcomes of interest are noisy, there’s noncompliance, missing data, and experimental subjects who can try to figure out what you’re doing and alter their responses correspondingly.
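The low-signal, high-noise point in the reply above can be illustrated with a small simulation. This is only a sketch and is not taken from the talk or the comment: the function name `significance_filter`, the effect sizes, and the standard error are all hypothetical choices. It draws many estimates from a normal distribution around a true effect, keeps only those that reach “p < 0.05,” and reports how often the surviving estimates have the wrong sign and how much they overstate the true effect, in the spirit of the Type S and Type M errors mentioned in the suggested reading.

```python
import random
import statistics


def significance_filter(true_effect, std_error, n_sims=20000, seed=0):
    """Simulate estimates ~ Normal(true_effect, std_error) and keep the
    ones that are 'statistically significant' (|estimate| > 1.96 * std_error).

    Assumes true_effect is nonzero. Returns a tuple of:
      power        - share of simulated estimates that are significant
      wrong_sign   - share of significant estimates with the wrong sign
      exaggeration - mean |significant estimate| divided by |true_effect|
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)

    power = len(significant) / n_sims
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    exaggeration = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return power, wrong_sign, exaggeration


# Hypothetical settings: a tiny effect measured noisily vs. a large effect
# measured with the same noise.
noisy = significance_filter(true_effect=0.1, std_error=1.0)
clean = significance_filter(true_effect=3.0, std_error=1.0)

print("noisy setting (power, wrong sign, exaggeration):", noisy)
print("clean setting (power, wrong sign, exaggeration):", clean)
```

Under these assumed settings, the significant estimates in the noisy case frequently point in the wrong direction and overstate the true effect many times over, while in the clean case they almost never have the wrong sign and are only mildly inflated.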

  5. Kyle C says:

    How was it received?

  6. Peter chapman says:

    I’m not going to defend p-values but it is (some) psychologists and the way they apply statistical methods that we shouldn’t trust. Moving over to Bayesian methods will not solve this problem.

    • Rahul says:


      A lot of this boils down to intent. If one really *wants* to push a certain result there’s always going to be a way to do it.

    • Andrew says:

      Peter, Rahul:

      1. I agree with you that it’s not just about p-values. The way I (and others) put it, the problem is with null hypothesis significance testing, not with p-values. Null hypothesis significance testing can be done using Bayesian methods and the same problems will arise there as arise with classical p-values.

      To put it another way, there are problems where null hypothesis significance testing is appropriate, but I think these problems are rare, and I think the application of null hypothesis significance testing in science is generally misguided. And if all the p-values were changed to Bayes factors, I’d still feel this way.

      I discussed these issues a bit here.

      2. Rahul wrote, “If one really *wants* to push a certain result there’s always going to be a way to do it.” Sure, but it’s more than that. As Loken and I discuss in our Garden of Forking Paths article, a lot of these problems can arise even when researchers aren’t trying to cheat; it just comes up in analysis choices that are contingent on data.
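A rough simulation can make the point about data-contingent analysis choices concrete. This is a deliberately crude sketch, not the argument of the Garden of Forking Paths article: it assumes a hypothetical study with several outcome measures and no true effect on any of them, and it models the analyst as running one test, on whichever outcome happens to show the largest difference in the data at hand. The names `z_stat` and `false_positive_rate` and all the numeric settings are illustrative assumptions.

```python
import math
import random
import statistics


def z_stat(a, b):
    """Two-sample z statistic (normal approximation, unequal variances)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def false_positive_rate(n_outcomes, n_per_group=50, n_sims=2000, seed=1):
    """Share of pure-noise studies that yield |z| > 1.96 when the single
    reported comparison is chosen after looking at the data."""
    rng = random.Random(seed)
    hits = 0
    for _sim in range(n_sims):
        z_values = []
        for _outcome in range(n_outcomes):
            # Both groups come from the same distribution: no true effect.
            treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
            control = [rng.gauss(0, 1) for _ in range(n_per_group)]
            z_values.append(z_stat(treated, control))
        # Only one comparison is reported, but which one depends on the data.
        chosen = max(z_values, key=abs)
        if abs(chosen) > 1.96:
            hits += 1
    return hits / n_sims


one_path = false_positive_rate(n_outcomes=1)
five_paths = false_positive_rate(n_outcomes=5)

print("false positive rate, one prespecified outcome:", one_path)
print("false positive rate, outcome chosen from five:", five_paths)
```

With a single prespecified outcome the rate stays near the nominal 5%, while choosing among five outcomes after seeing the data pushes it well above that, even though only one test is ever reported. Real forking paths are subtler than picking the largest difference, so this should be read as a lower-resolution cartoon of the problem.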

  7. […] prepared the above image for this talk. The calculations come from the second column of page 6 of this article, and the psychology study […]

  8. […] You’ve got a thing about psychologists, like Andrew Gelman does, haven’t […]

  9. […] result (p<0.02) with a small sample size (N=12). This is precisely the sort of result that one would not expect to replicate. The evidentiary value of such a study is slim, particularly without the methods and analysis being […]
