
No to inferential thresholds

Harry Crane points us to this new paper, “Why ‘Redefining Statistical Significance’ Will Not Improve Reproducibility and Could Make the Replication Crisis Worse,” and writes:

Quick summary: Benjamin et al. claim that FPR would improve by factors greater than 2 and replication rates would double under their plan. That analysis ignores the existence and impact of “P-hacking” on reproducibility. My analysis accounts for P-hacking and shows that FPR and reproducibility would improve by much smaller margins and quite possibly could decline depending on some other factors.

I am not putting forward a specific counterproposal here. I am instead examining the argument in favor of redefining statistical significance in the original Benjamin et al. paper.

From the concluding section of Crane’s paper:

The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.

Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS [redefining statistical significance] proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [22] and propping up the flailing and failing scientific establishment under which the crisis has thrived.

I like Crane’s style. I can’t say that I tried to follow the details, because his paper is all about false positive rates, and I think that whole false positive thing is inappropriate in most science and engineering contexts that I’ve seen, as I’ve written many times (see, for example, here and here).

I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. These thresholds are bad news—and, as Hal Stern and I wrote awhile ago, it’s not just because of the 0.049 or 0.051 thing. Remember this: a z-score of 3 gives you a (two-sided) p-value of 0.003, and a z-score of 1 gives you a p-value of 0.32. One of these is super significant—“p less than 0.005”! Wow!—and the other is the ultimate statistical nothingburger. But if you have two different studies, and one gives p=0.003 and the other gives p=0.32, the difference between them is not at all remarkable. You could easily get both these results from the exact same underlying result, based on nothing but sampling variation, or measurement error, or whatever.
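To make the arithmetic concrete, here is a minimal sketch of that comparison. It assumes, purely for illustration, two independent studies whose estimates share the same standard error, so that the z-scores of 3 and 1 can be compared directly; the function name `two_sided_p` is made up for this example.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2))


# Two hypothetical studies, with estimates measured in units of a common
# standard error (an assumption made only for this illustration).
z_a, z_b = 3.0, 1.0
p_a = two_sided_p(z_a)
p_b = two_sided_p(z_b)

# The difference of two independent estimates, each with standard error 1,
# has standard error sqrt(1**2 + 1**2) = sqrt(2).
z_diff = (z_a - z_b) / math.sqrt(2)
p_diff = two_sided_p(z_diff)

print(f"study A:    z = {z_a:.2f}, p = {p_a:.4f}")
print(f"study B:    z = {z_b:.2f}, p = {p_b:.4f}")
print(f"difference: z = {z_diff:.2f}, p = {p_diff:.4f}")
```

Under these assumptions the two studies come out at roughly p = 0.003 and p = 0.32, while the difference between them has a z-score of about 1.4 and a p-value of about 0.16, which would not clear any conventional threshold.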

So, scientists and statisticians: All that thresholding you’re doing? It’s not doing what you think it’s doing. It’s just a magnification of noise.

So I’m not really inclined to follow the details of Crane’s argument regarding false positive rates etc., but I’m supportive of his general attitude that thresholds are a joke.

Post-publication review, not “ever expanding regulation”

Crane’s article also includes this bit:

While I am sympathetic to the sentiment prompting the various responses to RSS [1, 11, 15, 20], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation—the 5% significance level—lies at the heart of the crisis.

As a coauthor of one of the cited papers ([15], to be precise), let me clarify that we are not “calling for bigger and better scientific regulations,” nor are we advocating for pre-registered studies (although we do believe such studies have their place), nor are we proposing to “ban” anything!, nor are we offering any “generic calls to action.” Of all the things on that list, the only thing we’re suggesting is “better study design”—and our suggestions for better study design are in no way a call for “ever expanding scientific regulation.”


  1. Harry Crane says:

    Thanks for posting.

    I’ll clarify the last point you make about post-publication review in the next revision. Best not to lump “Abandon statistical significance” with the others. The attitude of “abandon” and the Gelman-Hennig subjective-objective paper is as close to my point of view as anything else I’ve seen. Thanks for pointing that out.

  2. Jonathan says:

    Law uses a variety of methods: beyond a reasonable doubt in criminal cases, and then clear and convincing versus preponderance of the evidence in various other cases. When you add in comparative negligence, meaning fault may be divided into percentages, you get a fairly detailed picture of trying to find answers in relatively uncertain matters. So for example, you may see a case where a party loses by preponderance of the evidence, which is something more than 50%, but then is found comparatively negligent so the actual recovery is reduced by that. These are real life, money on the line judgements. If you treat a jury or judge as an analytical package running a series of statistical tests, then law has developed a better way of handling basic noisy relative uncertainty than many sciences. No one knows what ‘beyond a reasonable doubt’ means but you can think of it as an evaluation of this or that versus how that fits into the causative chain that’s at issue in the case. It’s easy to see that as a significance test but you also see the difficulty: it requires a model of the causative chain that describes what happened and why and which branches at many levels. People grasp that model when presented by prosecution and defense – and they bring their own ideas to it – but imagine if this had to be fit to a standard model and run through a package that determines if this doubt rises to the right p level. It’s absurd because so many judgements are made about how the facts are imagined, laid out and presented and how the people fit them together. In fact, I see a jury as a series of models running the facts and links presented to them as each juror has heard them and that continues until the iterations coalesce around a solution.

  3. Jacob says:

    In general, the well-intended responses to the “replication crisis” have at times struck me as overly dogmatic and consequently fairly dangerous. I appreciate Andrew’s blog because while maybe some might want to describe Andrew as dogmatic, on closer reading you will find that this blog espouses a more open-minded approach to statistical inference. To the extent this blog has a dogma, it’s an anti-dogma dogma. I’m appreciative of that since it’s easy to take this blog’s influential takedowns of crap science produced by a bad set of rigid practices and then insist on replacing them with a new set of rigid practices.

    When I took a course on Shakespeare in undergrad, we spent some time on his sonnets. The thing about Shakespeare’s (and many other greats’) sonnets is that he treats the rules of the sonnet as overly binding but a necessary evil. Shakespeare ostensibly writes sonnets but at his best subtly breaks the rules or follows them in ways that produce thought-provoking contradictions. Sometimes what makes it great art is that he manages to express an idea in a format that seems so incompatible with clearly communicating it. While we maybe should not treat art as completely in opposition to science, the way imposing a challenging set of rules is conducive to great art may not be so useful for doing good science. My fear is we are deciding the sonnet is bad but debating whether the limerick or haiku is the better alternative.

    As for the RSS idea and related ones, I have generally been concerned with the strong statements unsupported by data. To be sure, we’re talking about potential outcomes that would be difficult to impossible to measure without just going through the pain of implementing them, but my feeling is that the various proposals for improving science are presented with a bit too much confidence about their positive effects and too little reckoning of their potential for unintended negatives and backfiring on their primary goals. And I do not appreciate how in scientific debates arguments are sometimes not made in good faith; the many authors behind RSS seem to admit they advocate for the more stringent threshold because it might push people to consider alternatives to testing a point-null with frequentist statistics, but history shows us that it’s usually not the nuanced clarification but the big idea that survives. If the goal is getting rid of fetishized thresholds or faulty methods, argue against them rather than making a bold claim that isn’t meant to be taken completely seriously.

    • Harry Crane says:

      “If the goal is getting rid of fetishized thresholds or faulty methods, argue against them rather than making a bold claim that isn’t meant to be taken completely seriously.”

      I completely agree. That’s why I made the argument using the same data and the same approach (false positive rate) as in the Benjamin, et al paper. I’m not a proponent of P-values, false positive rates, or any other statistical method, but it’s no use to be overly dismissive of these other proposals. RSS is a very bad, disingenuous proposal. But it’s not enough to dismiss false positive rates and significance levels as irrelevant and move on. This misses the point that there are a lot of people (evidently the Benjamin, et al authors) who think they are very relevant. The counterargument should be directed toward those who disagree, not those who are already on board.

    • Keith O'Rourke says:

      > goal is getting rid of fetishized thresholds or faulty methods, argue against them rather than making a bold claim that isn’t meant to be taken completely seriously.

      Is that really the goal of _redefining papers_?

      Could they be bids for prestige (i.e. making a bold claim) to enhance the careers of the authors?

      (Also don’t think Harry’s comment below – very bad, disingenuous proposal – is overly harsh.)

      • Harry Crane says:

        > Also don’t think Harry’s comment below – very bad, disingenuous proposal – is overly harsh.

        Not sure whether you do or don’t think my comment is overly harsh. But here’s why it’s not:

        Why proposal is “very bad”: has already been covered in detail on this blog and also in the responses by Trafimow, et al, McShane, et al, Amrhein and Greenland, and numerous others.

        Why proposal is “disingenuous”:

        1) Authors make quantitative claims about the benefits of this proposal (FPR will decrease by at least factor of 2, replication rate will double) which my analysis shows are misleading. Whether or not benefits would actually come to fruition is unknowable right now, but there is no evidence to suggest that they would. Yet they are presented in the original article as fact.

        2) Authors themselves admit in the introduction that they do not all agree that RSS is the best approach, but they are putting forward something that they think will “quickly achieve broad acceptance” — the illusion of progress.

  4. Martha (Smith) says:

    “I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. “

    Well put!

  5. I don’t think commentary is dogmatic. I’m not sure though we can conceive better study design unless preregistrations, preprints, blogs, are accessible. Point being that most of us, even under the best of circumstances, may get it wrong without transparency and statistical perspicuity. It’s a view I share with Dr. John Ioannidis. I admit, now on closer inspection, that I favor abandoning statistical significance & p-values. It’s that I’m not confident that other measurement tools can really control for noisy data. I hear that Dr. Ioannidis has voiced that statistical methods have improved in biomedical enterprises. I haven’t really found clear plain English explanations for such a broad assertion.

  6. Carlos Ungil says:

    If the difference between a z-score of 1 and a z-score of 3 is “not at all remarkable”, at what point would it become remarkable?

  7. Anonymous says:

    From the pre-print:

    “To borrow from Feyerabend (2010, p. 7), “The only principle that does not inhibit progress is: anything goes”.”
    “The only way to reverse course is to loosen—not tighten—the restrictions on what makes an analysis scientific and a finding significant.”

    These 2 sentences confuse me.

    I don’t agree with the Feyerabend quote as i interpret it, as i reason that if “anything goes”, a scenario can be thought of that would inhibit progress (however you would define it). For instance, if there are no specific rules and/or methods to search for and validate “knowledge”, everything becomes an assumption and/or opinion. If everything becomes an assumption and/or opinion, i would reason that would not result in much progress (?). Also, if we are talking about “scientific” progress, doesn’t that imply that “science” is involved, and if so, does this not imply that there are certain principles/rules/etc. at work which should be adhered to (?).

    This perhaps also relates to the other sentence about “loosening restrictions”. I don’t get this either, or perhaps i think it might be important to make clear what “restrictions” mean/imply. I am very bad at statistics, so please forgive me (and possibly correct me) if i am wrong. But, i understood that p-hacking is “wrong” because you no longer adhere to certain assumptions and/or this invalidates the diagnostic value of p-values.

    Regardless of whether i understood things correctly regarding p-hacking/p-values, i am trying to come up with an example of something in science that has certain “rules” attached to it. If these can be found in science, then i reason formulating “restrictions” can be most useful, and “loosening restrictions” can be detrimental to science…

      • Anonymous says:

        Thank you for the link! I’ve read it, and it could be too complicated for me. Regardless, perhaps it allows me to try and make my point in another way.

        When reading the post you link to, 2 things caught my eye:

        1) It appears to me rational thought (which may in itself rest on certain assumptions), and reasoning, is being used to discuss something science-related.
        2) There is a critical stance towards (statistical) methodologies expressed in the post and discussion.

        Now, my point is that perhaps these 2 things (rational thought and being critical) can be seen as being in line with scientific principles. If this makes any sense, you could for instance agree on using sound and valid reasoning for discussions about a certain scientific method (e.g. use of statistics), and you would hereby agree to some “rules” (e.g. appeal to authority is not allowed). When someone would then point out an appeal to authority, you could say that person is “restricting” something in science, but i.m.o. and reasoning this is necessary for science to work, and possibly progress.

        From this point of view, i still reason “anything does *not* go”, and some “restrictions” are necessary in science. If you are playing a game, you should stick to the rules, or else you’re not playing the game anymore.

        Perhaps this can also be related to what the author of the pre-print has done. In his own words (see a reaction above): “That’s why I made the argument using the same data and the same approach (false positive rate) as in the Benjamin, et al paper.”. In a way, you could perhaps say that even though he might not even agree with the method/statistic (false positive rate), within the “rules” regarding false-positive rates he provides additional analyses/information and comes to a different/more nuanced conclusion.

        In my view, he himself in this case does not adhere to “anything goes”, and in a way “restricts” the authors of the RSS-paper in their conclusions, by “playing by the same rules” in this particular “game of science”. That’s exactly my point i think.

        I wonder if the paper would be better if the type of sentences i was/am confused about would be left out of the paper, as i feel they do not serve a purpose given the rest of his paper and can only lead to possible confusion. They can also lead to being critical (which might be a good thing and in line with scientific principles), but i think it would be better to save that stuff for a different kind of paper. Just my 2 cents.

        • Harry Crane says:

          I’d recommend looking at more original sources about Feyerabend and related work by Thomas Kuhn.

          Quoting Feyerabend regarding “anything goes”: “‘anything goes’ is not a ‘principle’ I hold… but the terrified exclamation of a rationalist who takes a closer look at history.” In other words, “anything goes” is an observation about the way in which science has progressed throughout history. It often progresses (a la Kuhn) by *revolution*, which involves among other things a break from the norms/paradigms from which it is progressing.

          As for whether I adhere to “anything goes” in my response. “Anything goes” is not a principle to *adhere* to. So I don’t adhere to it, or anything else. But also, it is worth noting that my response doesn’t make *progress* either.

          >If you are playing a game, you should stick to the rules, or else you’re not playing the game anymore.

          Right now, the “rules” involve P<0.05. Why should anyone want to play this game?

          Another point of the conclusion is: who gets to decide these rules? Whoever is setting the current rules is doing a pretty bad job.

          • Anonymous says:

            Thank you for your reaction!

            “In other words, “anything goes” is an observation about the way in which science has progressed throughout history.”

            I am personally not really interested in how things have gone, only how they could/should go. I think it is reasonable to conclude that there has been scientific progress in the course of human history. If this makes sense, i am only interested in finding out why and how there has been made progress, in order to try and come up with important principles/rules for science to uphold. So, only in light of my priority of finding out how and why science could/should work am i possibly interested in how things have gone in the past.

            “As for whether I adhere to “anything goes” in my response. “Anything goes” is not a principle to *adhere* to. So I don’t adhere to it, or anything else”

            Just for possible clarification, I didn’t mean to state anything about you adhering or not adhering to “anything goes”. I tried to use it as an example of trying to make my point about “anything goes/does not go”, “restrictions”, and “progress”.

            “But also, it is worth noting that my response doesn’t make *progress* either.”

            Aha, that is where i disagree, and that may be why i might view things differently regarding “anything goes”, and “restrictions”. I am not smart enough to check your analyses, but assuming they are correct and i am interpreting (the gist of) your paper correctly, i think you may have made a valid and useful contribution to the RSS discussion! I think that’s a pre-requisite/part of scientific progress.

            “Right now, the “rules” involve P<0.05. Why should anyone want to play this game? Another point of the conclusion is: who gets to decide these rules? Whoever is setting the current rules is doing a pretty bad job."

            I am not smart enough to answer these questions. I just wanted to make my point about "anything goes" and "restrictions". For instance, regarding p-values and possibly valid "restrictions" concerning the use of them in relation to for instance p-hacking.

            I hope this makes sense, and makes my point clear: If you want to play the game (science), you play by the rules (p-values) or propose new rules (insert appropriate (statistical) method here). What you can't do in my reasoning is saying you play by the rules (p-values) but not really doing that (p-hacking).

          • Keith O'Rourke says:

            For those willing, I would suggest C.S. Peirce in particular his first, and in one sense sole, rule of reason – Do not block the way of inquiry. (Yes, a terrified exclamation of a rationalist who did take a close look at history.)

            But then as regular readers of this blog know, I always do suggest this ;-)

            This might suffice “Although it is better to be methodical in our investigations, and to consider the economics of research, yet there is no positive sin against logic in trying any theory which may come into our heads, so long as it is adopted in such a sense as to permit the investigation to go on unimpeded and undiscouraged.”

  8. I believe that there are individuals with exceptional perspicuity in every arena that can evaluate merits and demerits to particular method. As for measurement, I intuit that improvements will come from outside statistics and social science communities. I think that it will require a leap of imagination in the best sense of Feyerabend’s ‘anything goes.’

  9. Malcolm says:

    “the ultimate statistical nothingburger” hahah. Among your many written contributions to statistics, this is at the very top.

  10. Thanatos Savehn says:

    Assume that you are a judge charged with an evidentiary gatekeeping role whereby you are not to make TRUE/FALSE decisions about the claim made in a proffered scientific paper but rather to decide whether it is (a) testable; (b) tested; and, (c) supported by the analysis reflected in the paper. Take (a) and (b) (via NHST) as given and (c) is p=0.49. Do you allow an expert to rest his “to a reasonable degree of medical probability I believe Defendant Dr. Crane committed malpractice” on this paper? If not, why not?

    • Did you mean 0.049, i.e., just on the edge of the traditional 0.05 threshold?

    • Dzhaughn says:

      I think proposition (c) p = .49 and proposition (b) the claim is tested are very much in tension with each other, if not completely contradictory.

      If Dr. Crane administered a treatment based on a theory tested in a single paper with a p = .49 NHST, then he is morally guilty of malpractice. What if the legal system did the same to Dr. Crane?

      • Thanatos Savehn says:

        Exactly. And yet, not long ago, a court ruled in the midst of this “anything can be evidence of causation; it’s all down to ‘professional judgment’ after all, right?” craze that so long as p.5 in support of “X more likely than not DOES cause Y”. What I’m trying to ask, inartfully, is the following: Is there some p-value less than 0.5 that all could agree is no evidence of anything?

  11. Anonymous says:

    And here is the reaction by (at least one) of the RSS-paper authors:

    I am left confused. It appears to me, (at least one of the) RSS-author does not engage in any discussion about the main point of the criticism by Crane…

    This is similar to me as “the old boys network” of publishing in high impact factor journals, and having an “official reply” by the other party whilst not engaging in any real discussion, and simply repeating one’s message…

    To be fair, the RSS-author(s) does give Crane a chance to speak, which is included in the post. I think that’s nice. And i think that reply by Crane in turn makes it even more clear that the RSS-author(s) may have nothing more to say on this, and/or are reluctant to acknowledge a possibly valid and important point by Crane…

    I hope i am not smart enough to figure out what’s going on here, but if i am interpreting this all correctly i am deeply disappointed by the RSS-author(s) to say the least…
