Statistical Significance and the Dichotomization of Evidence (McShane and Gal’s paper, with discussions by Berry, Briggs, Gelman and Carlin, and Laber and Shedden)

Blake McShane sent along this paper by himself and David Gal, which begins:

In light of recent concerns about reproducibility and replicability, the ASA issued a Statement on Statistical Significance and p-values aimed at those who are not primarily statisticians. While the ASA Statement notes that statistical significance and p-values are “commonly misused and misinterpreted,” it does not discuss and document broader implications of these errors for the interpretation of evidence. In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p-values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p-value crosses the conventional 0.05 threshold for statistical significance. We discuss implications and offer recommendations.

The article is published in the Journal of the American Statistical Association along with discussions:

A p-Value to Die For, by Don Berry

The Substitute for p-Values, by William Briggs

Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work, by Andrew Gelman and John Carlin

Statistical Significance and the Dichotomization of Evidence: The Relevance of the ASA Statement on Statistical Significance and p-Values for Statisticians, by Eric Laber and Kerby Shedden

and a rejoinder by McShane and Gal.

Good stuff. Read the whole thing.

For some earlier blog discussions of the McShane and Gal paper and related work, see here (More evidence that even top researchers routinely misinterpret p-values), here (Some natural solutions to the p-value communication problem—and why they won’t work), here (When considering proposals for redefining or abandoning statistical significance, remember that their effects on science will only be indirect!), and here (Abandon statistical significance).

23 thoughts on “Statistical Significance and the Dichotomization of Evidence (McShane and Gal’s paper, with discussions by Berry, Briggs, Gelman and Carlin, and Laber and Shedden)”

  1. If I choose my test statistic so that it properly reflects what I’m interested in, and some kind of stupid “no effect” null model, a large p-value will still tell me that my data are well compatible with “no effect” and therefore I should not interpret them as any kind of evidence for an effect.
    Obviously this doesn’t mean I have to “believe” or “accept” the stupid null model, nor will I “believe” any specific alternative if the p-value is small. Statistical models are not there to be “believed” or to be “true”; they shed some light on the data, nothing more.
    So p-values still tell me something and I’m still gonna use them, despite all the Bayesian propaganda (like in Briggs’ discussion), and despite it being clearly true that many if not most people get them wrong. (That shouldn’t be the problem of those who don’t.)
    Have fun tearing this to shreds!

    • If I choose my test statistic so that it properly reflects what I’m interested in, and some kind of stupid “no effect” null model, a large p-value will still tell me that my data are well compatible with “no effect” and therefore I should not interpret them as any kind of evidence for an effect.

      This is wrong:

      A: ‘my data are well compatible with “no effect”’

      Therefore:

      C: ‘I should not interpret them as any kind of evidence for an effect’

      The large p-value is also compatible with a small effect. Or even a large effect with noisy measurements or small sample size. Why is it that arguments for NHST always contain blatant logical fallacies? It is almost as if learning NHST warps the mind.

      Also, you shouldn’t be looking for “effects” to begin with. It is a waste of time; there is always an effect (ignoring some rare cases in physics where exactly zero correlation is predicted, but then that is no longer NHST since the null is predicted by theory).
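      To make this concrete, here is a minimal simulation sketch (all numbers are arbitrary and purely illustrative): a real but modest effect, measured noisily in a small sample, produces p > 0.05 most of the time.

      # Illustrative only: a true nonzero effect plus noisy measurements and a
      # small sample routinely yields a "non-significant" p-value.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      true_effect, noise_sd, n, sims = 0.5, 2.0, 10, 5000  # assumed values

      large_p = 0
      for _ in range(sims):
          a = rng.normal(true_effect, noise_sd, n)  # group with a real effect
          b = rng.normal(0.0, noise_sd, n)          # group with no effect
          large_p += stats.ttest_ind(a, b).pvalue > 0.05

      print(f"share of p > 0.05 despite a true effect: {large_p / sims:.2f}")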

      • “The large p-value is also compatible with a small effect. Or even a large effect with noisy measurements or small sample size.” This doesn’t contradict what I wrote at all.

        • Posting again…

          Your conclusion doesn’t follow from the premise (there was a P->A typo):

          P: ‘my data are well compatible with “no effect”’

          Therefore:

          C0: ‘I should not interpret them as any kind of evidence for an effect’
          C1: ‘I should interpret them as evidence for a small effect’
          C2: ‘I should interpret them as evidence for a large effect hidden by noisy measurements’
          C3: ‘I should interpret them as evidence for a large effect with small sample size’
          C4: ‘I should interpret them as evidence I need to do a subgroup analysis to find the effect’

          There is no reason to prefer C0 over anything else. You could say something like:

          C5: ‘I should not interpret them as any kind of evidence for an effect that was measured carefully with large enough sample size’

          However, there are few situations wherein the scenario described in C5 is found along with a high p-value. Pretty much the only time you come across that situation is when the null hypothesis is predicted by theory. Since “everything is correlated with everything else”, it just isn’t really applicable to any real-world situation.

        • “There is no reason to prefer C0 over anything else.” Yes there is. C0 is about avoidance of overinterpretation, the others are (mostly) about committing overinterpretation.

  2. Below is a summary of a study from an academic paper.
    The study aimed to test how two different drugs impact whether
    a patient recovers from a certain disease. Subjects were randomly
    drawn from a fixed population and then randomly assigned to Drug
    A or Drug B. Fifty-two percent (52%) of subjects who took Drug A
    recovered from the disease while forty-four percent (44%) of subjects
    who took Drug B recovered from the disease.

    A test of the null hypothesis that there is no difference between Drug
    A and Drug B in terms of probability of recovery from the disease
    yields a p-value of 0.025.

    […]

    Assuming no prior studies have been conducted with these drugs, if
    you were a patient from the same population as the subjects in the
    study, what drug would you prefer to take to maximize your chance
    of recovery?
    A. I prefer Drug A.
    B. I prefer Drug B.
    C. I am indifferent between Drug A and Drug B.

    http://statmodeling.stat.columbia.edu/wp-content/uploads/2017/11/jasa_combined.pdf

    They choose A as the correct answer but in reality the answer is C or perhaps the missing D “not enough info”. Drug A and drug B have different side effects, different prices, etc. Shouldn’t that influence your decisions?

    Also if this is one study, who knows what the long term effects are? It could be worse than the disease. Would the patients have recovered from the disease in a few days anyway by taking no drug? The idea that a meaningful decision could even be made here is wrong.

    Anyway, their data still clearly shows the statisticians are focusing on the p-value and dichotomizing (my spell check suggests lobotomizing for that one).
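    For concreteness, here is a sketch of the two-proportion z-test behind the quoted summary. The excerpt does not report sample sizes, so the per-arm n below is a hypothetical value chosen only to land near the quoted p-value.

    # Two-proportion z-test for the quoted 52% vs. 44% recovery rates.
    # n_per_arm is NOT given in the excerpt; 400 is an assumption for illustration.
    import math
    from scipy.stats import norm

    p_a, p_b = 0.52, 0.44  # recovery rates for Drug A and Drug B (from the quote)
    n_per_arm = 400        # hypothetical sample size per arm

    pooled = (p_a + p_b) / 2                               # pooled rate (equal arms)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)  # SE of the difference
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))                          # two-sided p-value

    print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")     # close to the quoted 0.025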

    • > They choose A as the correct answer but in reality the answer is C

      Why do you think so? Is there any set of numbers (instead of 52%, 44%, 0.025) that would make you change your mind?

      • Why do you think so?

        Is it not clear from the next sentence and paragraph? The question wants me to make a decision without considering any costs or the magnitude of the benefit. I don’t want to take a drug for a cold that cures the cold but makes me allergic to everything as a side effect.

        Is there any set of numbers (instead of 52%, 44%, 0.025) that would make you change your mind?

        No, even if an Omniscient Jones told me Drug A had a 100% cure rate and Drug B 0%,* there would not be enough information available to make a rational decision. Assuming the data were collected carefully, etc., those estimates are useful info. However, they are insufficient for making a rational decision.

        *The p-values for these cases are irrelevant visual clutter to me; I literally skip over them like you would a string of misrendered characters.

        • Is it possible to make a rational decision for a real world question, ever? No matter how much information you have, there will always be information that you do not have. If you cannot use imperfect information, you’re doomed to be “C: indifferent” forever.

        • Is it possible to make a rational decision for a real world question, ever?

          Sure. For example, if someone recommends a medical treatment to you without considering costs or the magnitude of the benefit, it is a rational decision to go get your advice somewhere else.

          No matter how much information you have, there will always be information that you do not have. If you cannot use imperfect information, you’re doomed to be “C: indifferent” forever.

          You seem to be forming a(nother) dichotomy between perfect and imperfect information, when that isn’t the real issue. As you say, there is always some level of imperfection in our information. The info provided for answering the question is extremely poor.

          Do people really just take/recommend medical treatments without considering the costs, side effects, or value?

        • You introduced the dichotomy. You say that the data provided is insufficient to make a rational decision.
          I’m sorry if I misrepresented your position as “anything short of perfect information is insufficient to make a rational decision.” It is clear now that what you think is “as you add data there is a point where it is suddenly sufficient to make a rational decision.”

          I think in the survey question there is an implicit “based only on the available information/ceteris paribus”. I am sure that if that premise was explicit you would agree with answer A.

        • I think in the survey question there is an implicit “based only on the available information/ceteris paribus”. I am sure that if that premise was explicit you would agree with answer A.

          Ok, but then this has no relevance to a real world situation. I think these toy statistics problems may be doing more harm than good… I just had a friend have a bad reaction to a pill because the doctor apparently just blindly prescribed it based on a described problem (no consideration of side effects, cost, lifestyle, etc).

          It is amazing to me that people actually are acting in this way. It seems so contrary to nature that I think it must be something they were trained to do.

    • Your quibbles are with a different question, of broader impact, than the one specifically asked.

      The test was about the probability of recovery from the disease. The question was “what drug would you prefer to take to maximize your chance of recovery?” Therefore, the answer was A. If the question was what was my probability of feeling better or avoiding all other diseases and consequences or something else, sure, C is the answer. But the question was specifically about what the test was about and nothing else.

      • That is true. I didn’t read the question carefully enough, pretty much stopping at “what drug would you prefer to take”. On the other hand, it doesn’t seem like there is any kind of real decision to be made then.

  3. I have an issue with the continued use of the Griskevicius et al. example to diminish the benefit of CIs. I think it highlights the benefit of CIs. The CI procedure doesn’t force me to conclude the population effect is between 4 and 36%, because I could also conclude, based on the finding and an understanding of the literature, just as Gelman et al. did, that the sample is unrepresentative. And on concluding it’s unrepresentative I can decide further study is needed and not publish stupid numbers and claims.

    Consider that if CIs were really so bad, they couldn’t be used as the foundation of the argument that that finding is clearly a bad sample. The oversimplified use of CIs that Andrew seems to assume here makes for a poor argument, and it is contradicted by his own use of them to show that the sample is poor and that the authors shouldn’t have reached the conclusion they did. That’s a mistake by Griskevicius et al., not by CIs.

    And as an aside, even if all authors used CIs oversimplistically and failed to use them to help recognize unrepresentative samples, they’d only be incorrect about the claim of where the population value lies (assuming a correct CI) 5% of the time. That has to be considered an improvement over the current use of the p-value, with its approximately 50% error rate.

    • Psyoskeptic:

      Conf intervals are fine, for what they are, if all of them are reported. But if the pathway to success is to report a conf interval that excludes zero, then we have all the familiar problems with researcher degrees of freedom and forking paths. So the error rate can be a lot more than 5%.
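      To put a number on that, here is a toy simulation (independent outcomes, purely illustrative): when the reported interval can be whichever of several excludes zero, the realized error rate under the null is far above the nominal 5%.

      # Illustrative forking-paths simulation: k candidate outcomes, all with a
      # true effect of exactly zero; "success" is any 95% interval excluding zero.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      n, k, sims = 50, 10, 2000  # observations per outcome, outcomes, simulated studies

      false_claims = 0
      for _ in range(sims):
          data = rng.normal(0.0, 1.0, size=(k, n))
          pvals = [stats.ttest_1samp(x, 0.0).pvalue for x in data]
          false_claims += min(pvals) < 0.05  # some interval excludes zero

      print(f"share of pure-noise studies declared a success: {false_claims / sims:.2f}")
      # roughly 1 - 0.95**10, i.e. about 0.40 rather than 0.05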

      • But that’s really back to arguing the first part of your problem with CIs… they’re just used as hypothesis tests. And, in that case, I 100% agree. You can lead researchers to CIs but you can’t make them think (Cumming). You may be unfamiliar with his paper on that, but some medical journals mandated CIs and he reviewed the changes in results sections. What occurred was that people just used them like NHST and there was no change at all. (Which does make one pause and wonder what the hell he’s still pushing them for if mandated ones had no effect.)

        But, if just using them as an estimate isn’t really a problem, then we’re probably overall in agreement, but perhaps of different minds on how successful one could be at teaching them properly. I think it could be done, but AFAIK there is absolutely no information (not even from Cumming) on how to report and discuss a CI as the primary inference in a paper (i.e., several examples showing how to do it right). And I’ve only ever seen it done correctly once. Maybe a paper on that from its proponents would be more useful than blathering on about the distribution of the CI!! (OK, now I’m just on a tangent about Cumming.) I just added this bit to point out that doom and gloom about CIs is often only justified from a limited perspective.

        • > But that’s really back to arguing the first part of your problem with CIs… they’re just used as hypothesis tests.

          One issue with trying to move people to things like CIs is that they _are_ just (inverted) hypothesis tests.

          I actually prefer the idea of giving a point estimator and then a bootstrapped (re-)sampling distribution for it. No test involved, just a) a reasonable summary of the given data and b) what it would look like for ‘similar’ data.

          Point estimation gets a bad rap vs interval estimation but at least it isn’t as tied to hypothesis testing or e.g. ‘trapping the true value’ as CI-style interval estimation. The uncertainty seems to arise more naturally from ‘robustness’ or ‘stability’ considerations.
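          For example, a minimal sketch of that approach (made-up data, bootstrapping the sample mean) might look like this:

          # Point estimate plus a bootstrapped sampling distribution; no test involved.
          # The data here are made up purely for illustration.
          import numpy as np

          rng = np.random.default_rng(2)
          data = rng.normal(1.0, 3.0, size=40)  # the observed sample

          estimate = data.mean()                # a) summary of the given data
          boot = np.array([rng.choice(data, size=data.size, replace=True).mean()
                           for _ in range(4000)])  # b) what 'similar' data would give

          print(f"point estimate: {estimate:.2f}")
          print("bootstrap 2.5/50/97.5 percentiles:",
                np.percentile(boot, [2.5, 50, 97.5]).round(2))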

        • psyoskeptic:

          “I think it could be done, but AFAIK there is absolutely no information (not even from Cumming) on how to report and discuss a CI as the primary inference in a paper (i.e., several examples showing how to do it right). And I’ve only ever seen it done correctly once.”

          Where was that? Genuinely curious.
