“The difference between . . .”: It’s not just p=.05 vs. p=.06

The title of this post by Sanjay Srivastava illustrates an annoying misconception that’s crept into the (otherwise delightful) recent publicity related to my article with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”

When people bring this up, they keep referring to the difference between p=0.05 and p=0.06, making the familiar (and correct) point about the arbitrariness of the conventional p-value threshold of 0.05. And, sure, I agree with this, but everybody knows that already.

The point Hal and I were making was that even apparently large differences in p-values are not statistically significant. For example, if you have one study with z=2.5 (almost significant at the 1% level!) and another with z=1 (not statistically significant at all, only 1 se from zero!), then their difference has a z of about 1 (again, not statistically significant at all). So it’s not just a comparison of 0.05 vs. 0.06: even the difference between clearly significant and clearly not significant can itself be clearly not statistically significant.
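
Here is the arithmetic behind that example, as a minimal sketch (it assumes the two estimates are independent with equal standard errors; with unequal standard errors the numbers shift a bit, but the point stands):

```python
import math

# z-scores from the example above: one "significant" study, one not
z1, z2 = 2.5, 1.0

# for independent estimates with equal standard errors, the standard error
# of the difference is sqrt(2) times the common standard error
z_diff = (z1 - z2) / math.sqrt(2)

print(round(z_diff, 2))  # ~1.06: the difference is nowhere near significant
```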

The .05 vs .06 thing is fine, but I fear that it obscures our other point, and it could mislead researchers into thinking they are safe in drawing firm conclusions from apparently large p-value differences (for example, .10 vs. .01).

20 thoughts on ““The difference between . . .”: It’s not just p=.05 vs. p=.06”

    • Taleb’s aphorisms are fun and all, but I don’t think that’s the main point. My reading of this is more along the lines of “we don’t even understand our definition of ‘evidence’ when it comes to significance testing” (not quite as aphorism-sounding but oh well).

      Thanks for this clarification; my first reaction from the title of this paper was also “this must be something about the threshold of significance not being well defined.” Will have to put it in the queue of stuff-I-should-read-in-the-near-future.

        • I’m not convinced by the argument there. The cases given (of absence of evidence) are in fact instances of evidence of absence.

          That “A is evidence for B” implies “not A is evidence for not B” is perfectly reasonable. But I would call “not A” evidence, and by “absence of evidence” I would mean that we have no information about whether or not A occurred.
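
          For what it’s worth, if “A is evidence for B” is read as “A raises the probability of B,” that implication follows from the law of total probability; a short derivation (my own gloss, not part of the proof being discussed):

```latex
% Reading "A is evidence for B" as P(B | A) > P(B), with 0 < P(A) < 1:
\[
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,\bigl(1 - P(A)\bigr)
\]
\[
P(B \mid \neg A) = \frac{P(B) - P(B \mid A)\,P(A)}{1 - P(A)}
                 < \frac{P(B) - P(B)\,P(A)}{1 - P(A)} = P(B)
\]
% So if A raises the probability of B, then "not A" must lower it.
```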

        • This probably requires more thought than I can muster in a blog comment, but one issue with applying the proof is that evidence is described in terms of the probability of a hypothesis.

          I’m no fan of hypothesis testing, but the vast majority of these studies are framed as hypothesis tests. It would be wrong to say that a significant p-value is equivalent to definition 1 (where B is the null hypothesis being false), since hypothesis testing does not say anything about the probability of a model. Perhaps it makes more sense to apply the proof within a Bayes Factor approach to model selection, but that would be somewhat dependent on the prior on the space of models being correct.

      • The example I usually think about is, say, significant evidence for a treatment effect in females (say p=0.04) but not in males (say p=0.10). We can’t say that there is no treatment effect in males. And (along the lines of the topic here), we can’t say that the treatment effect is different in females than in males. For that, we need to explicitly examine the treatment-by-sex interaction.
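
        To make that concrete, here is a rough sketch of the explicit interaction test for two independent subgroups; the effect estimates and standard errors below are invented purely for illustration:

```python
import math
from scipy import stats

# hypothetical subgroup estimates chosen to match the pattern above
est_f, se_f = 0.40, 0.195   # females: z ~ 2.05, p ~ 0.04 ("significant")
est_m, se_m = 0.32, 0.195   # males:   z ~ 1.64, p ~ 0.10 ("not significant")

# Wald test for the treatment-by-sex interaction (difference of effects),
# assuming the two subgroup estimates are independent
diff = est_f - est_m
se_diff = math.sqrt(se_f**2 + se_m**2)
z = diff / se_diff
p = 2 * stats.norm.sf(abs(z))

print(round(z, 2), round(p, 2))  # ~0.29, ~0.77: no evidence the effects differ
```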

  1. It can be a subtle point of language. It is perfectly fine to say “as opposed to previous studies which found no relationship between A and B (p=0.05000001), our study found a significant correlation (p=0.049999998).” However, it is not fine to say “The correlation between A and B in our study population differs from previous studies.”

    • Ian:

      Please read the title above. I’m not just talking about 0.051 vs. 0.049. The same issue arises when comparing a clearly “significant” p-value of 0.01 to a clearly “non-significant” 0.1.

      • The point I was making about what constitutes acceptable language for publication applies equally well to the .1/.01 p-value difference. The first sentence is descriptive of what was found in the different studies, whereas the second makes a claim that the populations under study are different. It is always okay to use the language in the first sentence, whereas the second sentence requires a formal test. I’ve found that it is easy for researchers to slip into saying the second when they mean the first.
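
        For the correlation case, one standard version of that formal test compares the two correlations through Fisher’s z-transformation; the correlations and sample sizes below are made up for illustration:

```python
import math
from scipy import stats

r1, n1 = 0.28, 100   # "our study": r is significant on its own
r2, n2 = 0.10, 100   # "previous study": r is not significant on its own

# Fisher z-transform; the transformed values are approximately normal
# with standard error 1/sqrt(n - 3), assuming independent samples
fz1, fz2 = math.atanh(r1), math.atanh(r2)
se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z = (fz1 - fz2) / se
p = 2 * stats.norm.sf(abs(z))

print(round(z, 2), round(p, 2))  # ~1.31, ~0.19: can't claim the populations differ
```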

  2. I am often struck by different approaches to statistical significance.
    My first 10 years post-PhD were spent at Bell Labs, which had some fine statisticians and many others who needed to use good statistics,
    a) in scientific research
    b) in engineering analyses
    c) in planning systems, network systems, etc. For instance, one really needs to know about “busy-hours” and such.

    I can’t remember people worrying much about specific significance levels.
    Of course, with Tukey around, EDA was common, just to get insight about what was going on.

    More to the point were questions like:
    a) Given choices A vs B, how strong is the data supporting one over the other?
    [Of course, in many cases, the design space was more complex than A vs B.]
    b) If there isn’t enough data, can we get more in a timely fashion? Here is where significance arguments might appear.
    c) At some point, we will have to make decisions, with the best information we have at that point.

  3. Also, of course, sometimes the difference between two statistically non-significant results can be statistically significant, for example z = 1.1 vs z = -.93.

    Statistical significance is a really poor indicator of anything, except maybe as a first pass when trying to filter a large number of data points to look more carefully at a few of them which seem statistically inconsistent with some base model. Even then, in the absence of a more careful analysis method for the filtered ones, you’re just asking to find noise (a la sex ratios of engineers or whatever).
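
    A quick simulation of that filtering worry (the number of tests and group sizes here are arbitrary): when the base model is in fact true everywhere, a 5% filter still flags roughly 5% of the comparisons, and every flag is noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group = 1000, 50

# all true effects are exactly zero by construction
a = rng.normal(0.0, 1.0, size=(n_tests, n_per_group))
b = rng.normal(0.0, 1.0, size=(n_tests, n_per_group))

# one two-sample t-test per row
_, p = stats.ttest_ind(a, b, axis=1)

print((p < 0.05).mean())  # ~0.05: about 50 "discoveries", all of them spurious
```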

    • I found that article to be a frustratingly poor treatment of the issues. There is a bunch of hand waving to connect “Humean” skepticism about causality to the current problems in medical science. But no real connection.

      Actually, it made me think I should cancel my (quite cheap) Wired subscription.

      • Yeah, I had not read the article ahead of time. I found it terrible. I think Jonah Lehrer is becoming as bad as Malcolm Gladwell. Can you imagine if Lehrer read anything about natural experiments and causality? He might have a heart attack.

  4. Pingback: Economist's View: Links for 2011-12-20

  5. Nonetheless, you can talk about differences of differences, or whatever, but a p-value of .05 or smaller will still happen only 5% of the time if the null hypothesis (and underlying assumptions) are true. And no matter how you cut it, 5% is not very likely, so you have good reason to suspect something is going on and investigate further. Of course, confidence intervals can be much better, since they let you see clearly how much economic significance there is.
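
     As a small illustration of that last point, with an invented estimate and standard error:

```python
from scipy import stats

est, se = 0.8, 0.35  # hypothetical effect estimate and standard error

z = est / se
p = 2 * stats.norm.sf(abs(z))
lo, hi = est - 1.96 * se, est + 1.96 * se

print(round(p, 3), (round(lo, 2), round(hi, 2)))
# p ~ 0.022 ("significant"), 95% CI ~ (0.11, 1.49): the interval spans effects
# from practically negligible to quite large, which the p-value alone hides.
```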

  6. Pingback: Extremely dichotomous » Source-Filter
