## “The difference between . . .”: It’s not just p=.05 vs. p=.06

The title of this post by Sanjay Srivastava illustrates an annoying misconception that’s crept into the (otherwise delightful) recent publicity related to my article with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”

When people bring this up, they keep referring to the difference between p=0.05 and p=0.06, making the familiar (and correct) point about the arbitrariness of the conventional p-value threshold of 0.05. And, sure, I agree with this, but everybody knows that already.

The point Hal and I were making was that even apparently large differences in p-values are not statistically significant. For example, if you have one study with z=2.5 (almost significant at the 1% level!) and another with z=1 (not statistically significant at all, only 1 se from zero!), then their difference has a z of about 1 (again, not statistically significant at all). So it’s not just a comparison of 0.05 vs. 0.06, even a difference between clearly significant and clearly not significant can be clearly not statistically significant.
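The z = 2.5 vs. z = 1 arithmetic above can be checked in a few lines of Python. This is a minimal sketch assuming two independent estimates, each with standard error 1 (so each estimate equals its z score); the helper name `two_sided_p` is my own.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return 2 * (1 - norm.cdf(abs(z)))

# Two independent estimates, each with standard error 1.
z1, z2 = 2.5, 1.0
print(two_sided_p(z1))  # ~0.012 -- "significant", almost at the 1% level
print(two_sided_p(z2))  # ~0.317 -- "not significant", 1 se from zero

# The difference of the estimates has standard error sqrt(1^2 + 1^2).
z_diff = (z1 - z2) / sqrt(2)
print(z_diff)               # ~1.06 -- itself not statistically significant
print(two_sided_p(z_diff))  # ~0.29
```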

The .05 vs. .06 thing is fine, but I fear that it obscures our other point, and it could mislead researchers into thinking they are safe drawing firm conclusions from apparently large p-value differences (for example, .10 vs. .01).

1. Karl Broman says:

In other words, absence of evidence is not evidence of absence.

• revo11 says:

Taleb’s aphorisms are fun and all, but I don’t think that’s the main point. My reading of this is more along the lines of “we don’t even understand our definition of ‘evidence’ when it comes to significance testing” (not quite as aphorism-sounding but oh well).

Thanks for this clarification; my first reaction to the title of this paper was also “this must be something about the threshold of significance not being well defined.” Will have to put it in the queue of stuff-I-should-read-in-the-near-future.

• Bruce McCullough says:

But absence of evidence IS evidence of absence!
See, e.g., http://oyhus.no/AbsenceOfEvidence.html for an elementary proof.

• Karl Broman says:

I’m not convinced by the argument there. The cases given (of absence of evidence) are in fact instances of evidence of absence.

That “A is evidence for B” implies “not A is evidence for not B” is perfectly reasonable. But I would still call “not A” evidence; by “absence of evidence” I would mean that we have no information about whether or not A occurred.

• LemmusLemmus says:

As Eliezer Yudkowsky pointed out, absence of evidence is evidence of absence to the extent that we should expect evidence for claim X to exist if claim X were true.
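This point can be made precise with Bayes’ rule. Here is a small sketch; the `posterior` function and the numbers are purely illustrative, not from any of the linked sources.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false=0.0):
    """P(claim X | evidence absent), by Bayes' rule."""
    p_absent_true = 1 - p_evidence_if_true    # P(no evidence | X true)
    p_absent_false = 1 - p_evidence_if_false  # P(no evidence | X false)
    num = p_absent_true * prior
    return num / (num + p_absent_false * (1 - prior))

# Starting from a 50/50 prior on claim X:
# If evidence would very likely exist were X true, its absence is strong
# evidence of absence ...
print(posterior(0.5, 0.9))  # ~0.09
# ... but if evidence was unlikely to turn up anyway, absence tells us little.
print(posterior(0.5, 0.1))  # ~0.47
```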

• revo11 says:

This probably requires more thought than I can muster in a blog comment, but one issue with applying the proof is that evidence is described in terms of the probability of a hypothesis.

I’m no fan of hypothesis testing, but the vast majority of these studies are framed as hypothesis tests. It would be wrong to say that a significant p-value is equivalent to definition 1 (where B is the null hypothesis being false), since hypothesis testing does not say anything about the probability of a model. Perhaps it makes more sense to apply the proof within a Bayes factor approach to model selection, but that would be somewhat dependent on the prior on the space of models being correct.
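As a sketch of the Bayes factor alternative mentioned here: unlike a p-value, a Bayes factor directly compares the likelihood of the data under two models. The binomial setting, the `bayes_factor` helper, and the numbers below are all illustrative assumptions, not anything from the thread.

```python
from math import comb

def bayes_factor(k, n, p1, p0=0.5):
    """Bayes factor for H1 (success prob p1) vs. H0 (success prob p0),
    given k successes in n binomial trials."""
    lik = lambda p: comb(n, k) * p**k * (1 - p) ** (n - k)
    return lik(p1) / lik(p0)

# 60 successes in 100 trials: moderate evidence for p=0.6 over p=0.5.
print(bayes_factor(60, 100, 0.6))  # ~7.5
# 50 successes in 100 trials: the data favor neither model.
print(bayes_factor(50, 100, 0.6))  # < 1
```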

• Karl Broman says:

The example I usually think about is, say, significant evidence for a treatment effect in females (say p=0.04) but not in males (say p=0.10). We can’t say that there is no treatment effect in males. And (along the lines of the topic here), we can’t say that the treatment effect is different in females than in males. For that, we need to explicitly examine the treatment-by-sex interaction.
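This subgroup example can be worked through numerically. A sketch assuming independent subgroup estimates, each with standard error 1 (so each estimate equals its z statistic); the helper names are mine.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def z_from_p(p):
    """z statistic implied by a two-sided p-value."""
    return norm.inv_cdf(1 - p / 2)

# Subgroup results (SE = 1 assumed for both groups).
z_f = z_from_p(0.04)  # ~2.05, "significant" effect in females
z_m = z_from_p(0.10)  # ~1.64, "not significant" effect in males

# Treatment-by-sex interaction: test the *difference* of the two effects.
z_int = (z_f - z_m) / sqrt(2)
p_int = 2 * (1 - norm.cdf(abs(z_int)))
print(z_int, p_int)  # ~0.29, p ~ 0.77: no evidence the effects differ
```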

2. Ian Fellows says:

It can be a subtle point of language. It is perfectly fine to say “as opposed to previous studies which found no relationship between A and B (p=0.05000001), our study found a significant correlation (p=0.049999998).” However, it is not fine to say “The correlation between A and B in our study population differs from previous studies.”

• Andrew says:

Ian:

Please read the title above. I’m not just talking about 0.051 vs. 0.049. The same issue arises when comparing a clearly “significant” p-value of 0.01 to a clearly “non-significant” 0.1.

• Ian Fellows says:

The point I was making about what constitutes acceptable language for publication applies equally well for the .1/.01 p-value difference. The first sentence is descriptive about what was found in the different studies, whereas the second makes a claim that the populations under study are different. It is always okay to use the language in the first sentence, whereas the second sentence requires a formal test. I’ve found that it is easy for researchers to slip into saying the second when they mean the first.

3. John Mashey says:

I am often struck by different approaches to statistical significance.
My first 10 years post-PhD were spent at Bell Labs, which had some fine statisticians and many others who needed to use good statistics,
a) in scientific research
b) in engineering analyses
c) in planning systems, network systems, etc. For instance, one really needs to know about “busy-hours” and such.

I can’t remember people worrying much about specific significance levels.
Of course, with Tukey around, EDA was common, just to get insight about what was going on.

More to the point were questions like:
a) Given choices A vs B, how strong is the data supporting one over the other?
[Of course, in many cases, the design space was more complex than A vs B.]
b) If there isn’t enough data, can we get more in a timely fashion? Here is where significance arguments might appear.
c) At some point, we will have to make decisions, with the best information we have at that point.

4. Also, of course, sometimes the difference between two statistically non-significant results can itself be statistically significant, for example z = 1.5 vs. z = -1.5.
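A quick numerical check of this sign-flip case, assuming two independent estimates with standard error 1 (the particular z values are illustrative):

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return 2 * (1 - norm.cdf(abs(z)))

# Two independent estimates with SE = 1, each individually "not significant".
z1, z2 = 1.5, -1.5
print(two_sided_p(z1))  # ~0.13
print(two_sided_p(z2))  # ~0.13

# But their difference is significant at the 5% level.
z_diff = (z1 - z2) / sqrt(2)
print(z_diff, two_sided_p(z_diff))  # ~2.12, p ~ 0.034
```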

Statistical significance is a really poor indicator of anything, except maybe as a first pass when filtering a large number of data points to look more carefully at the few that seem statistically inconsistent with some base model. Even then, in the absence of a more careful analysis of the filtered cases, you’re just asking to find noise (à la sex ratios of engineers or whatever).

5. Jonathan says:

Hey Andrew,
Of interest:
http://www.wired.com/magazine/2011/12/ff_causation/all/1

This is the new article by Jonah Lehrer on causation!

• Dean Eckles says:

I found that article to be a frustratingly poor treatment of the issues. There is a bunch of hand waving to connect “Humean” skepticism about causality to the current problems in medical science. But no real connection.

Actually, it made me think I should cancel my (quite cheap) Wired subscription.

• Jonathan says:

Yeah, I had not read the article ahead of time. I found it terrible. I think Jonah Lehrer is becoming as bad as Malcolm Gladwell. Can you imagine if Lehrer read anything about natural experiments and causality? He might have a heart attack.

6. LemmusLemmus says:

Apparently, this is a point many neuroscience papers get wrong: