## The p-value is not . . .

From a recent email exchange:

I agree that you should never compare p-values directly. The p-value is a strange nonlinear transformation of data that is only interpretable under the null hypothesis. Once you abandon the null (as we do when we observe something with a very low p-value), the p-value itself becomes irrelevant. To put it another way, the p-value is a measure of evidence; it is not an estimate of effect size (as it is often treated, with the idea that a p=.001 effect is larger than a p=.01 effect, etc.). Even conditional on sample size, the p-value is not a measure of effect size.

### 21 Comments

1. Tom says:

So, the same p-value could be interpreted by different researchers differently depending on how the decision for abandoning the null hypothesis is defined by each researcher…

2. Greg says:

I’m surprised that you accept the claim that a p-value is a measure of evidence. What about the Likelihood Principle? I know that you don’t like the Likelihood Principle as an argument for Bayesian methods and really don’t like it as an argument against Bayesian model checking (and I agree with you there), but I didn’t think that you rejected it as a claim about the evidential meaning of the data within an accepted model.

• Michael Lew says:

All of the examples that I’ve seen where P-values are inconsistent with the likelihood principle depend on the P-value being ‘corrected’ for sequential testing or negative binomial sampling (which is a type of sequential testing). P-values that are calculated without regard to the stopping rules are entirely consistent with the likelihood principle.
P-values used as evidence are ‘post-data’, but P-values mis-used as indices of error rates are ‘pre-data’ and are inconsistent with the likelihood principle. Our entire understanding of P-values seems to be contaminated by the error-decision framework of Neyman and Pearson. That is unfortunate.
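The negative binomial case mentioned in this comment can be made concrete with the standard textbook coin example. The sketch below is only an illustration: the data (9 successes, 3 failures) and the null value of 0.5 are assumptions that do not come from this thread. It computes the one-sided p-value under two stopping rules that give the same likelihood function up to a constant.

```python
from math import comb

# Assumed illustrative data: 9 successes and 3 failures, null hypothesis theta = 0.5.
successes, failures = 9, 3
n = successes + failures

# Design 1 (binomial): the number of trials n = 12 was fixed in advance.
# p-value = P(at least 9 successes in 12 trials | theta = 0.5)
p_binomial = sum(comb(n, k) for k in range(successes, n + 1)) / 2**n

# Design 2 (negative binomial): sample until the 3rd failure.
# "At least 9 successes before the 3rd failure" is the same event as
# "at most 2 failures in the first 11 trials".
m = n - 1
p_negative_binomial = sum(comb(m, j) for j in range(failures)) / 2**m


def likelihood_ratio(theta_a, theta_b):
    """Likelihood ratio for theta_a versus theta_b.

    The combinatorial constants differ between the two designs but cancel
    in the ratio, so this value is the same under either stopping rule.
    """
    num = theta_a**successes * (1 - theta_a) ** failures
    den = theta_b**successes * (1 - theta_b) ** failures
    return num / den


print(f"binomial p-value:          {p_binomial:.4f}")
print(f"negative binomial p-value: {p_negative_binomial:.4f}")
print(f"likelihood ratio, 0.75 vs 0.5: {likelihood_ratio(0.75, 0.5):.2f}")
```

The two p-values differ only because the tail area is defined over a different sample space. Ignoring the stopping rule amounts to always reporting the first one.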

• alex says:

P-values are inconsistent with the likelihood principle as they depend upon the probability under the null of observing data equal to or more extreme than that recorded. The likelihood principle is only interested in the probability of observing data equal to that recorded.

• Michael Lew says:

Your statement about the interests of the likelihood principle seems a bit bizarre. The likelihood principle says that all of the information in a data sample about an unknown parameter is contained in the likelihood function. How does the fact that P-values depend on tail areas mean that P-values are inconsistent with the likelihood principle?

People often talk as if there is some mutual incompatibility between P-values and likelihoods. There isn’t. The incompatibility is between Neyman-Pearsonian error rates and likelihood, but those error rates do not attempt to tell you about the observed data. If the distinction between global error rates and P-values as evidence is unfamiliar to you then I suggest you read my paper in the British Journal of Pharmacology (yes, it’s aimed at non-statisticians): http://www.ncbi.nlm.nih.gov/pubmed/22394284

3. John Johnson says:

In meta-analysis one exploits the connection between p-values and standardized effect sizes all the time. I agree the interpretation of p-value and effect size often gets carried too far, but there is still a connection.

• Thom says:

The connection between standardized effect size and p value is more straightforward than for effect size and p value. A standardized effect such as d is a measure of the discriminability or detectability of an effect (and significance tests are a rule for deciding you’ve detected an effect).

It is trivial to show that a small effect measured with little error can give a larger standardized effect than a large effect measured with lots of error.
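A minimal numerical sketch of that point, with made-up numbers and the simplifying assumption of a one-sample z-test with known standard deviation (the `summarize` helper is hypothetical, not something from the thread):

```python
from math import sqrt
from statistics import NormalDist


def summarize(mean_diff, sd, n):
    """Return (standardized effect d, two-sided z-test p-value).

    Assumes a one-sample z-test with known standard deviation `sd`.
    """
    d = mean_diff / sd                  # standardized effect size
    z = mean_diff / (sd / sqrt(n))      # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, p


# Same sample size in both hypothetical studies.
small_precise = summarize(mean_diff=1.0, sd=2.0, n=50)   # small raw effect, little noise
large_noisy = summarize(mean_diff=5.0, sd=25.0, n=50)    # large raw effect, lots of noise

print("small effect, low noise:  d = %.2f, p = %.4f" % small_precise)
print("large effect, high noise: d = %.2f, p = %.4f" % large_noisy)
```

With the sample size held fixed, the study with the raw effect five times smaller has both the larger standardized effect and the smaller p-value, so neither quantity ranks the raw effects.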

4. Richard D. Morey says:

Under what theory is the p value a measure of evidence? Or, put another way, how are you defining evidence such that the p value is a measure of it?

• Same thing here.

P-values are not a measure of evidence in the sense of a measure of support for H, see Schervish 1996

http://www.cs.ubc.ca/~murphyk/Teaching/CS532c_Fall04/Papers/schervish-pvalues.pdf

That is, a low p-value does not provide more evidence against the null in an absolute sense.

• The p-value is not a coherent measure of support, see Schervish 1996:

http://www.cs.ubc.ca/~murphyk/Teaching/CS532c_Fall04/Papers/schervish-pvalues.pdf

So the p-value by itself cannot be seen as an absolute measure of evidence, in the sense that a moderate p-value is not evidence in favor of the null even in an appropriate frequentist interpretation (you would have to check the power or severity under alternatives, as Mayo would put it).

• Michael Lew says:

If you read that Schervish paper closely you will see that he says explicitly that one-tailed P-values ARE coherent. I recently asked him by email to confirm that point and he did so. Given that the most important reasons to prefer 2-tailed P-values over 1-tailed P-values relate to the misuse of P-values as indices of Neyman-Pearsonian type I errors, we should just use 1-tailed P-values.

(I don’t understand how that Schervish paper was published with a title that promises so much more than the paper delivers.)

• Michael,

I don’t think the problem is with the two-tailed p-values, but with trying to interpret them as a measure of support. Even in comparing two one-sided tests, the p-value alone cannot be interpreted as a measure of evidence.

But as to examples, check these two examples from Patriota out:

http://arxiv.org/pdf/1201.0400.pdf

Examples 1.1 and 1.2.

• Michael Lew says:

I’ve looked at the paper that you linked, but I cannot decipher those examples because their mathematical language and notation are beyond my ken. However, the first one mentions Wald so I suspect that it might contain some sequential issues, and the second mentions what I take to be a Neyman-Pearson style alpha. Thus it is possible that they are both examples where P-values are calculated according to rules that are designed to serve the purposes of N-P hypothesis testing rather than Fisherian significance testing. It is the latter that claims to yield P-values that act as indices of evidence, whereas the former explicitly eschews the notion of probabilistic evidence.

This section on page 5 of that paper may support what I am saying: “This issue happens because a p-value was not designed to be a measure of evidence over subsets of Θ. We must say that p-values do exactly the job they were defined to do.”

5. Dave Giles says:

And let’s not forget that a p-value is a random variable – e.g., http://davegiles.blogspot.ca/2011/04/may-i-show-you-my-collection-of-p.html
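A quick simulation sketch of that point. Everything here is assumed for illustration (normal data with known unit variance, n = 30, a hypothetical true mean of 0.3 under the alternative); it is not taken from the linked post.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_p_values(true_mean, n=30, reps=5000, seed=0):
    """Two-sided z-test p-values for H0: mean = 0, over repeated samples.

    Data are drawn from Normal(true_mean, 1), with the variance treated as known.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)     # standard error is 1 / sqrt(n)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def fraction_below(p_values, cutoff=0.05):
    """Share of simulated p-values that fall below `cutoff`."""
    return sum(p < cutoff for p in p_values) / len(p_values)


null_ps = simulate_p_values(true_mean=0.0)   # null is true
alt_ps = simulate_p_values(true_mean=0.3)    # assumed modest true effect

print("share of p < .05 when the null is true: ", fraction_below(null_ps))
print("share of p < .05 with a true mean of 0.3:", fraction_below(alt_ps))
```

When the null is true the p-values are spread roughly uniformly over (0, 1). With the assumed modest effect they still vary a great deal from one replication to the next.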

6. Isn’t it more helpful to think of the p-value explicitly as a measure of “surprise” (rather than “evidence”) — low p-values suggest that what we’ve observed wasn’t really expected if the null were the “true” data generating process?

• Longhai Li says:

I think “inverse surprise” or “compatibility” is what a p-value is measuring, except in a one-sided test.

7. AL says:

The fact is that most applied scientists use p-values as a measure of evidence and of the size of the effect. The most striking example now is association studies in genetics and the so-called “Manhattan plots” (http://en.wikipedia.org/wiki/Manhattan_plot). These Manhattan plots are aesthetically much fancier (showing sharp peaks at interesting locations) than their “size of the estimated effect” counterparts…

The interpretation is: “hey, this genetic marker has a very strong effect because it has a tiny p-value”.

Reasons for this? In my opinion:
a) impossibility of understanding what a p-value is “under the alternative”
b) lack of understanding on both sides (statisticians and applied guys) about the things that one is selling (p-values) and the other is buying (evidence)
c) demands of reviewers and editors: “do what everybody else does”.

• K? O'Rourke says:

AL: In the meta-analysis context, I believe it is a sensible strategy to try to credibly salvage available _evidence_ by reverting to investigating and combining _just_ p-values, as getting defensible _common_ effect sizes or likelihoods for common parameters is often hopeless (given how studies are actually conducted and reported/misreported, getting a not-too-wrong model for effect sizes is hopeless).

Or at least that’s why, in practice, I would more often use Ingram Olkin-like p-value combining methods, as recently revisited by Owen, than my own likelihood and Bayes model-based ones.

Owen, A. B. (2009). “Karl Pearson’s meta-analysis revisited”. Annals of Statistics, 37 (6B), 3867–3892.
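For readers who have not seen p-value combination, here is a minimal sketch. It uses Fisher's method, which is a different and simpler rule than the Pearson-type procedures Owen's paper revisits. It assumes independent studies, and the three p-values are made up.

```python
from math import exp, log


def chi2_sf_even_df(x, k):
    """P(chi-square with 2k degrees of freedom > x).

    For even degrees of freedom the tail probability has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total


def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p_i) is chi-square with 2k df under H0,
    assuming the k p-values are independent and uniform under the null."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    return chi2_sf_even_df(statistic, k)


study_p_values = [0.08, 0.11, 0.06]   # made-up p-values from three hypothetical studies
combined = fisher_combine(study_p_values)
print(f"combined p-value: {combined:.4f}")
```

None of the three hypothetical studies reaches p < .05 on its own, but the combined p-value is smaller than any of them. No effect sizes or likelihoods are needed for the calculation.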

8. Christian Hennig says:

How is the p-value not a measure of evidence *against* the H_0? Of course it cannot measure *support* of the H_0, as Schervish correctly says and as we try to teach our students again and again (without too much success, I’m afraid). But if we don’t forget that lack of evidence against H_0 is not evidence in favour of it, what’s wrong?

(I’m sending this for the second time because the first time apparently hasn’t worked. If you see this message twice, I got this wrong…)
