P-values and statistical practice

From my new article in the journal Epidemiology:

Sander Greenland and Charles Poole accept that P values are here to stay but recognize that some of their most common interpretations have problems. The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). A Bayesian interpretation based on a spike-and-slab model makes little sense in applied contexts in epidemiology, political science, and other fields in which true effects are typically nonzero and bounded (thus violating both the “spike” and the “slab” parts of the model).

I find Greenland and Poole’s perspective to be valuable: it is important to go beyond criticism and to understand what information is actually contained in a P value. These authors discuss some connections between P values and Bayesian posterior probabilities. I am not so optimistic about the practical value of these connections. Conditional on the continuing omnipresence of P values in applications, however, these are important results that should be generally understood.

Greenland and Poole make two points. First, they describe how P values approximate posterior probabilities under prior distributions that contain little information relative to the data:

This misuse [of P values] may be lessened by recognizing correct Bayesian interpretations. For example, under weak priors, 95% confidence intervals approximate 95% posterior probability intervals, one-sided P values approximate directional posterior probabilities, and point estimates approximate posterior medians.

I used to think this way, too (see many examples in our books), but in recent years have moved to the position that I do not trust such direct posterior probabilities. Unfortunately, I think we cannot avoid informative priors if we wish to make reasonable unconditional probability statements. To put it another way, I agree with the mathematical truth of the quotation above, but I think it can mislead in practice because of serious problems with apparently noninformative or weak priors. . . .
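
To see what the quoted correspondence amounts to in the simplest case, here is a minimal numerical sketch (mine, not from Greenland and Poole's article, with made-up numbers for the estimate and standard error): with a normal likelihood and a flat prior, the one-sided P value equals the posterior probability that the true effect has the opposite sign from the estimate, and the 95% confidence interval coincides with the central 95% posterior interval.

```python
# Minimal sketch of the weak-prior correspondence quoted above.
# The numbers are made up for illustration; nothing here comes from the article.
from scipy.stats import norm

estimate = 0.30   # hypothetical point estimate of an effect
se = 0.18         # hypothetical standard error

# Classical one-sided P value for H0: effect <= 0, based on z = estimate/se
z = estimate / se
p_one_sided = 1 - norm.cdf(z)

# Bayesian side: with a flat prior, the posterior is Normal(estimate, se^2),
# so the posterior probability that the effect is actually negative is
post_prob_negative = norm.cdf(0, loc=estimate, scale=se)

print(round(p_one_sided, 4))         # 0.0478
print(round(post_prob_negative, 4))  # 0.0478 -- the same number

# The 95% confidence interval is also the central 95% posterior interval:
print(norm.interval(0.95, loc=estimate, scale=se))  # about (-0.053, 0.653)
```

The arithmetic is not in question; my worry, as described above, is that when the flat prior is itself implausible, unconditional probability statements like the one computed here inherit its problems.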

I really like this article. At its center are three examples: “A P value that worked” (to dismiss a hypothesis of fraud in a local election), “A P value that was reasonable but unnecessary” (in our estimates of the effects of redistricting), and “A misleading P value” (from the notorious Daryl Bem). My statistical thinking has changed a lot in the past few years: more and more, I’ve been favoring informative priors, which puts me in step with the broader statistical and machine learning communities as they have moved away from least squares and toward regularization. Sander Greenland has been a big influence on my attitudes here, so it was great to have the opportunity to explore these ideas in the context of his paper, and in a journal where I’d never published before (#97).

Greenland and Poole’s original article does not appear to be available online, but here’s the abstract, and here’s their rejoinder to my discussion. One reason my article came out so well is that, after writing it, I sent it to Greenland, who pointed out a number of places where I’d misunderstood what he’d written. We went through a few iterations. It was annoying at first, but at any point I could’ve stopped and just published what I had. Instead I stuck it out, swallowed my pride, and ended up with something much improved.

Greenland is one tough town, indeed.

10 thoughts on “P-values and statistical practice”

  1. Pingback: Gelman’s Problems with P-Values « A (Budding) Sociologist's Commonplace Book

  2. If only people would routinely put their papers up on the web, all this pay-per-view nonsense would soon enough collapse. I’ve never come across a journal that objects to this practice, and many have policies that explicitly allow it (though perhaps it could vary across fields).

  3. > stuck it out, swallowed my pride, and ended up with something much improved

    Going for better science over (just) enhancing one’s reputation, always a good move for patients and their families (it is medical research after all!)

    (And Sander is not the easiest to work with)

  4. This is a little off-topic, but imagine you’re analyzing a satellite image trying to detect a rare object on the basis of its color, i.e., imagine there’s not enough spatial resolution in the image to unambiguously identify the object you’re looking for but that there’s enough spectral information (>>3 colors) for color-based tests to have good power.* A typical image contains on the order of 10^5 pixels – perhaps an order of magnitude more than that. As a practical matter, for the rare-object-in-image problem, the threshold for rejecting the null hypothesis corresponds to p much, much less than 0.05. (A threshold of p=0.05 would correspond to more than one “reject null” for every 5×5 block of pixels.) If you set the threshold that high it would be like having the smoke detector in your house going off every 15 minutes. Almost all the alarms would be false positives, you’d probably become conditioned to ignore them, and then you’d be in really big trouble when one of the alarms was for real. With that example in mind, it seems to me that the p-value you choose for your “reject the null” threshold needs to follow from your application and your data. Arbitrarily setting it at 0.05, or any other value, would seem to miss the forest for the trees. (A small numerical sketch of this arithmetic appears after the comment thread.)

    *See http://www.ll.mit.edu/publications/journal/pdf/vol14_no1/14_1hyperspectralprocessing.pdf for an overview of the sort of problem I’m talking about.

    • This is a circumstance where the Neyman-Pearson variant of significance testing shows its weakness. Instead of looking for P less than an arbitrary threshold, inspect the P values directly. The lowest P value corresponds to the location that the test statistic indicates to be the most extreme. Is that one interesting? If yes, then is the next one interesting? Iterate.

      The idea that P values only say less than or greater than something ignores most of the information in the P value. P values should not have to carry a poor reputation that is really a consequence of them being squeezed into a Neyman-Pearsonian hypothesis testing framework.

      • Agreed about N-P. Standard practice for detecting spectrally-structured targets is to use a likelihood ratio test to decide between target absent (H0) and target present hypotheses (H1), i.e., for observation x decide between H0 and H1 based on prob(x|H1)/prob(x|H0). The H0 and H1 hypotheses are defined by signal models. In addition to the likelihood ratio, it’s a good idea to look at fit residuals. (The likelihood ratio just tells you whether the H1 model fits the data better than H0. Just because H1 fits the data better than H0 doesn’t mean it fits it well.) One might reject H1 if the likelihood ratio were favorable but the p-value for the residuals unfavorable. Here’s a pretty good paper describing such an approach – http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-17-20-17391

  5. Pingback: Somewhere else, part 32 | Freakonometrics
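
To put numbers on the smoke-detector argument in comment 4 above, here is a small sketch of my own (the pixel count and the 5×5 block are taken from that comment; the Bonferroni-style correction at the end is just one standard way to see how far below 0.05 a per-pixel threshold has to sit):

```python
# False-alarm arithmetic for the rare-object-in-image example in comment 4.
# My own illustration; the pixel counts are the ones mentioned in that comment.

alpha = 0.05        # conventional per-test "reject the null" threshold
n_pixels = 10**5    # pixels per image (the comment says 1e5, perhaps 1e6)

# If no target is present anywhere, the expected number of false
# "target present" calls per image when every pixel is tested at alpha:
print(alpha * n_pixels)    # 5000.0 false alarms per image

# Per 5x5 block of pixels, as in the comment:
print(alpha * 25)          # 1.25 -- more than one "reject null" per block

# A Bonferroni-style threshold that keeps the chance of even one false
# alarm per image near 5% is orders of magnitude smaller than 0.05:
print(alpha / n_pixels)    # 5e-07
```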
