John Carlin and I write:

It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true.

A common conceptual error is that researchers take the rejection of a straw-man null as evidence in favor of their preferred alternative. A standard mode of operation goes like this: p < 0.05 is taken as strong evidence against the null hypothesis, p > 0.15 is taken as evidence in favor of the null, and p near 0.10 is taken either as weak evidence for an effect or as evidence of a weak effect.

Unfortunately, none of those inferences is generally appropriate: a low p-value is not necessarily strong evidence against the null, a high p-value does not necessarily favor the null (the strength and even the direction of the evidence depends on the alternative hypotheses), and p-values are in general not measures of the size of any underlying effect. But these errors persist, reflecting (a) inherent difficulties in the mathematics and logic of p-values, and (b) the desire of researchers to draw strong conclusions from their data.

Continued evidence of these and other misconceptions and their dire consequences for science . . . motivated the American Statistical Association to release a Statement on Statistical Significance and p-values in an attempt to highlight the magnitude and importance of problems with current standard practice . . .

At this point it would be natural for statisticians to think that this is a problem of education and communication. If we could just add a few more paragraphs to the relevant sections of our textbooks, and persuade applied practitioners to consult more with statisticians, then all would be well, or so goes this logic.

Nope. It won’t be so easy.

We consider some natural solutions to the p-value communication problem that won’t, on their own, work:

Listen to the statisticians, or clarity in exposition. . . it’s not that we’re teaching the right thing poorly; unfortunately, we’ve been teaching the wrong thing all too well. . . . The statistics profession has been spending decades selling people on the idea of statistics as a tool for extracting signal from noise, and our journals and textbooks are full of triumphant examples of learning through statistical significance; so it’s not clear why we as a profession should be trusted going forward, at least not until we take some responsibility for the mess we’ve helped to create.

Confidence intervals instead of hypothesis tests. A standard use of a confidence interval is to check whether it excludes zero. In this case it’s a hypothesis test under another name. Another use is to consider the interval as a statement about uncertainty in a parameter estimate. But this can give nonsensical answers, not just in weird trick problems but for real applications. . . . So, although confidence intervals contain some information beyond that in p-values, they do not resolve the larger problems that arise from attempting to get near-certainty out of noisy estimates.

Bayesian interpretation of one-sided p-values. . . The problem comes with the uniform prior distribution. We tend to be most concerned with overinterpretation of statistical significance in problems where underlying effects are small and variation is high . . . We do not consider it reasonable in general to interpret a z-statistic of 1.96 as implying a 97.5% chance that the corresponding estimate is in the right direction.
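To see how much the uniform prior matters here, a minimal sketch under a normal-normal model follows. The function name `prob_positive` and the prior scale `prior_sd=0.5` (measured in standard-error units) are illustrative assumptions, not values from the paper.

```python
from statistics import NormalDist

def prob_positive(z, prior_sd=None):
    """Posterior P(theta > 0) for an estimate z, measured in standard-error units.

    prior_sd=None means a flat prior; otherwise theta ~ Normal(0, prior_sd^2).
    """
    if prior_sd is None:
        # Flat prior: the posterior is Normal(z, 1).
        return NormalDist().cdf(z)
    # Conjugate update: posterior mean is shrink * z, posterior variance is shrink.
    shrink = prior_sd**2 / (prior_sd**2 + 1.0)
    post_mean = shrink * z
    post_sd = shrink ** 0.5
    return NormalDist().cdf(post_mean / post_sd)

flat = prob_positive(1.96)                       # about 0.975
informative = prob_positive(1.96, prior_sd=0.5)  # about 0.81
print(round(flat, 3), round(informative, 3))
```

Under this assumed prior, which says true effects are typically small relative to the standard error, the same z-statistic of 1.96 gives roughly an 81% rather than a 97.5% posterior probability that the sign is right.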

Focusing on “practical significance” instead of “statistical significance”. . . in a huge study, comparisons can be statistically significant without having any practical importance. Or, as we would prefer to put it, effects can vary: a +0.3 for one group in one scenario might become −0.2 for a different group in a different situation. Tiny effects are not only possibly trivial, they can also be unstable, so that for future purposes an estimate of 0.3±0.1 might not even be so likely to remain positive. . . . That said, the distinction between practical and statistical significance does not resolve the difficulties with p-values. The problem is not so much with large samples and tiny but precisely-measured effects but rather with the opposite: large effect-size estimates that are hopelessly contaminated with noise. . . . This problem is central to the recent replication crisis in science . . . but is not at all touched by concerns of practical significance.

Bayes factors. Another direction for reform is to preserve the idea of hypothesis testing but to abandon tail-area probabilities (p-values) and instead summarize inference by the posterior probabilities of the null and alternative models . . . The difficulty of this approach is that the marginal likelihoods of the separate models (and thus the Bayes factor and the corresponding posterior probabilities) depend crucially on aspects of the prior distribution that are typically assigned in a completely arbitrary manner by users. . . . Beyond this technical criticism . . . the use of Bayes factors for hypothesis testing is also subject to many of the problems of p-values when used for that same purpose . . .

What to do instead? We give some suggestions:

Our own preferred replacement for hypothesis testing and p-values is model expansion and Bayesian inference, addressing concerns of multiple comparisons using hierarchical modeling . . . or through non-Bayesian regularization techniques such as lasso . . . The general idea is to use Bayesian or regularized inference as a replacement of hypothesis tests but . . . through estimation of continuous parameters rather than by trying to assess the probability of a point null hypothesis. And . . . informative priors can be crucial in getting this to work.

It’s not all about the Bayes:

Indeed, in many contexts it is the prior information rather than the Bayesian machinery that is the most important. Non-Bayesian methods can also incorporate prior information in the form of postulated effect sizes in post-data design calculations . . . In short, we’d prefer to avoid hypothesis testing entirely and just perform inference using larger, more informative models.

But, we continue:

To stop there, though, would be to deny one of the central goals of statistical science. . . . there is a demand for hypothesis testing. We can shout till our throats are sore that rejection of the null should not imply the acceptance of the alternative, but acceptance of the alternative is what many people want to hear. . . . we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation . . . we recommend saying No to binary conclusions in our collaboration and consulting projects: resist giving clean answers when that is not warranted by the data. Instead, do the work to present statistical conclusions with uncertainty rather than as dichotomies. Also, remember that most effects can’t be zero (at least in social science and public health), and that an “effect” is usually a mean in a population (or something similar such as a regression coefficient)—a fact that seems to be lost from consciousness when researchers slip into binary statements about there being “an effect” or “no effect” as if they are writing about constants of nature. Again, it will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.

This article will be published in the Journal of the American Statistical Association, as a comment on the article, “Statistical significance and the dichotomization of evidence,” by Blakeley McShane and David Gal.

**P.S.** Above cat picture is from Diana Senechal. If anyone wants to send me non-copyrighted cat pictures that would be appropriate for posting, feel free to do so.

Ah, thanks to Diana for a great kitty picture! And, Diana, you seem to be a cellist and I must recommend my favourite cello sonata, that is, the first cello sonata by Alfred Schnittke! I love the third movement, and the cluster voicings in the otherwise tonal melody!

And of course thanks to Andrew for a good post. I got a bit derailed. “Ribet”, says the frog-like (froggy) creature.

Thank you, A carrot pancake (and Andrew too). I cracked up when I saw the photo with this post; it works.

Thank you for bringing up the Schnittke cello sonata. It is beyond me technically, but I love the Gutman/Lobanov recording.

Ha, I figured you’d know Schnittke’s sonata, but maybe someone didn’t know and they might be inspired to check it out since we talked about it. Webern is my favourite composer, right there with Schnittke, and he was a cellist too. You know the three pieces for cello and piano, opus 11?

Also relating to cello and Webern… I used to work in a library and they had a cat-alogue of Webern’s works and for some reason they’d included cello in the instrumentation of Webern’s concerto (op. 24). Being a music nerd I obviously noticed that as a mistake and notified the persons responsible for the catalogue. But even though it is in reality a cello-less piece, it’s still a wonderful piece of music and I’d recommend the 2nd movement to anyone!

I didn’t know Webern’s three pieces for cello and piano–listened to them twice just now, along with Lynn Harrell’s introduction (“Concentrate ferociously, ’cause it’s like a black hole”). Well worth the listening; I look forward to many more!

I should say that the student of mine and I submitted a paper in which we embrace the uncertainty. This was for the top journal in my field. We got a desk reject with the comment that our paper did not provide any closure. Just last week I reviewed a paper for the same journal, and this paper duly provides closure. I duly recommended that it be accepted.

Sorry, I’m dictating this text, and the software wrote embrace the uncertainty, when I said embrace uncertainty.

Closure, shmosure. Closure shuts out uncertainty, so has no place in most scientific work. Science is inherently open-ended.

Sure, I agree with you. What I’m saying is that embracing uncertainty will be a career killer. What I now do is to embrace uncertainty but don’t let the reader realise that. I rely on the power of pragmatics to lead the reader into thinking that I delivered closure. Oops, I just revealed my secret. Luckily, not many psycholinguists read this blog.

Sad that it is this way. I hope it will change eventually — the sooner the better.

Andrew, please could you provide an example for which “a low p-value is not necessarily strong evidence against the null”?

Chris:

You could start with the collected works of Satoshi Kanazawa and Brian Wansink, and at least one paper by Daryl Bem, and some back issues of PPNAS.

A few years back I would’ve said that if you read, say, 10 or 20 papers, probably at least one of them offers up a low p-value that provides no evidence against the null.

Then after a couple years of reading the blog, I would’ve said that if you read, say, 10 or 20 papers, probably most of them offer up a low p-value that provides no evidence against the null.

But now after like 4 or 5 years reading the blog, I’d say that actually most studies convince me that the null is not true. But only because the null is never true and we don’t even need any evidence – almost everything people study has lots of effects, and these effects vary in relative magnitude both across people and within people over time.

…that said, I’ve also learned about how easy it is to get a low p-value even in simulated environments with absolutely no treatment effect when you have a) multiple noisy measures of a set of outcomes; b) researcher freedom to explore specifications; c) high powered incentives for researchers to find a low p-value; and d) a shocking and perhaps shameful ignorance of what statistics is and isn’t (or can and can’t do) among empirical social science researchers. So you could take Andrew’s statement that way too, which is probably closer to how he meant it in this context.
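A rough simulation of point (a) and (b) above: with no true effect anywhere, a researcher who measures several outcomes and reports the smallest p-value will find "significance" far more often than 5% of the time. The function name and the choice of 10 independent outcomes are illustrative assumptions.

```python
import random
from statistics import NormalDist

def min_p_false_positive_rate(n_outcomes, n_sims=20000, alpha=0.05, seed=1):
    """Share of null simulations in which the smallest of n_outcomes p-values is below alpha."""
    rng = random.Random(seed)
    norm = NormalDist()
    hits = 0
    for _ in range(n_sims):
        # Each outcome yields an independent z-statistic with no true effect.
        p_values = [2 * (1 - norm.cdf(abs(rng.gauss(0, 1)))) for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims

rate_one = min_p_false_positive_rate(1)    # close to 0.05
rate_ten = min_p_false_positive_rate(10)   # close to 1 - 0.95**10, about 0.40
print(rate_one, rate_ten)
```

Real outcome measures are usually correlated, so the inflation in practice would differ from this independent-outcomes sketch, but the direction is the same.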

Chris: With a large enough sample size, you can get tiny p-values when the null hypothesis is true. (Try a simulation yourself if you don’t believe me.)

I guess I messed up in what I said. What I should have said was: With larger sample sizes, you will be rejecting the null hypothesis with smaller effect size estimates than what you would get with smaller sample sizes. (Related to “larger sample sizes give higher power,” and to the “winner’s curse” phenomenon that falsely rejecting the null tends to give inflated effect size estimates.)
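The corrected statement can be checked with simple arithmetic. The sketch below assumes a z-test on a mean with known standard deviation `sd`; the function name and the sample sizes are illustrative.

```python
import math

def smallest_significant_estimate(n, sd=1.0, z_crit=1.96):
    """Smallest |mean| that reaches two-sided p < 0.05 with n observations of known sd."""
    return z_crit * sd / math.sqrt(n)

# The rejection threshold shrinks like 1 / sqrt(n).
thresholds = {n: smallest_significant_estimate(n) for n in (25, 100, 10000)}
for n, value in thresholds.items():
    print(n, round(value, 4))
```

With `sd = 1`, the threshold falls from about 0.39 at n = 25 to about 0.02 at n = 10000.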

Chris, translate Andrew’s statement from p-value talk into plain English, and that may help you construct examples. A low p-value means the result is rare or surprising under the null. For that not to be strong evidence against the null, it should be true that the result is also rare under reasonable alternatives. Construct an example with that property.
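One way to construct such an example is to compare how probable the observed result is under the null and under a specific alternative. This is a point-alternative likelihood ratio with unit standard error, a simplification chosen for illustration; the alternative values 1.96 and 5.0 are assumptions.

```python
from statistics import NormalDist

def likelihood_ratio(z_obs, theta_alt):
    """Likelihood of z_obs under theta = theta_alt relative to theta = 0 (unit standard error)."""
    return NormalDist(theta_alt, 1).pdf(z_obs) / NormalDist(0, 1).pdf(z_obs)

lr_best = likelihood_ratio(1.96, 1.96)   # alternative centered exactly on the data
lr_large = likelihood_ratio(1.96, 5.0)   # alternative that predicts a much larger effect
print(round(lr_best, 2), round(lr_large, 3))
```

Even the alternative most favorable to the data gives a ratio below 7 for a result with p = 0.05, and an alternative that predicts a much larger effect gives a ratio below 1, meaning the same "significant" result favors the null over that alternative.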

An example I’m partial to can be found in this review article, in the discussion of Fig 6:

Probabilistic Record Linkage in Astronomy: Directional Cross-Identification and Beyond

http://www.annualreviews.org/doi/10.1146/annurev-statistics-010814-020231

The problem is one of coincidence assessment, in this case, in directional statistics: You measure, with uncertainty, the directions to two objects on the sky, and the point estimates are near each other. Is this evidence for them being associated (sharing a common true direction), or is it merely a coincidence? Suppose the point estimates are close enough to have a small p-value under the null of a uniform distribution on the sphere, i.e., they are surprisingly close in great-circle distance. If the measurements are very precise, then under the alternative hypothesis that they are associated, it’s not surprising that they are close (indeed, it’s to be expected). But if the measurements have large uncertainty, then even under the alternative it would be surprising to have the point estimates be close to each other. As a result, the Bayes factor can only weakly favor association, even when the p-value under the null is quite small.

Andrew, thanks, I get it now, you’re referring to the garden of forking paths problem.

Chris:

Yes, but if effect size is small and estimates are noisy, then “p less than .05” provides very little information even for a preregistered study with no forking paths. Carlin and I discuss this in our 2014 paper. That said, forking paths play an important role here in facilitating the production and publication of such noisy results.
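A minimal simulation of this point, with no forking paths at all: a single preregistered estimate, normally distributed around a small true effect. The values `true_effect=0.5` and `se=1.0` are illustrative assumptions, and the code is a sketch rather than anything taken from the 2014 paper.

```python
import random

def type_s_and_m(true_effect, se, n_sims=200000, z_crit=1.96, seed=1):
    """Return (power, wrong-sign rate, exaggeration ratio) among significant estimates."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > z_crit * se:
            significant.append(estimate)
    power = len(significant) / n_sims
    # Share of significant estimates whose sign disagrees with the true effect.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Average size of a significant estimate relative to the true effect.
    exaggeration = sum(abs(e) for e in significant) / len(significant) / abs(true_effect)
    return power, wrong_sign, exaggeration

power, type_s, type_m = type_s_and_m(true_effect=0.5, se=1.0)
print(round(power, 3), round(type_s, 3), round(type_m, 2))
```

Under these assumed values, only about 8% of studies reach significance, roughly 1 in 11 of those have the wrong sign, and the significant estimates overstate the true effect by a factor of nearly 5.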

Martha, I’m not quite following your comment. If I am doing a simulation and not doing multiple unaccounted for comparisons then as I understand it the p-value by definition is uniformly distributed between 0 and 1 if the null hypothesis is true. Therefore, I should get a p-value of value p or smaller only p*100 percent of the time.
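The uniformity claim is easy to check by simulation. The sketch below assumes a two-sided z-test with a continuous test statistic and a true null; the function name and thresholds are illustrative.

```python
import random
from statistics import NormalDist

def null_p_values(n_sims=50000, seed=1):
    """Two-sided p-values from z-statistics drawn under a true null."""
    rng = random.Random(seed)
    norm = NormalDist()
    return [2 * (1 - norm.cdf(abs(rng.gauss(0, 1)))) for _ in range(n_sims)]

p_values = null_p_values()
# If p is uniform on (0, 1), the share of p-values at or below each cut is about the cut itself.
share_below = {cut: sum(p <= cut for p in p_values) / len(p_values) for cut in (0.01, 0.05, 0.5)}
print(share_below)
```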

Andrew, thanks, I will take a look at that paper.

Chris,

The reply I intended to your comment here ended up above (below my original comment). Apologies for the double mix-up.

Martha, thanks for letting me know. I think that p-values are OK in subjects like physics when you have large sample sizes and high signal to noise. But they don’t work so well in subjects like sociology where the sample sizes are usually small and you usually have low signal to noise. In areas like particle physics they have a 5 sigma threshold for discovery and so I think that generally gets around the problems Andrew is concerned about. But I understand that Andrew is interested more in the social sciences case and also agree Bayesian methods are very useful in physics particularly in cases when you don’t have large sample sizes and high signal to noise.

Chris, that’s correct but consider that in the 1:20 times (assuming no forking paths at all) most people would reject the null when it’s true. The p-value is still uniformly distributed in that <.05 range. It's equally probable that you'll obtain any individual p-value in the range <.05 (assuming continuous data). And, therefore, very low values don't have any more meaning than ones close to .05. Only the cutoff mattered.

Psyoskeptic, I am not quite following your point. I was referring to a 5 sigma cutoff where without forking paths the null hypothesis should be neglected when true only in about 1 in 3.3 million cases.
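For reference, the tail probability behind a 5 sigma cutoff can be computed directly. The sketch assumes a standard normal test statistic; the variable names are illustrative.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z, using erfc for accuracy in the far tail."""
    return 0.5 * math.erfc(z / math.sqrt(2))

one_sided = upper_tail(5.0)    # about 2.87e-07
two_sided = 2 * one_sided      # about 5.73e-07
print(f"one-sided: 1 in {1 / one_sided:,.0f}")
print(f"two-sided: 1 in {1 / two_sided:,.0f}")
```

The one-sided figure comes out at roughly 1 in 3.5 million, the same order as the number quoted above, and the two-sided figure is half that.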

That doesn’t solve the issue of when the statistical null hypothesis might be non-0 even if the theoretical claim predicting the non-0 effect is not true.

Suppose you say that under theory A, light shouldn’t be affected by gravity. Then you observe that light behaves differently around large gravitational fields than it does in their absence (the two groups of observations – light moving with and without a large gravity source nearby – were unlikely to be generated from the same data-generating process). What has this experiment “proved” (even in the sense of probability statements)? General Relativity is (probably) right? That the previous models missed something important? That gravity-inducing objects also induce changes in the behavior of light waves/particles? That the methods we use to measure the movement of light through space are affected by gravity?

How about an older example: Suppose I believe that the planets are on a different layer of revolving spheres around the earth than the stars. I predict that, if they were on the same sphere, that Jupiter would be in location X on January 1st of the year 1186. Under my preferred model, I predict it will be in location Y. I then confirm it is in location Y, and can rule-out it is in location X. Have I proved my theory of celestial-sphere-nesting?

I’m sticking with physics-y examples because they are useful ways of pointing out that “rejecting the null” does not ever really tell us much about the world, even under “ideal” conditions. Sure, it can help us realize certain features of the world are not consistent with certain models of it. But the usefulness of the exercise comes from the quality of the research design, the strength of the novel theoretical prediction made by one (theoretical, not statistical) model relative to another potential (theoretical) model, and the relationship between the predictions themselves and the tests of them in the world. Even in physics these steps often fail – in social science the failures are even more obvious: samples and treatments are convenience-based; theoretical models don’t make precise predictions that can be tested against each other; and we can almost never measure the actual thing we are interested in (intrinsic motivations, feelings, attitudes, preferences, behavioral trade-offs, thoughts).

So in general – even if researchers (in physical or social sciences) do everything “perfectly” according to statistical methods, a low p-value rarely tells us much about the world. It often just tells us a lot about how one researcher was able to interpret their results according to their preferred (and pre-approved by Science) metaphysics of existence.

Now, I understand that this doesn’t relate to rejection rates in purely idealized experimental settings. And you are right in general that p-values can, under certain conditions that are unrelated to sample size or variance (given some min N and max V), give us appropriate rejection rates.

But that means very little epistemologically. And when you say things like “the null hypothesis should be neglected when true only in about 1 in 3.3 million cases” then I think that statement is either intentionally or unintentionally conflating rejecting the statistical null hypothesis (some two groups of observations come from the same DGP) with some theoretically-motivated alternative hypothesis (something that could be “true” in the sense of scientifically meaningful and interpretable). I mean, sure, we could say it is “true” that the alternative statistical hypothesis may be favored over the statistical null, but then all we are saying is “there is probably a difference between these two groups.” We don’t get to say anything about why unless we can show that under no other theoretical model this thing could happen. But how many explanations exist for, say, the black-white earnings gap; or the differences in educational attainment across countries; or the cause of schizophrenia; or the nature of the transmission of knowledge across people? And how many of them are actually identifiable from each other in terms of concrete predictions about the world? I mean, you could literally give the same sets of results from some social science study and have 10 groups of researchers give 10 interpretations, each within their own metaphysical frameworks and each one equally consistent with the data.

tl;dr: 5 sigma thresholds make it more difficult (read: expensive) to detect differences between groups. This, though, does not solve the most important issues in scientific inference, and the false promise that it does is itself part of the problem. The statistical null is a stupid object, and the rejection of it via p-values tells us very little.

Yes, LIGO is a great example. The p-value they report does not help at all in distinguishing between some kind of atmospheric effect, something from the sun, national power grid fluctuations etc and a gravitational wave.

They claim they ruled out all other plausible reasons for such a signal (and an insane amount of effort was put towards doing this), and that may be true. But let’s not give undue credit to the “rejecting chance” step in the process.

jrc and Anoneuoid, I think, at least in physics and other similar sciences, one does need some kind of convention for what is a discovery. So I can’t think of a particularly good alternative to p-values or the often used equivalent of seeing how many sigmas one is from the null value. But one other convention which is also generally followed is that the discovery has to be made by more than one experiment. So for example in the Higgs discovery, it was crucial that both the ATLAS and CMS experiments had seen a 5 sigma detection.

“I can’t think of a particularly good alternative to p-values”

Bayesian probability distributions over parameters in a well developed mechanistic model for the result.

Suppose you have some coupled climate / ecological model for desertification or forestation or some such process. Which would be more convincing to you that the process is actually occurring at location X:

1) At location X changes since last year are 5 sigma away from changes seen at 100 random locations on the face of the earth, but you have no mechanistic model?

2) When fitting your mechanistic model to 15 years of historical measurements in the vicinity of point X, a parameter that implies desertification as the asymptotic result whenever it is greater than 1 has a 95% highest posterior density region of 1.13 to 2.27?

The only way it could be crucial is if every other possible explanation for any deviation from the null model had been ruled out. If that is not the case, the p-value has surely been misinterpreted.

Also please remember the problem most people have is not really with the p-value. It is with choosing a null hypothesis that nobody believes.

I see two aspects to the problem. First, a statistical evaluation, however accurate, isn’t enough. People are looking for the statistician to help them make a decision. So they really need risks associated with decisions; or even an expected value analysis. In many published papers the decision is just between “keep looking for an effect” or “spend your time and money on something else.” Second, people are generally not very good at understanding probabilities…especially ones close to 0 or 1. Having them understand a risk analysis is even more fraught.

Daniel, the way p-values are used is often with a mechanistic model which evaluates how often a test statistic would have a value equal to or larger than the observed value if the parameter of interest was set to the null value. But, I prefer your suggested confidence interval approach and I actually mentioned it in the second half of the sentence you quoted, although I should have written “almost equivalent” rather than equivalent. Although, if I understand Andrew’s post correctly he also objects to the confidence interval approach under the heading “Confidence intervals instead of hypothesis tests”.

Anoneuoid, it’s crucial to have more than one experiment confirming a result as a check if there are unaccounted for systematics. This is something that is needed for all statistical procedures, not just p-values.

Sure, I agree 100% on the need for independent replication. These huge physics experiments aren’t even really independent enough for my taste, but practical issues really do get in the way there. But imagine if they both saw “5-sigma” signals, but at different mass… clearly it is not the statistically significant deviation from the null model that is crucial.

Anoneuoid, I agree it was important that the masses were consistent within the errors. I think that was needed in addition to a significant deviation from the null.

Right, rule out chance, rule out systematic detector errors, rule out problems with the filtering algos, rule out other proposed particles. These are all needed. Why should ruling out chance have a privileged position over any other explanation besides the Higgs? It makes no sense unless you misunderstand what the p-value is telling you.

I agree testing for other explanations is important but I don’t think that implies it is also wrong to test for chance.

It isn’t wrong to rule out chance, but there is no rational reason to focus on ruling out that vs any other explanation. In fact, that is the least interesting thing to rule out.

Instead what we see is headlines about “chance ruled out with 5-sigma confidence!”. This is a sign of misunderstanding and that the people involved are likely to come to incorrect conclusions.

Anon:

Pages 61-62 of this classic discussion by I. J. Good are relevant to this point.

Anon, well I agree that the headline “chance ruled out with 5-sigma confidence!” is not strictly correct as the 5 sigma threshold does not account for garden of forking paths or systematics below a level of about 1 sigma. I think the main point is that it is a threshold which has proven in a vast number of examples to be reliable in detecting discoveries in particle physics. As we agreed earlier it does require a confirming experiment as well as sometimes there are systematics larger than the 1 sigma level that have been incorrectly not accounted for.

While these issues also exist, that is not what I was talking about above. Instead it was that there are always multiple explanations for a given observation and the scientist’s job is to winnow down the possibilities. Ruling out “chance” is only one minor part of this but gets all the attention. This seems to be because 1) it is the easiest thing to rule out, and 2) people are confused/wishful about what the p-value means (i.e., they think it somehow corresponds to the probability their theory is correct).

Anon, yes that is an unfortunate misinterpretation of p-values that is sometimes made. Although, I am sure not by the people who wrote the Higgs discovery papers.

Is the McShane and Gal paper available anywhere?