Abandon Statistical Significance

Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote a short paper arguing for the removal of null hypothesis significance testing from its current gatekeeper role in much of science. We begin:

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration—often scant—given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.

Read the whole thing. It feels so liberating to just forget about the whole significance-testing threshold entirely. As we write, “we believe it is entirely acceptable to publish an article featuring a result with, say, a p-value of 0.2 or a 90% confidence interval that includes zero, provided it is relevant to a theory or applied question of interest and the interpretation is sufficiently accurate. It should also be possible to publish a result with, say, a p-value of 0.001 without this being taken to imply the truth of some favored alternative hypothesis.” We also discuss the abandonment of significance-testing thresholds in research and statistical decision making more generally. Decisions are necessary, but a lexicographic rule based on statistical significance, no.

P.S. The adorable cat pictured above sees no need to perform a null hypothesis significance test.

114 thoughts on “Abandon Statistical Significance”

  1. Really like these phrases:

    “zero effect and zero systematic error”

    “traditionally neglected factors”

    “multilevel or meta-analytic in nature”

    “continuous and inevitably flawed learning that is accepting of uncertainty and variation.”

    Though perhaps still too much a default emphasis on (desperately) trying to discern something meaningful from a single (isolated) study.

  2. Where will this be submitted?

    We recently got a revise and resubmit wherein the reviewer demanded that we produce 95% credible intervals (instead of our 90% credible intervals) to make the results more consistent and comparable with 95% confidence intervals and 0.05 test levels.

  3. While I think there’s plenty to be learned from frequentist modeling, a benefit of MCMC approaches is that there isn’t really a p-value and we are left to our own devices to figure out how to interpret the uncertainty. Easier Bayesian software, things like rstanarm for R, might be a Trojan horse for the abandonment of p-values.

    • Valentin:

      I especially like the points about not trying to discern something meaningful from a single (isolated) study – “To avoid perpetuating problems caused by discrete decision rules applied to single studies, … Reliable scientific conclusions require information to be combined from multiple studies and lines of evidence. To allow valid inference from literature syntheses, results must be published regardless of statistical significance, …”

      • Valentin,

        I really liked your article! Indeed, the point about scientific inference being about information COMBINATION is the critical shift in perspective that we need today. It implies that decision criteria should be eschewed, and that multiple evidentiary measures (and original data) should be reported, so that results can be more easily combined.

        I’m really chuffed that there’s so much activity from so many great thinkers on this issue now!

  4. Well reasoned and well written, as always. So too with Amrhein and Greenland’s commentary and the discussion between Lash and Greenland re: the need for cognitive science in methodology. Yet I, and I suspect a good many of your readers, live alas in a less than ideal world where irrevocable decisions must be made on the basis of what scientists publish in the literature. For defense lawyers then, dichotomization may be wrong but it’s often useful.

    Yet now I’m beginning to hear your valid objections turned into the following sort of argument:

    Your honor, all that is required under the law is a showing of more likely than not and these p-value significance thresholds have been demonstrated to be completely arbitrary by our brightest thinkers. Rather, viewing them as a continuous measure of evidential value is the enlightened approach. So admit into evidence any association with p < .5 and let my experts expound upon why biological plausibility (at least in their well paid minds) and inference to the best explanation (Objection your honor, he means "best of a bad lot") makes my claim that coffee caused my client's cancer not just a possibility but a near certainty.

    I've already witnessed one judge accept this sort of argument.

    In thinking this through (as best I can) it occurs to me that to get at the heart of the matter it would do well to have answers from Andrew, Sander, Deborah, et al. to the following question:

    If you were put on trial for a crime you did not commit, one that carried a long sentence, and the evidence against you was purely of a statistical nature, by what criteria would you demand that it be assessed and weighed? Or do the same for something less scary like "take away all your money, your house and garnish your wages" if incarceration seems too far fetched.

    tl;dr If we throw down the only barrier to noise mining, the cause of justice (though not the net worth of lawyers) will suffer; unless there's something I'm missing.

    • Thanatos:

      As discussed in our paper, I fully recognize that decisions should be made, and I think this should be done based on costs, benefits, and uncertainties, not on p-values. I don’t see the “If you were put on trial for a crime you did not commit” question to be so relevant for decision problems, in part because scientific hypotheses are not in general like “crimes” that either happened or didn’t, and in part because you’re conditioning on “you did not commit,” but in science you don’t know the underlying truth.

      • And all scientific conclusions are made (or should be) with reasonable doubt and with the expectation that they will later be overthrown.

        I believe it is paramount to work out how to better “cause to understand” from observations subject to haphazard and systematic uncertainty, without much concern for legal scholars’ need to figure out how to better “cause to agree with these facts and their legal implications”.

        Hopefully some legal scholars will be up for the challenge.

        • Hear, hear. I’m not sure how anyone could ever discover anything useful or communicate what they had discovered if they had to anticipate future lawyers trying to confuse a jury with it.

    • This is an interesting and clearly important perspective. Could you explain the part where p-values are “the only barrier to noise mining”? My understanding is that they’re among the biggest enablers of noise mining.

      I’m also fairly sure I don’t understand the argument your semi-fictional lawyer uses. Does the argument just boil down to, “Smart people disagree, therefore essentially anything could be true”? If so, I’m pretty sure a similar argument could be constructed around literally any method of acknowledging uncertainty in scientific publications.

      • > Could you explain the part where p-values are “the only barrier to noise mining”? My understanding is that they’re among the biggest enablers of noise mining.

        +1 signal

        re: the original question.

        Not a full answer, but for a start I’d like to be tried by a jury of my peers who are presented with a range of evidence, debate and discuss amongst themselves, etc., and then have a judge weigh the costs and consequences of incarceration against the weight of evidence.

        Instead of, say, being assessed by a random number generator…

      • Let me revise my statement in light of your clarifying reply. It seems to me that if we remove the only hurdle to declaring a study able to say whether a particular fact (an element of a causal claim) is more or less likely to be true (and thus admissible), however easily hacked p-values may be, there’ll be no common ground upon which to debate the evidential value of a study.

        In every other context there’s something upon which the arguments are grounded. If an expert says “plaintiff has acute non-lymphocytic leukemia” there are WHO standards on which everyone (almost) agrees and from there we argue risk factors. If an expert says “the shear forces on the wing were such and so” we argue force and not the existence of shear. But if an expert hired to find p<.5 in a neighborhood claiming the cancers on blocks C, G, L, P and Y are probably caused by local landfill X finds it via p<.2 for toe cancer in Aleutian Islanders who play soccer on Saturday to what authority do I turn to call B.S. on it? I understand Greenland's "against method, sort of" stance but it's naïve. We live in a world in which living creatures (like lawyers) exploit their environment for personal gain, and pretending/hoping it isn't so is even more naïve.

        If the answer is "we can't agree on what does or does not constitute minimally sound statistical inference" I ask in reply "then how can you possibly add to the sum of human knowledge?"

        • preponderance of the evidence would suggest something more like “from among all of the possible explanations that anyone has ever thought of for X the one that is relevant in this case has 50% posterior probability or more”

          this is very different from a p value which would tell you “my particular random number generator might generate numbers whose test statistic is this extreme or more about 50% of the time”.
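
          To make the contrast concrete, here is a minimal sketch in Python (all numbers hypothetical): the same data give a largish p-value under one convenient null random number generator, while the posterior probability of any particular explanation depends on what other explanations are on the table and how plausible they are a priori.

          ```python
          from scipy import stats

          # Hypothetical data: an observed difference and its standard error.
          obs_diff, se = 0.3, 0.45

          # "p-value" view: tail probability under one particular null random
          # number generator (true difference exactly zero, normal noise).
          p_value = 2 * (1 - stats.norm.cdf(abs(obs_diff) / se))

          # "Preponderance" view: posterior probability of one explanation among
          # the candidates anyone has put forward, each with a prior plausibility.
          # Here just two made-up candidates: "no effect" vs "effect of +0.5".
          prior = {"no effect": 0.5, "effect +0.5": 0.5}
          like = {
              "no effect": stats.norm.pdf(obs_diff, loc=0.0, scale=se),
              "effect +0.5": stats.norm.pdf(obs_diff, loc=0.5, scale=se),
          }
          total = sum(prior[h] * like[h] for h in prior)
          posterior = {h: prior[h] * like[h] / total for h in prior}

          print(f"p-value under the null RNG: {p_value:.2f}")
          print({h: round(v, 2) for h, v in posterior.items()})
          ```

          The two numbers answer different questions, and neither is a substitute for the other.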

        • Thanatos:

          You ask if we can “agree on what does or does not constitute minimally sound statistical inference.” I can’t force other people to agree with me. For example, based on this series of events, I think there is essentially zero common ground between me and the editorial board of the Association for Psychological Science on what constitutes evidence. I also can’t see myself coming to any agreement with Satoshi Kanazawa regarding statistical evidence. So, if agreement is your goal, I think you’ll never get there. Nonetheless, I do feel that in my applied work I have added to the sum of human knowledge.

          Regarding your larger question: yes, of course there are lots of ways of putting together statistical evidence. For a start, take a look at my two textbooks which are full of examples, or for that matter Red State Blue State which has lots more, without a p-value in sight. For that matter, I’ve also consulted on legal cases and helped companies make business decisions.

          I don’t say significance testing is useless: problems do come up where it is worthwhile to note that a pattern in data could be explained as an output of a particular random number generator. But I see no good reason to privilege this particular sort of statement.

          Regarding your question about toe cancer: I’m no cancer expert but speaking generically I recommend studying all relevant comparisons and fitting the data using a multilevel model, rather than picking out one or two comparisons out of the many things one could reasonably study in a dataset. So, sure, you can call B.S. on that sort of arbitrary analysis, not because of the p-value but because it does not make use of all the data and is subject to obvious selection bias.

          Just as an aside: I use Bayesian methods and wrote a book on the topic. Nonetheless, I don’t kid myself that Bayesian methods are necessary or that, without Bayesian methods, it’s not possible to add to the sum of human knowledge. I recognize that there are many roads to Rome.

        • > “we can’t agree on what does or does not constitute minimally sound statistical inference”
          That is what I see being worked through currently in the statistical discipline – “guess what guys, we are not agreeing on what does or does not constitute minimally sound statistical inference” – but we seem to agree it’s not this, that, and the other version of defining and interpreting p-values, Bayes factors, confidence/credible intervals (containing or not containing zero), etc.?

          So we are in the process of unfreezing (gee a lot of textbooks, courses, commonly accepted practices need to be thrown out).

          Then of course we need to re-freeze to some common agreement on what is minimally sound statistical inference and how we would recognize it. And then at some point teach it to others.

          > “then how can you possibly add to the sum of human knowledge?”
          As Einstein put it – “the eternally incomprehensible thing about comprehensibility” but then somehow we guess our way there.

        • “Unfreezing” and “re-freezing.” I like it (though perhaps the original freezing was part of the problem…).

          Can I get a show of hands for who will be at the related ASA symposium in Bethesda next month? I seem to recall our host here is on the agenda.

        • Thanatos: Up to your usual lawyering in “I understand Greenland’s ‘against method, sort of’ stance but it’s naïve”? Insinuation but no illustration of what I actually said that was naive.

          The reality is that I’ve been harping on the point that scientists like lawyers exploit their environment for personal gain. That problem is well recognized in the “replication crisis” literature under the topic of perverse incentives to hack for “significance”. I’ve merely added that I see researchers hack for nonsignificance too, when that is in their interest (e.g., to protect colleagues or funders from liabilities, or shore up pet theories against refutation, or merely to get attention by going against earlier “significant” results).

          Your comments are the ones that are naive with respect to both science and the law. Here’s why: The rest of us here understand that science is not law and is not supposed to be reaching judgments or decisions with anything like legal finality. Evidence may eventually reach the point at which the science community accepts an explanatory theory as a more or less permanent fixture and thus recommends its use for actions (as with smoking and health policy), but that is not the same as the finality and consequences of a legal verdict – a verdict which is far more difficult to approach with formal theory than any statistical or engineering problem I’ve seen.

          Wise verdicts will use science but need much more (e.g., responsibility issues), which is to say justice is nowhere as near automation as (say) car driving. Science isn’t either – turning science into a sausage machine for manufacturing inferences has been a disaster well illustrated by the ravages of NHST. Some of us who have seen the horrorshow of certain expert reports (both defense and plaintiff) can testify that science is being corrupted by pressures to act like law and reach verdicts to the benefit of one side over another. Trying to turn science into a machine for manufacturing legal verdicts (as you seem to be demanding) is thus the road to an even bigger disaster for science, justice, and the law.

          The methods issues science confronts are difficult enough; those for law far worse since courts do have to reach decisions. To do better than the current system would require a lot of legal readjustment and experimentation which, let’s face it, are less likely to happen than agreement about statistical methods. It’s already been noted that, at a minimum, the practice of each side (instead of the court) hiring the expert witnesses is a major source of the problem, since it creates massive interest conflicts for all the expert witnesses. My view is thus that tort reform would start with each side’s expert limited to consulting to the lawyers, and testifying experts instead hired by the court after checking that none have COIs in the case (with each side having some preliminary questioning and veto power, as with juries, and having to bear the cost of paying the witnesses for their time while being forbidden to communicate without the court’s supervision). There was a start in this direction at the federal level but it seems to have fizzled. If you are really concerned about science in the law, try resurrecting that solution.

        • “Trying to turn science into a machine for manufacturing legal verdicts (as you seem to be demanding)”

          I think you misunderstand his concern though. I read this as he would like to have science STOP MANUFACTURING CERTAINTY so that Lawyers can’t use this false certainty to get unjust verdicts. Part of stopping this manufacture is to come up with new criteria for deciding when a scientific assertion approaches the level of “very probably true, enough so that verdicts which assume it is exactly true are not unjust”.

          that level is far above “Someone somewhere once published something with p less than 0.05” which seems to be the current standard in the law.

          It is, in fact, the same kind of standard mentioned by “Laplace” in a thread that was revived in the last few days:

          http://statmodeling.stat.columbia.edu/2016/11/28/30645/#comment-570997

          We take a thing to be “true” when the uncertainty involved is small enough that the decisions we ultimately make are not materially affected by the fact that the unknown thing might take on any of the values in the posterior interval.

          In the absence of that “delta function like” narrow uncertainty interval, we instead need to take a view that the thing *is* uncertain, and then to do a good job a decision must explicitly acknowledge the uncertainty, and weigh the pros and cons of making the decision. Formally this is done in Bayesian Decision Theory.
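
          For what it’s worth, that last step can be written down in a few lines. A minimal sketch of the idea, with made-up utilities and made-up posterior draws standing in for real MCMC output:

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          # Stand-in for posterior draws of an uncertain effect (e.g., MCMC output).
          effect = rng.normal(loc=0.2, scale=0.4, size=10_000)

          # Made-up utilities: acting yields a benefit proportional to the true effect
          # minus a fixed cost; not acting is the status quo (utility 0).
          expected_utility = {
              "act": float(np.mean(10.0 * effect - 1.5)),
              "don't act": 0.0,
          }

          # The decision weighs the whole posterior, not whether an interval excludes zero.
          decision = max(expected_utility, key=expected_utility.get)
          print(expected_utility, "->", decision)
          ```

          Note that “act” can come out ahead here even though the posterior interval for the effect comfortably includes zero, which is exactly the point about weighing costs, benefits, and uncertainties rather than applying a threshold.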

        • I’ll let my copy of Modern Epidemiology and a 2+ decades-old redweld full of your papers (from which I have learned much and have been disabused of much) stand as testament that I meant no slight.

          What I inartfully attempted to convey was frustration with (a) what Daniel wrote; the attempt by courts to turn your measures of a model’s uncertainty into certainty about a competing (and usually untested) conjecture; and, (b) the increasingly common claim sometimes made by people citing Rothman/Greenland (especially when we get to the execrable – The Court: “Counsel, let’s now go through the Bradford Hill causal criteria analysis” – phase of the proceedings) that either methods don’t matter or that no method is any more likely to ferret out false claims than any other. The result in at least two cases has been courts deciding that causation can be discovered and justice thereby done merely from a credentialed scientist’s assessment of biologic plausibility alone.

          What I hoped to get from you assembled scientists was (recalling the witch in Monty Python and the Holy Grail) agreement as to whether there exists some minimal level of testing of hypothetical statistical claims made against you that would, if the test(s) were passed, cause you to say “it’s a fair cop”. As for the other points you raised:

          Paragraph 3 (yours): Since the ASA statement on p-values was published there have been more than 500 published legal opinions and orders that contain “statistically significant”, “confidence interval”, “statistical power”, etc. Children are being taken from their parents, prisoners executed, fortunes won or lost all on the basis of reasoning like this from a recent federal appellate court opinion: “The p-value quantifies the statistical significance of a relationship; the smaller the p-value the greater the likelihood that associations determined in a study do not result from chance.” I’m not saying you’re responsible for this mess. I’m saying you have the street cred to help clean it up. To that end, if you’re asked to help on the next version of the Reference Manual on Scientific Evidence please consider it.

          Paragraph 4 (yours): Responsibility in the context of duty (along with the other aspects of wise judgments) was supposed to be parked in the definition of “legal causation” (which was grounded in public policy theory not too distant from that of public health) but courts began to ignore it 40+ years ago when they became convinced that but-for causation was all that was needed and NHST could find it. Thus the spectacle of the administration of justice via NHST. Thus, I’m in complete agreement.

          Finally, as to your fifth paragraph, (because witches), you’ll be pleased to know that the famous jurist Learned Hand made your point in a law review article 116 years ago. In his survey of the use of expert witnesses at trials he covered a number of instances of their use before juries including in “The Witches Case” 6 Howell, State Trials, 697 – “Dr. Brown, of Norwich, was desired to state his opinion of the accused persons, and he was clearly of opinion that they were witches, and he elaborated his opinion by a scientific explanation of the fits to which they were subject.” Hand was troubled by the uses to which “science” had been put but more importantly troubled by the role of experts.

          Expert witnesses were supposed to testify to “uniform physical rules, natural laws, or general principles, which the jury must apply to the facts.” That meant that the jury was largely displaced from its usual role of supplying the major premise (drawn from common sense and common experience) to which the admitted testimony of witnesses was applied, but IFF they believed the expert witness. That in turn drew the focus of the lawyers to the credibility/likeability of the expert rather than the soundness of the claim he/she was making. This is how we wound up with (true story) a highly credentialed, bright, quick and deadly expert witness being paid $2 million per year NOT to testify against a certain party and group of lawyers. Anyway, here’s what Hand wrote in 1901 about paid experts:

          “The expert becomes a hired champion of one side… Enough has been said elsewhere as to the natural bias of one called in such matters to represent a single side and liberally paid to defend it. Human nature is too weak for that; I can only appeal to my learned brethren of the long robe to answer candidly how often they look impartially at the law of a case they have become thoroughly interested in, and what kind of experts they think they would make, as to foreign law, in their own cases…. It is obvious that my path has led to a board of experts or a single expert, not called by either side, who shall advise the jury of the general propositions applicable to the case which lie within his province.” http://www.jstor.org/stable/pdf/1322532.pdf?refreqid=excelsior%3Ae67b3a059b1c64f4d63f7b6fbbc5018b

        • At the same time, imagine what damage could be done by “regulatory capture” of the list of “allowed” experts who can inform the court… I don’t think the solution of the problem of expert witnesses is to stop lawyers from hiring opposing ones, though I could imagine adding an additional one hired by the judge/court might be of help.

        • An excellent point (and one that resonates given my experience with the FAA and the capture of its employees at a local tower by a flight outfit that likes to perform 200′ flying and acrobatics over my house, a host of regulations notwithstanding). In most cases having dueling experts works fine. One expert says “the wreck was caused by speeding”, talks about skid marks, and there’s no debate about whether friction is a real phenomenon. The other says “given the damage to the impacted vehicle the speed at the time defendant applied his brake was less than the speed limit”, says F = M*A, and A. B. Hill is not summoned via Ouija board to say whether or not force applied to sheet metal can cause it to be dented.

          Perhaps it’s this business of thinking about variable phenomena in a language more suited to discussing gravity, the charge on an electron, etc. that’s at the root of the problem. Maybe when Fisher read Student’s paper and saw how well his empirical test of 750 randomized N=4 finger length samples fit his newly discovered curve he thought “well, it ain’t E=Mc^2 but there is an underlying law here!” Whereupon he immediately fell into reification’s trap because he had no solid theory/philosophy of modeling. I’m hoping Andrew’s new book will devote a page or two (maybe it’s already in the old book but I’m not buying it to find out since the iDataAnalysis Model X is rumored to be coming out soon and I’m cheap) to the fundamental issue of “how (to paraphrase Student) may a practical man reach a definite conclusion from all this modeling you’re going on about, Andrew?”

        • > Whereupon he immediately fell into reification’s trap because he had no solid theory/philosophy of modeling.

          I mean, Fisher was far from perfect, but don’t forget Fisher’s major contributions to the modern synthesis of genetics and natural selection using…mathematical modelling!

          He has at least one PDE (in various forms) named after him: https://en.wikipedia.org/wiki/Fisher%27s_equation

        • Thanatos:

          Statistics should be about grasping the real uncertainties from what we observe, and you seem to be requesting some certainty about that uncertainty (e.g. uniform physical rules, natural laws, something about general principles?), which is very likely just not possible.

          Unlike commonly accepted accounting practices, or Newtonian physics that is adequate for its purpose, etc., I do not think those are in the foreseeable future for statistical practice. But as I said earlier, “That is what I see being worked through currently in the statistical discipline,” so I might be wrong.

          On the other hand, science, as hard as it is, does not need to meet the additional requirements of the courts – so I think it is best to start there. If we can sort out what constitutes minimally sound statistical inference in science, then maybe that can be upgraded to meet those additional requirements. And maybe not.

          However, the idea of a court-appointed expert, perhaps to arbitrate between the competing experts, seems a far more promising route. I think you have already convincingly pointed out that whatever sensible stuff a statistician might write, it will be naively or purposely twisted into something very different that will do a lot of harm!

        • Thanatos: Thanks for the detailed reply. I don’t see where we are disagreeing on matters of logic, and perhaps not even on matters of law. But as a test I’ll add these potentially controversial observations of mine:

          1) I can’t fathom why some still defend having testifying (as opposed to consulting) experts hired by the opposing parties, which easily creates situations in which the majority of unbiased expert opinions get excluded because they don’t fall clearly enough on either side. The goal as I see it should be to de-bias expert testimony as much as possible and get a fair representation of what is out in the field, given that the triers of fact will not have the expertise to do so very effectively. Your Hand story seems to support my view of this problem. And court experts could help address the problem of absurd claims getting filed: the availability of court experts early on should help get more of these meritless cases thrown out early, and could discourage such filings if the plaintiff has to pay the cost of this process when complete lack of scientific merit is determined (and make such a determination more sound than a court could achieve). Each party might recommend experts to the court, but again all testifying experts need to be screened by all parties for COIs and other bias sources (such as connections to the parties).

          2) Regarding Hill: Like P-values, his list takes a lot of misplaced blame for abuse in the hands of incompetent or biased users. To be sure the list is far from perfect (Mod Epi goes through the list critically) but like a chainsaw it can be used constructively (to remind one of relevant items to check) or for butchery (if taken as a set of necessary conditions, as I often see defense do). There have been many proposals to update it but none I’ve seen result in dramatic shifts, and in my view only support the idea that it was quite a nice summary for 1965 when relative risks of 10 were being contested by industries. This says to me such updates as needed are for extension to our current, far less certain RR of 2 or less era, not for philosophic subtleties. And that means modern causal analysis tools will enter into its application.

          3) There is a fundamental asymmetry I’ve observed between defense and plaintiff lawyers in torts: Plaintiff lawyers search for cases they can win, which for the best means at the outset they hire consulting experts to judge whether the science could actually support causation, before entering the arduous and expensive litigation process. Defense is more reactive, having to defend in response to plaintiff claims and build a case against causation regardless of what the actual science shows (even if only to minimize final judgments or drag on the case to bankrupt plaintiffs, as I’ve seen happen). This asymmetry is incomplete: again, there are plenty of absurd claims filed but again the availability of experts for the court should help get more of these thrown out early, and discourage meritless filings if the plaintiff has to pay the cost of this process when that happens.

          4) Given the inertia that has greeted proposals and mechanisms for court experts, we need effective if less ideal solutions to the current mess. The Reference Manual on Scientific Evidence is one and my impression is it has helped quite a bit, albeit in a limited sphere. I think it needs updating with coverage of cognitive biases, with guidelines to help courts detect not only bias in expert opinions, but also to help de-bias their own judgments.

          That said, I was puzzled by your comment about helping with the Reference Manual on Scientific Evidence, insofar as I’m not one known to be shy about sharing my opinions and criticisms (Andrew even blogged humorously about that very fact) – so if they ask they’ll get plenty. Having been however at the 2003 San Diego conference on science and the law that fed into the latest edition and not contacted since, I won’t hold my breath for that.

      • P-values are a barrier to noise mining in the sense that we often think we see some kind of trend and at least want to consider whether we could have seen that by chance alone without there being anything interesting going on. This is something I run into all the time when people look at early data on new drugs. Statements like “The one patient getting the new drug had an improvement of 100 on this variable that has a standard deviation of 500, the single placebo patient had a worsening of 200, and as a result we have an effect of 300, which is way more than the best drug out there that achieves an average difference of 50. We must immediately rush to develop this amazing drug for the benefit of all of humanity.” are surprisingly common.

        Obviously, p-values are not necessarily a good way to deal with this, but at least there is some kind of barrier. Ideally, you’d really trade off utilities (potential benefit from the drug vs. cost of future resources invested and lost opportunities to research other drugs), but that requires a lot of effort and is rarely done. Don’t get me wrong, I am not saying that p-values are a good approach here, but more that something may be needed to deal with people getting excited about any trend they spot.

        That’s particularly the case once researchers really want to believe and have lots of opportunities to convince themselves by repeated looks at the data, considering multiple outcome variables, different assessment times (after 1 week, after 2 weeks, after 4 weeks of treatment etc.) or subgroups of patients.
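
        To put a number on how weak that one-patient-per-arm comparison is, here is a quick simulation using the made-up figures from the example above (a sketch, not a recommendation):

        ```python
        import numpy as np

        rng = np.random.default_rng(1)
        sd = 500          # between-patient standard deviation from the example
        n_sims = 100_000

        # Suppose the drug truly does nothing: both patients are draws from the
        # same distribution. How often does a 1-vs-1 comparison still show a
        # "difference" of 300 or more in either direction?
        drug = rng.normal(0, sd, n_sims)
        placebo = rng.normal(0, sd, n_sims)
        frac = np.mean(np.abs(drug - placebo) >= 300)
        print("P(|difference| >= 300 given no true effect):", round(float(frac), 2))
        ```

        Under these assumptions a “difference” of 300 or more shows up roughly two-thirds of the time even when the drug does nothing at all.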

        • Bjorn:

          In response to your comment, here’s what we say on the last page of our paper:

          While we see the intuitive appeal of using p-values or other statistical thresholds as a screening device to decide what lines of research—for example, ideas, drugs, or genes—to pursue further, fundamentally this approach does not make efficient use of data: there is no general connection between a null hypothesis-based probability and the potential gains from pursuing a potential research lead or even the predictive probability that the lead in question will ultimately be successful.

          Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model. We’d also like to see—when possible in this and all other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature.

          No need for a lexicographic rule based on null hypothesis significance tests. Based on your comment (“Obviously, p-values are not necessarily a good way to deal with this”), I think we’re basically in agreement. What I’d like people to do in such settings is fit multilevel models, which in fact can be much less susceptible to exaggeration and hype than separate p-values; see this paper. But, sure, not everyone will suddenly start using multilevel models. So, short term, I accept that people will mix all sorts of evidence, including p-values. That’s the way it is. Still no reason for a lexicographic threshold. I see all sorts of reports in real applications where authors and editors make real mistakes by using null hypothesis significance testing to get excited about any trend they spot. Two examples are discussed in this paper, where I argue that these problems are particularly serious in incremental settings (which includes a lot of drug development!) where underlying effects are small and data are noisy.
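
          As one illustration of the shrinkage point (not the method from the linked paper, just a minimal empirical-Bayes sketch with simulated data and assumed, known variance components): partial pooling pulls noisy estimates toward the overall mean, so the most extreme, most “exciting” raw estimates are exactly the ones that get deflated the most.

          ```python
          import numpy as np

          rng = np.random.default_rng(2)
          n_groups, tau, sigma = 20, 0.5, 1.0   # assumed effect sd and measurement sd

          true_effects = rng.normal(0, tau, n_groups)
          raw_estimates = true_effects + rng.normal(0, sigma, n_groups)

          # Partial-pooling (normal-normal) shrinkage with known variances: each raw
          # estimate is pulled toward the grand mean in proportion to its noisiness.
          shrink = tau**2 / (tau**2 + sigma**2)
          pooled = raw_estimates.mean() + shrink * (raw_estimates - raw_estimates.mean())

          i = np.argmax(np.abs(raw_estimates))   # the most "exciting" raw estimate
          print(f"most extreme raw estimate : {raw_estimates[i]:+.2f}")
          print(f"partially pooled estimate : {pooled[i]:+.2f}")
          print(f"true effect for that group: {true_effects[i]:+.2f}")
          ```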

    • A man asks his friends to each measure their heads for hat size in inches. Each one does and writes it in a notebook. Later this notebook is subpoenaed and a statistician is asked to testify. The statistician shows that these numbers, or more extreme ones, occur when a certain random number generator is used with p=0.50, and then testifies that there is a 50% chance that the numbers were not measured and were just generated by the RNG, so the friends never were together at their house and therefore could have been free to commit a certain bank robbery…

      • In case it’s not clear here, the point of this story is that just because someone can think up a random number generator such that some given numbers are not unusual as its output does not mean that the numbers probably DID come from that RNG. By the same token, if you can think up a mechanism by which something could plausibly cause cancer, it does not mean that is the mechanism that did cause a particular person’s cancer, etc. Only by examining all known possibilities and giving each one some plausibility do we find out what we really know about the truth of a complex scenario.

  5. I’m curious how this would apply in physics to questions about the existence of a novel fundamental particle or a new law of nature.

    I first became seriously aware of Bayesian statistics in the early 1990s when controversies over a fifth force of nature were all the rage and Philip Anderson wrote a column in Physics Today about using Bayesian hypothesis testing (basically using information criteria, although Anderson did not use that vocabulary) to distinguish between the case where a novel quasi-gravitational force existed, but with a small parameter characterizing its strength, and no such force (i.e., a sharp point null hypothesis). Anderson argued in that column that applying Bayesian methods in the context of a sharp point null hypothesis was a powerful tool for rejecting false hypotheses and avoiding getting suckered by wishful thinking when effects are small and measurements are noisy.

    In contrast to what you write about neuroimaging, the fifth force almost certainly does not exist, and is thus a “true zero.” It does not seem reasonable to say, “there is almost certainly some nonzero fifth force effect, although the accumulated evidence is consistent with it being very very weak.”

    I also consider how one would think about whether the Higgs boson or a magnetic monopole exists without some version of null-hypothesis significance testing with a sharp point null hypothesis.

    I read your paper and can see how to apply its ideas in many contexts, but am confused about how one would apply them to questions about the existence of the Higgs boson or a fifth force of nature. Everything you say about how there is more to consider than just p-values makes total sense. But in the end, the kinds of questions that I pose here—whether something exists or not—end up being about accepting or rejecting a null hypothesis, even if one may be using many criteria in making that judgment.

    I cannot see how one would replace the traditional strict dichotomy between “exists” and “does not exist” with a more continuous paradigm.

    • Jonathan:

      There are some problems where discrete models make sense. You give the example of a physical law; other examples are spell checking (where, at least most of the time, a person was intending to write some particular word) and genetics (to some reasonable approximation). In such problems I recommend fitting a Bayesian model for the different possibilities. I still don’t recommend hypothesis testing as a decision rule, in part because in the examples I’ve seen, the null hypothesis also bundles in a bunch of other assumptions about measurement error etc. which are not so sharply defined.

  6. I read the paper and I think it is on the right track. Indeed, I teach significance testing (you have to; it’s too ubiquitous not to) much the way the paper recommends. I also avoid using NHST as the sole tool for reaching conclusions about relationships unless the model used for the data allows it. Indeed, I seldom refer to the Neyman-Pearson model.

    But two matters bother me. First, I’m a frequentist and I’ve been much influenced by Virginia Mayo’s take on how to make scientific decisions using statistics. How would her notion of severity tests work in the absence of tests that allow us to evaluate a particular model by assessing the probability that a particular hypothesis would agree with the data if the hypothesis were false or incorrect? (I think I have her right there.) This way of approaching things doesn’t depend on NHST in particular – it depends on the model which test is used – but some tests are needed if we are going to put the data to a severity test. I can see how her views might fit with those of you and your colleagues, but I’m guessing that I’m probably missing something. Besides you being sorta a Bayesian.

    Second, I also emphasize the use of DAGs as a device to parse out causal sequences in observational data. I’m a political scientist and that’s often all we have; many of the quasi-experimental methods either can’t be applied or don’t allow us to make sense of what’s going on. (I’m thinking about the frequent abuse of instrumental variable designs here.) But any DAG that is more exploratory usually has some edges that don’t represent statistically significant relationships and the usual course is to simplify the picture by eliminating those, provided it doesn’t lead to any specification problems. This isn’t a universal condition but it is how a lot of the practical work is done. Query: if we don’t have significance tests, how would this be accomplished? Or is this one of those areas where significance tests remain useful, if the model for the data makes them appropriate?

    I’m pretty sure I’m missing a lot here. I’m not a statistician or a philosopher of science. Any light that can be shed would be handy.

    • Tracy:

      1. Frequentist doesn’t have to be significance testing. For example, this paper is frequentist, as is this. You can do frequentist decision analysis—that’s fine—but I think it should still depend on real costs, benefits, and uncertainties, not on probabilities calculated based on random number generators that we don’t believe. There is no reason as a frequentist that you have to use the logic by which rejection of straw-man null hypothesis A is taken as evidence in favor of alternative hypothesis B. (Indeed, the above-linked papers demonstrate how this procedure can have terrible frequency properties.)

      2. As a political scientist, I’ve done a lot of analysis of observational data in my applied work, and I do think in practice it’s necessary to simplify our models. But I don’t think it makes sense to do these simplifications based on null hypothesis significance tests. I know all the links are “there” in real life, if I had enough data. The reason for excluding a link is to get more stable estimates (trading off variance for bias) or to get a more understandable model, and both these goals can be considered without reference to a hypothesis test.
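
      One way to see the “terrible frequency properties” point in miniature (a sketch of the exaggeration, or type M, issue, with made-up numbers): condition on statistical significance when the true effect is small relative to the noise, and the estimates that survive the filter are forced to be overestimates.

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(3)
      true_effect, se, n_sims = 0.1, 0.5, 200_000   # small effect, noisy estimates (assumed)

      estimates = rng.normal(true_effect, se, n_sims)
      p_values = 2 * (1 - stats.norm.cdf(np.abs(estimates) / se))
      significant = estimates[p_values < 0.05]

      print("share of studies reaching p < 0.05:", round(len(significant) / n_sims, 3))
      print("mean |estimate| among the significant ones:",
            round(float(np.abs(significant).mean()), 2))
      print("true effect:", true_effect)
      ```

      In this made-up example the significant estimates overstate the true effect by roughly an order of magnitude, and a nontrivial fraction of them even have the wrong sign.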

      • Thanks, Andrew. This goes a long way toward answering my questions. I also agree with what you are saying; indeed, this is how I usually approach (with slightly different language) significance tests when I cover them in class, for both basics on the tests and DAGs.

        Now, if we can only get the journals and our colleagues to come around …

    • > This way of approaching things doesn’t depend on NHST in particular – it depends on the model which test is used – but some tests are needed if we are going to put the data to a severity test.

      There is a distinction between p-values and severity, on the one hand, and (dichotomization via) tests, on the other. In the paper that motivated this blog post, McShane et al allow that p values provide some information that may be relevant to a particular investigation, pointing out that there is value in maintaining the continuous nature of (i.e., in not dichotomizing) p values. The same basic idea applies to Mayo’s severity, it seems to me. In Mayo & Spanos’ paper “Error Statistics,” they define severity for data x and hypotheses H as the probability that x would accord less well with H if H were false.

      I’m changing the terminology a bit to de-emphasize its relation to NHST (e.g., in one part of the paper, instead of saying “accord less well with H” they say “a less statistically significant difference would have resulted”), but my point here is that, like p values, severity does not require dichotomization. It is, at least in principle, a continuous measure of … well, something like model fit, I suppose.
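
      For concreteness, severity in the simple one-sided normal test can be computed directly and reported as a curve rather than dichotomized. A sketch under the usual textbook assumptions (known sigma, test of H0: mu <= 0), using the Mayo and Spanos definition as I understand it:

      ```python
      import numpy as np
      from scipy import stats

      # Hypothetical data: sample mean 0.4, known sigma = 1, n = 25.
      xbar, sigma, n = 0.4, 1.0, 25
      se = sigma / np.sqrt(n)

      def severity(mu1):
          """Severity for the claim 'mu > mu1' given the observed mean: the
          probability of a result according less well with that claim (a smaller
          sample mean) if mu were only mu1."""
          return stats.norm.cdf((xbar - mu1) / se)

      for mu1 in [0.0, 0.1, 0.2, 0.3, 0.4]:
          print(f"SEV(mu > {mu1:.1f}) = {severity(mu1):.2f}")
      ```

      Read as a curve over mu1, this is the continuous, non-dichotomized quantity described above.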

        • “well-testedness” or corroboration or the like. Even though I’m against rigid cut-offs, I do think that having good evidence for a claim C, or C’s being well-tested, or the like are threshold concepts, in the sense that claims that are poorly probed don’t get a little bit of evidence, even though on other accounts they would. Gelman gives the example of inferring a research claim H based on rejecting a statistical null hypothesis h0 and inferring a statistical alternative h. H might entail h, but even if there’s legitimate evidence for h, H makes claims that haven’t been probed by the statistical test. The method had poor capability of probing ways that H can be wrong, H is horribly tested, and it doesn’t get a little bit of evidence, even if, as we are imagining, H entails h. That’s why the severe tester rejects the idea of ‘confirmation’ as a boost in probability.
        As for claiming that we already know a (point) null is false: sure we know all statistical claims are strictly ‘false’, but that does not mean you have warranted the existence of a genuine effect. You must be able to show it in checkable ways.
        Anyway, I shouldn’t be adding to a post from last year off the top of my head, there are bound to be errors.

        • My theory is that it’s all because of Virginia Mayo, even though almost no one knows who she is/was. Whenever I’m introduced as Virginia Mayo, I’m ready with the quip: “from Deborah Tech”.

  7. The paper appears to be suggesting that a subjective and relatively informal weighting of evidence be substituted for a flawed objective test – that a calculated number is < 0.05. Having a calculated number at least has the benefit that the number can be recalculated if the experiment is replicated – the writer of the paper has provided a hostage to fortune. An informal weighting of a large variety of evidence is not transparent enough to be independently checked on the same data, let alone re-evaluated in the presence of more data. Even if journal editors were to explain the reasoning behind decisions to reject or accept, they will be open to accusations of choosing a weighting procedure to justify a pre-existing bias, unless they commit to a weighting procedure applicable to all papers before evaluating any.

    One attempt to increase the transparency of relatively subjective decisions is http://eprints.lse.ac.uk/12761/1/Multi-criteria_Analysis.pdf Suggesting such a procedure for the evaluation of scientific hypotheses may come close to sarcasm – but it is at least more transparent than a simple statement of Yes or No, supposedly based on the pooled judgement and experience of an editor or a pool of editors.

    • A.G.:

      Currently, journals do not decide what papers to accept based on any “objective test,” flawed or otherwise. They generally follow a lexicographic decision rule in which “p less than 0.05” is required, but with few if any rules on where “p” should come from and, perhaps more relevantly for the current discussion, a completely subjective procedure for any paper that passes this threshold. If a formal or objective procedure is desired, I’d be happy for journals to use some sort of decision-theoretic procedure with a “paper trail” in that all costs and benefits would have to be specified. But in the present paper, we’re not going there, as the last thing we want to do is give an additional burden to already overworked journal editors. I’d start by recommending that journal editors keep doing what they’re already doing, which is subjectively balancing many considerations—just without first applying a p-value threshold. All your issues of potential accusations of bias, etc., apply—but they already apply under the current system.

      To take a particular example: if Susan Fiske wants to justify her decision to publish junk papers on himmicanes and air rage in one of the world’s most prestigious scientific journals, I’d like her to justify her decision on scientific grounds and not hide behind p-values. Especially given that PPNAS gets zillions of submissions with “p less than 0.05” results that still don’t get published.

      As Fritz Strack writes elsewhere in this thread, reasoning based on the p-value is one empirical argument among many, and I see no reason it should be given precedence.

  8. Given that the basic assumptions of NHST (finite population, random sampling) are never fulfilled in experimental research it is about time to demote the p-value to an empirical argument among many.

    • If those were the main problems, then we’d be pretty okay in some (many?*) randomized experiments. If you randomized and did an exact randomization-based rank test (using a parametric analysis honestly does not worry me that much, but just for the sake of the argument…) for your single pre-specified outcome, you can certainly make sure that your p-values go below 0.05 in no more than 5% of such experiments under the null hypothesis.

      * I.e. those where no intercurrent events mess up your randomized comparison and make the interpretation harder. The problems are things like when a patient dies in a randomized clinical trial when you said you really wanted to compare blood pressure at the end of the trial between treatment groups.
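
      A bare-bones version of that calibration argument, for anyone who wants to see it run: with a single pre-specified outcome and an exact rank-based test, the p-value falls below 0.05 in no more than 5% of null experiments by construction. A small simulation sketch (made-up trial, using scipy’s rank-sum test as a stand-in):

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(4)
      n_trials, n_per_arm = 2_000, 10
      rejections = 0

      for _ in range(n_trials):
          # Null world: the treatment does nothing, both arms from the same distribution.
          treated = rng.normal(0, 1, n_per_arm)
          control = rng.normal(0, 1, n_per_arm)
          # Exact rank-based (Wilcoxon-Mann-Whitney) test for the single pre-specified outcome.
          p = stats.mannwhitneyu(treated, control, alternative="two-sided",
                                 method="exact").pvalue
          rejections += p < 0.05

      print("rejection rate under the null:", rejections / n_trials)
      ```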

  9. There already is a legal standard that relies on the preponderance of evidence (often described as p>.5). But that applies in civil cases, specifically not in criminal cases, where the more stringent “beyond a reasonable doubt” standard applies. Much of this has been debated in legal circles and there is a great (old) book on the Bendectin case, where the Supreme Court wrestled with exactly what is considered appropriate scientific and statistical evidence: https://mitpress.mit.edu/books/judging-science. I’m not suggesting the matter is settled – far from it. But these issues are not new, although I’m not aware of cases that specifically put on trial the p<.05 "standard."

  10. Dale – I shudder over the prospect that social scientists might try to adapt practices of proof evaluation used in the legal system. As the National Academy of Sciences concluded in a report a few years ago, the law is basically scientifically illiterate; *it* should be looking at (good) social science, not the other way around.

  11. One thing that bothers me about the topic of what to do with p-values is: What does the distribution of anything over repeated samples have to do with the decision to publish a paper that analyzes a single dataset?

    I suppose journal editors could say that they face a decision on whether to publish an article and the journal gets tons of article submissions so frequentist decision theory is kind of applicable. But I do not see journal editors minimizing maximum regret or anything like that. And that would always create conflicts with authors whose incentives are all conditional on their dataset rather than integrated over data-generating processes.

    • Partly because many psychological science types think that the p-value from a single experiment can inform you about replicability in future experiments.

      The Science article on reproducibility encouraged that kind of thinking (by saying that the lower the p-value in the original study, the more reproducible the result tended to be), and now I see more and more psychologists writing the above statement as if it were a fact.
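
      A quick simulation of why a single p-value is such a shaky guide to what the next experiment will show (made-up design, true effect held fixed, roughly 50% power):

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(5)
      true_effect, sd, n = 0.4, 1.0, 50   # per-group design, assumed for illustration

      p_values = []
      for _ in range(10_000):
          treatment = rng.normal(true_effect, sd, n)
          control = rng.normal(0.0, sd, n)
          p_values.append(stats.ttest_ind(treatment, control).pvalue)
      p_values = np.array(p_values)

      print("share of replications with p < 0.05:", round(float(np.mean(p_values < 0.05)), 2))
      print("p-value at the 10th, 50th, 90th percentile:",
            np.round(np.percentile(p_values, [10, 50, 90]), 3))
      ```

      The same fixed true effect produces p-values ranging from well below 0.001 to around 0.5 across replications, so reading a replicability forecast off one observed p-value reads a lot of signal into a very noisy quantity.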

  12. Andrew, you are promoting the abandonment of well-established inference procedures (with clear mathematical/logical properties) mainly because many practitioners apply them wrongly, as a cooking recipe with little understanding of their limitations. I agree that in many cases these tests are irrelevant to the research objectives and can be dropped, but in many others they are crucial and there are no alternatives. Substituting subjective evaluations for the rigor of crucial analytical testing makes little sense in science, at least in the final, conclusive stages of a research project. I think that at exploratory stages you should not be constrained by these rules, but for a final decision they are the only approach in a wide range of contexts, and I doubt that the natural or social sciences in general are going to embrace a subjective discussion of conclusive experimental outcomes. Having said that, I agree that mechanically reporting all p-values for each coefficient in a regression is not particularly informative, and many reviewers interpret them wrongly.

    • Jose:

      P-values are fine for what they are, and as we explicitly write in our paper, we’re not trying to ban them. What we want to abandon is the use of null hypothesis significance testing as the central step in a widely applied and, we believe, generally inappropriate lexicographic decision rule. We believe that the significance test is one piece of evidence, and typically far from the most important piece of evidence, in making a scientific decision.

      Regarding your concern about “subjective discussion”: I’m sorry but we’re already there. What are Susan Fiske’s editorial judgments for PPNAS if not subjective?

  13. Incidentally, some comments seem to suggest that this is a “frequentist” issue, but you may need to test the significance of a parameter in either a Bayesian or a frequentist context. If you are Bayesian and you think that the parameters came from a prior and a specific value of the parameter generated the sample, the posterior distribution actually concentrates asymptotically on this parameter, and you can do significance tests about the value of the parameter.

  14. I wonder about one usage, but perhaps I am misusing percentiles and p-values. Imagine we have a set of numbers obtained from some biological system. We would like to see if this set of numbers could have been generated by chance (or, more specifically, which values in this set can be attributed to chance and which to some process that we think is going on in the biological system). In order to test it we shuffle/modify (by a Monte Carlo method) some biological properties N times, obtain NxM new numbers, and draw a distribution from these numbers. We then take the 99th or 95th percentile of that distribution. With that number we come back to the original set of values and say: all the original values above the 99th percentile were not obtained by chance. Do you think this is an OK approach, or could something else be recommended?
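
    In code, the procedure I have in mind looks roughly like this (fake numbers, just to make the question concrete):

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    # Fake stand-ins: M observed values from the biological system, and N shuffles
    # that each produce M "chance-only" values.
    M, N = 200, 1000
    observed = rng.normal(0, 1, M)
    observed[:3] += 4.0                  # pretend a few values reflect a real process

    null_values = rng.normal(0, 1, (N, M)).ravel()
    threshold = np.percentile(null_values, 99)

    flagged = np.where(observed > threshold)[0]
    print("99th percentile of the shuffled distribution:", round(float(threshold), 2))
    print("flagged original values (indices):", flagged)
    # Caveat: with M values compared against one threshold, about 0.01 * M of them
    # will land above it by chance alone, so "flagged" is not the same as
    # "not obtained by chance".
    print("expected chance flags:", round(0.01 * M))
    ```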

  15. I was thinking about the paper, and I’d argue that this is addressed to journal editors, not researchers.

    Researchers are supposed to be building hypotheses and gathering data to test them. Even when they find something, they shouldn’t conclude much on the basis of their single study – it’s just incremental evidence, and should ideally be judged that way. The important question the researcher needs to answer in designing the study is about the goal of the study – which involves concerns mentioned in your paper, about Statistical Decision Making and optimal sample sizes that depend on a loss function. (Which the Lakens paper that I contributed a very little bit towards addresses a bit more in depth.) But those decisions should reflect the evidence needed in a Value of Information sense, not the analysis method or power calculation for a given alpha. So the use of p-values should be mostly irrelevant to this process, up until it comes to actually submitting results to a journal. But then…

    Journal editors need some convenient standard to judge newsworthiness of a result. This isn’t important for the advancement of science, as all methodologically sound empirical investigations of a well formulated and important hypothesis in a given domain are plausibly valuable as data. It is instead important for journals who need to maintain prestige by publishing noteworthy results. Because journal editors have this pressure, they are stuck needing a fairly easy standard to apply for noteworthy, and p-values are easy to check. The problem is that in science, noteworthy is the wrong standard – we’d prefer solid results about interesting hypotheses, and good ways to test novel theories that provides evidence regardless of where the results point.

    So for those editors, you are recommending abandoning p-values. That’s fine, but it doesn’t get to the root of the problem with the standard – which is an emphasis on newsworthiness, and that can more easily be fixed by moving to pre-registration and open access to data at the end of a pre-registered trial. And then people can use whatever methodology they want – even if NHST is silly, which I’d agree it is, they can do those analyses and present the results, and if needed, later someone else can do something useful with the data instead.

    • David:

      1. As you say, the p-value is not a measure of noteworthiness. The p-value is a measure of consistency of some subset of the data with some particular random number generator that is not typically of scientific interest.

      2. Preregistration is fine but it doesn’t work in lots of the work in fields such as political science and economics that use observational data, and it doesn’t resolve the problems of poor measurement and poorly designed studies; see for example the discussion near the top of p.2 of this paper.

      3. Journal editors use p-values in a lexicographic decision rule but they still need to use other factors to decide what papers to accept. We’d like editors to take these other factors more seriously and stop hiding behind p-values. For example, if Susan Fiske wants to publish papers on himmicanes, or if JPSP wants to publish papers on ESP, fine, but in either case they should be able to justify their decision, not just say that the papers had p less than 0.05 so that’s enough.

      4. Our paper is not just addressed to journal editors, nor does it just address the journal publication process. We are interested in data-based decision making more generally. See the last sentence of page 8 and all of page 9 of our paper.

      • Thanks for the response.

        2) I certainly agree about the issues with political science and econ; as I’ve said before ( https://medium.com/@davidmanheim/the-good-the-bad-and-the-appropriately-under-powered-82c335652930 ) there are questions that can’t be answered with better statistical tools, because the samples involved are limited. That’s a fundamental limitation for questions where new samples cannot be generated. I just had a closely related discussion regarding a re-analysis of data about wars: http://oefresearch.org/blog/debate-continues-peace-numbers – they essentially used p-values to conclude that the evidence they re-analyzed doesn’t reject the null hypothesis that there is no change, thereby “arguing with” Pinker. This was based on previously collected data, and the method of considering p-values for a new, more complex model to ask whether the evidence supports a previous claim seems even more conceptually broken than the usual use of p-values.

        3) Agreed.

        4) I’m not saying it’s actually addressed to Journal Editors. I managed not to say it explicitly, but I was contrasting this with the Lakens paper (which I was Nth author on and contributed a very, very little bit towards), which was addressed much more to Researchers. The idea that the choice of alpha should be justified is very closely related to your points about how to choose what to research – but there, we had even more (implicit) focus on how to choose your sample size and study design. The idea that one should abandon NHST has implications for study design, but you focused more closely on what to study based on interpreting previous research, and on how to analyze the data. That said, obviously, this is all conceptually related; designing good studies to maximize posterior surprise requires both interpretation of previous results to choose what to study, and implicitly choosing something like an SDT loss function to figure out what to look at, and how hard. What it shouldn’t involve, but currently does, is figuring out what sample size will get you p<0.05.

  16. As an applied statistician, I greatly applaud this work and the direction it is pushing in. I can’t wait for the day that other researchers don’t tune out everything I do until I get to “p ___ 0.05”.

    With that said, I have two comments. One is that, personally (and I don’t expect anyone to rewrite things to please me!), I would like the philosophical argument about the worth of a significance test and the empirical argument about the widespread over- (and mis-) use of statistical tests to be brought up as two different points. The reason I say this is that too much space, even on this blog alone, has been devoted to statisticians of different backgrounds arguing whether p-values are worth $0.06 or $0.0001 to the scientific process…but I’ve never encountered a statistician who thinks the cost of the current use of significance testing in modern science is anywhere under $20.00 (the scale is arbitrary but the ratio is not)! So rather than having tired, circular philosophical arguments about the exact value, I think it’s much easier to achieve actionable agreement that, regardless of whether you feel significance testing *can* be useful in the right hands, we for the most part agree that the value it brings is not worth the cost.

    Second, I’m curious what people’s thoughts are about BASP. Now that we’ve got two years of examples of what happens when a psychology journal bans p-values, do these articles appear to be more insightful than those in journals that allow p-values? This is not a rhetorical question: I’m not familiar enough with either the journal or the general field to make an informed call on that. I understand this is not getting at exactly what happens when significance is abandoned (i.e., it could be just what happens when significance is not achieved), but it should provide at least a little bit of insight into the early effects of abandoning significance.

    • A:

      Regarding your second paragraph: I think philosophical arguments have their place, as there are many ways to try to understand the world. I have spent most of my career keeping quiet about foundations and just doing stuff, and writing books demonstrating how to do stuff. When writing Bayesian Data Analysis and then again while writing Data Analysis Using Regression and Multilevel Models, I consciously avoided arguments about what methods are better or worse, instead focusing on good practice and the derivations of such methods. And had I spent the past thirty years doing nothing but screaming about hypothesis testing, I think I’d be a lot less happy and the world would be a poorer place.

      Here’s what happened: I’ve had problems with null hypothesis testing for a long time, but for a long time my way of expressing this view was simply to not use such methods except in the rare cases where they seemed appropriate to me. I also explained this position in some theoretical articles, such as my 1995 paper with Rubin on avoiding model selection, my 2000 paper with Tuerlinckx on type M and S errors, my 2003 paper on exploratory data analysis and goodness-of-fit testing, and my 2011 paper with Hill and Masanao on multiple comparisons.

      Incidentally, one of the cases where I did find some version of null hypothesis significance testing to be useful was in exploring problems with a model used for medical imaging analysis. I felt I got something from measuring the distance of the data to a model that I’d been using. This was in my Ph.D. thesis and it motivated my later work on posterior predictive checking, which started with a focus on Bayesian p-values but has since moved toward graphical visualizations.

      Anyway, my crusade against NHST in recent years came by accident, as I happened to encounter various bad published papers, starting with those of Satoshi Kanazawa and continuing with the well-publicized work of Daryl Bem and all the rest that we’ve been hearing so much about, and I started to realize that the problems with all this work were not just a bunch of individual data-processing mistakes of the Wansink variety, but a larger problem: when classical “p less than 0.05” methods are used in an attempt to extract signal from extremely noisy data, what will be extracted is something close to pure noise. I got involved in some particular disputes and then it seemed that it made sense to think harder about the general issues. Working on all of this has deepened my own understanding of these problems, and I feel that my work in this area has been a contribution, I hope in some part by motivating researchers to think more carefully about data quality rather than thinking that, just because they have a randomized experiment or a regression discontinuity or whatever, they can just push the buttons, grab statistically significant comparisons, and claim victory.

      “Circular philosophical arguments” have necessarily played some role in these discussions, in part because classical NHST methods are supported by some theory. The math isn’t wrong but the assumptions don’t really apply—at least, not in many of the sorts of application areas where I see those methods being used—and so it’s kinda necessary to explain why the assumptions don’t apply, to give a sense that, yes, in some settings NHST can be theoretically supported but not in these sets of problems.

  17. More LIGO: with each paper that gets published on the topic, the misunderstanding of statistical significance seems to get more brazen:

    Both pipelines identified GW170814 with a Hanford-Livingston network SNR of 15, with ranking statistic values from the two pipelines corresponding to a false-alarm rate of 1 in 140 000 years in one search [39, 40] and 1 in 27 000 years in the other search [41–45, 57], clearly identifying GW170814 as a GW signal. The difference in significance is due to the different techniques used to rank candidate events and measure the noise background in these searches, however both report a highly significant event.

    https://dcc.ligo.org/LIGO-P170814/public/main

    Here they come right out and say, “We rejected the background model with p-value = 1/140,000. This number is very small, therefore this event was a GW signal (from two black holes of specific size, etc., colliding).” Given that this is one of the highest-profile science projects of the last few years (if not the highest), I don’t think any abandonment is coming soon.

    • This may well be a case in which NHST is an appropriate tool. I don’t know for sure, but it wouldn’t surprise me if both (a) the null model in this case is a well-formulated quantitative model of the system in question without a gravity wave signal, and (b) the experiment was designed and controlled sufficiently well to rule out any plausible cause of a signal other than a gravity wave.

      • the experiment was designed and controlled sufficiently well to rule out any plausible cause of a signal other than a gravity wave.

        They definitely do put a lot of effort into that, to the point of publishing entire separate papers about it (which is good). But shouldn’t each “ruling out” of each alternative have some kind of probability associated with it? Then plug all that into Bayes’ rule and say “assuming nothing we didn’t check for is going on, here is the probability it is a GW”.

        Clearly they do not rely only on predictions of GR regarding inspiraling black holes, or they wouldn’t do this type of “reject background” analysis. Many other things must cause such signals that correlate across sites in their equipment as well, right?

  18. Deborah Mayo writes (on twitter):
    “To claim we shouldn’t test for the (statistical) significance of observed differences is tantamount to recommending we ban Modes Tollens [sic].”

    I replied:
    “We dont have modus tollens. How does that exist with probabilistic statements?”

    Mayo:
    “it’s stat MT practically all we ever have: With several tests: Prob(test T < do; Ho) = very high, yet reliably produce d ≥do, thus not-Ho."

    Me:
    "But with that logic, all sorts of inferential methods could be used. If H0, then BF 0|y) = .99; therefore not H0.”

    Am I wrong here?

    Essentially nil-null hypothesis testing isn’t an expression of modus tollens, because modus tollens doesn’t include a probabilistic statement. “If A, then B. Not B, therefore not A” doesn’t really work when you have “If A, then probably B; probably not B, therefore probably not A”.
    But if you generalized modus tollens to just be, generally, “If A, then probably B; probably not B, therefore probably not A”, then lots of inferential frameworks would permit modus tollens anyway. “If H0, then theta probably negative; theta probably not negative, therefore probably not H0”; and that’s a posterior statement regardless of NHST or significance testing.

    • Stephen:

      You quote Mayo as saying, “With several tests: Prob(test T < do; Ho) = very high, yet reliably produce d ≥do, thus not-Ho." That's all fine, but in just about every problem I've seen, I know ahead of time that Ho (the null hypothesis of zero effect and zero systematic error) is false. So inferring "not-Ho" doesn't really do anything for me. As I've also said many times in various blog comments and elsewhere, I do think there are problems where Ho could be approximately true, in examples such as genetics or disease classification, examples of discrete models. But I don't think this reasoning applies in settings such as "the effects of early childhood intervention" or "the effects of power pose" or "different political attitudes at different times of the month" or "the effects of shark attacks on voting." These are settings where we have every reason to believe that there are real effects, but these effects will be highly variable, unpredictable, and difficult to measure accurately. Rejecting Ho tells me nothing in these settings.

      • Or,

        (a) If it rains the grass will probably be wet;

        (b) The numbers of non-wet and wet grass observations in my sample would rarely be seen after a rainy day;

        (c) Therefore, either it probably didn’t rain or rain probably doesn’t make the grass wet

        isn’t Modus tollens. But it does sound a lot like the problem we’ve been discussing.

        • Thanatos:

          Yes, whether it rains or not is a discrete outcome, and for this, discrete models make sense. Similarly, I think discrete models make sense for some questions of the form, is Trait A associated with Gene B. But I don’t think discrete models make sense for questions of the form, is embodied cognition a real thing.

        • Which piece of grass are you observing?

          What is the null model for ‘rainy day’?

          Do you include whether or not sprinklers are nearby?

          If I spill a glass of water on the grass does that count as wet enough?

          These might seem like trolling questions but this sort of thing can matter in the high noise, highly variable and imprecisely measured situations we’re typically talking about.

          Even in super precise contexts we need more than basic logic – that’s why we have mathematics! Most of the philosophy of science I’ve read seems to have a real obsession with starting from things like ‘H: Newton’s theory’. I find it hard to take anything that starts from that point too seriously.

          In my view scientific theories are not (or almost never) simple Boolean propositions and hence don’t have simple truth values as such. Some people seem to freak out at this and think everything is about to go postmodern but I think that’s more like the argument that you can’t be good without god than anything too serious.

      • Andrew:

        I agree; I didn’t mean to imply otherwise. I don’t like dichotomization of evidence at all. I tend to take the stance that embracing uncertainty is useful, and that if a decision is to be made, it’s with respect to some posterior uncertainty (or that + a utility function). I’d rather describe a posterior distribution, or at the least some estimate + some error, and either let the reader decide for themselves or formalize a decision in the context of theory or prediction. E.g., we have had posteriors on interactions that were essentially N(0, .1), and I more or less said “if there’s an effect, we can’t tell whether it’s + or -, but regardless it appears to be tiny and negligible.”

        But with that said, I just meant that IF one makes a decision about some substantive hypothesis, then by the logic of probabilistic modus tollens, posterior quantities (basically non-NHST-derived quantities) can tell you the same thing. IF you’re going to make a decision about some substantive hypothesis, you can do so via the posterior: “If H0, then the effect is negative; the effect is probably positive, therefore not H0.”

        I posted this elsewhere recently, which is relevant to this whole discussion.

        I think one can say that the VAST majority of nil-nulls are false, to the point where nil-nulls should be the exception, not the rule.

        And exactly: If nil-nulls are false, then there really isn’t a point in using a nil-null hypothesis test.

        True nulls can exist, in very very few cases. E.g., when a physical law is in question. Like… ESP would actually require some rethinking about foundational physical laws if it were true; given our physical constraints, there truly can be a ZERO effect, as in it truly cannot occur.

        But in the social sciences, we rarely have these sorts of questions, because most of our questions don’t require rewriting a constraint due to a physical law. Otherwise, everything is connected in some way, and out to SOME decimal place, there is an effect of everything on everything else.

        The only realistic cases in psych where some statistic may be truly zero are when the measurement sucks too badly to even detect a difference in the population. Like, imagine that on some latent continuum the statistic is .000001, but our measure only permits .1 differences at its smallest. Then even if we had every single person in the population, we would see zero difference, but only because our measure sucks, not because there’s literally zero difference. In this case, conditional on our measure the true effect could be “zero”, in the sense that asymptotically and with everyone in the population, the true value is zero; but that is not true unconditional of our measure. With a more precise measure, the true value would NOT be zero, but perhaps .000001.
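        A toy numerical version of that measurement point, in Python (the latent values and the 0.1 resolution are made-up numbers):

          import numpy as np

          latent_a = 0.300000   # latent group means, differing by .000001
          latent_b = 0.300001
          resolution = 0.1      # smallest difference the measure can register

          measured_a = np.round(latent_a / resolution) * resolution
          measured_b = np.round(latent_b / resolution) * resolution

          print(measured_b - measured_a)  # 0.0: zero effect conditional on the measure
          print(latent_b - latent_a)      # ~1e-06: nonzero effect on the latent scale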

    • I love this last sentence:

      “If H0, then theta probably negative; theta probably not negative, therefore probably not H0”; and that’s a posterior statement regardless of NHST or significance testing.

      Yes – and I think the generalization of modus tollens being used in practice is more like “if A, then probably not B; B, therefore probably not A”:

      If H0, t(x) probably not greater than t*; t(x) greater than t*, therefore probably not H0.

      Now, I know Mayo and other proponents of significance testing don’t actually use this reasoning. They know that, in classical statistics, we don’t get to say “probably not H0”; we can only say that we reject or fail to reject H0 according to a procedure that has a given error rate, and the strength of our inference rests upon the error rate of the procedure.

      But I strongly suspect most significance tests are performed and interpreted by people who have “p<0.05, therefore probably not H0" in their heads. And I think this is because the frequency interpretation of probability is not intuitive, except in the canned examples taught in intro stats classes. I think that most people treat probability as quantifying uncertainty, and so probabilities can apply to truth statements, e.g. H0. And since probabilities can apply to truth statements, the p-value can (and does!) tell us the probability of H0. I brought this up once on Mayo's blog and my interpretation of her response is that she doesn't agree that scientists typically think this way – I think she sees the natural way of applying probabilistic thinking to scientific questions as "there is some fixed truth out there in the world, and probabilistic statements cannot be made about this truth, they can only be made about data gathered by a random procedure." And so this whole brouhaha about people misinterpreting p-values and misinterpreting the results of significance tests comes down to the critics of significance testing unfairly assuming that users and consumers of significance tests don't really understand how they work. Hopefully I'm not misrepresenting her.

      Needless to say, I don't think that (most) users and consumers of significance tests really understand how they work.

      • I heartily agree with your last sentence. (In particular, I’ve seen a lot of textbooks that get it wrong, or at least explain significance tests in such a watered-down version that it’s subject to misinterpretations.)

      • If Mayo in fact does believe that most people think in a way approximately captured by your quotation marks (“there is some fixed truth … random procedure”), then I would be fascinated to read her defence of this belief.

        • Kyle, I looked back over the comments thread in question and I put some words in her mouth in my post above. Here it is:

          https://errorstatistics.com/2017/08/10/thieme-on-the-theme-of-lowering-p-value-thresholds-for-slate/

          She does state that the classical “error probabilistic” method is “the most natural way in which humans regularly reason in the face of error prone inferences”, and that “you could say that when people assign the probability to the result they are only reflecting an intrinsic use of probability as attaching to the method”. This was in response to my claim that when people interpret the p-value as “the probability the results occurred due to chance”, they are mistakenly interpreting the p-value as the probability of the null. I take her response to mean that she thinks people who interpret the p-value as “the probability the results occurred due to chance” are aware of the fact that the probability statement refers to the error rate of the procedure and not the null itself.

          The part about “fixed truth” was embellishment on my part.


        • Mayo’s prose is so impenetrable that she can always claim to have been misunderstood. Perhaps even “impenetrable, so that …. “

        • Gee I sure hope that my new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars is much clearer than whatever you have read. I accidentally came across this post today. I’d love to reply and (in several cases) correct the intended logic of testing, but it would be too, too much. All I can do is ask that you try my new book. Thanks.

    • With several tests: Prob(test T < do; Ho) = very high, yet reliably produce d ≥do, thus not-Ho.

      Modus Tollens applies to one-off situations. If we have “If A, then B,” then a single observation of not-B suffices to conclude not-A. In addition to this not working in general when probability is added to the mix, this (appropriate) mention of reliable production of particular results makes it clear that we’re not talking about one-off situations. Rather, we’re talking about exactly the kind of thing that McShane et al discuss (among many others, including Mayo in other places), namely how design, measurement, and statistical tools relate to each other, and how these jointly license inferences.

      This quote also makes it look an awful lot like Mayo granting that a single statistically significant result carries little, if any, evidential weight, since by itself it doesn’t indicate that we can, in fact, reliably produce a particular effect. This bolsters the argument from McShane et al. that statistical significance should not be used as a publication filter.

  19. You’re not wrong; there’s more than one extension of logic to values beyond just the elements of the set {true, false} that can be used for data analysis, each with its own version of extended modus tollens (but we all know which one is the best one, nudge nudge, nudge nudge, know what I mean, say no more).

    A quibble: even in the probabilistic version the minor premise is just “not B” — the data are whatever is observed, by definition. That is, the p-value version of modus tollens goes,

    Major premise: If H_0 holds then the probability of observing a realized value of X less than x is some value p close to 1.
    Minor premise: X = x.

    Therefore, H_0 is false or else something improbable has happened.

  20. Andrew,

    I read your paper and this post with great interest. It’s brilliant and thought-provoking, as usual.
    But, nevertheless, I must say that I am a bit flabbergasted by one point: while you remind us of the different flaws of the p-value, your proposal for the future does nevertheless still include p-values in the general process of statistical learning and decision making. In my view, this may considerably blur the message.

    Maybe I am totally mistaken and the p-value may have some use, but which one? I really don’t understand, and if it does have a use, what is wrong with all the papers that have for so long pointed out its flaws?

    My understanding of the issue is that the problem lies not (only) in the .05 threshold, or whatever value it may take, but in the definition and nature of the p-value itself and its native defects: non-respect of the likelihood principle, lack of incorporation of prior information, etc. Moreover, the Cox theorem indicates that the only valid interpretation of probability must be done based on the Bayesian paradigm. Then, whether or not they are blended into a more global view of the statistical analysis of a study, p-values still have those defects. What is the use of a defective tool, and thus what is the use of including it in statistical reflection?
    So, I fear that keeping the p-value, even incorporated into a more flexible and fluid reflection on the data and after getting rid of any magic threshold, will not be accompanied by better reflection. I work as a biostatistician in a French medical university and I can assure you that almost all my clinician colleagues cut the information first (more or less than 0.05) and only then (sometimes) discuss the methodology of the study, while we should do the reverse and publish all results, be they positive or negative. The literature has wrongly and empirically but heavily taught them for so long to cut anything that comes up at the 5% level. As long as there is something to cut, they will cut it first and reflect afterwards. It will take decades to get rid of p-values if physicians are “authorized” to use them, even without a threshold and even alongside other statistical tools.
    The only way out of this mess is to completely get rid of p-values. The ASA meeting on “the world beyond p < 0.05” is coming shortly. I think a very clear message must be given there, otherwise it will be difficult to advocate a p-value-less world. Letting p-values make their way into a paper will only muddle a study’s message. My colleagues judge almost exclusively by the p-value, and habits are such that if p-values continue to appear in the results, they will still be the only piece of evidence relied upon. I would bet on that: I am certain to win. The reproducibility issues and the selection of only positive results will then go on.

    • “Moreover, the Cox theorem indicates that the only valid interpretation of probability must be done based on the Bayesian paradigm”

      No, this is not true: the Cox theorem says that the only generalization of predicate logic that agrees with binary logic in the limit is probability calculus.

      The frequency interpretation of probability is a totally valid mathematical construct. It corresponds to the behavior of certain types of infinite sequences of numbers. The question really is whether it has anything to do with data collection in scientific enterprises.

      p-values do exactly one thing: they tell you how probable a certain range of test statistics is if the data comes out of an algorithmically random mathematical sequence. There is nothing wrong with this logic. What’s wrong is *applying this logic where it’s inappropriate*.

      although I agree with your diagnosis of what is wrong with people who use p values and how they cut first and ask questions later… I think it goes too far to say that we should somehow “ban” the p value. What we need to ban is science done by people who don’t care about logic.
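      To make the “p-values describe a random number generator” point concrete, here is a minimal sketch in Python; the data, the choice of standard normals as the null generator, and the mean as test statistic are all just illustrative assumptions:

        import numpy as np

        rng = np.random.default_rng(1)

        # Made-up data and a test statistic (the sample mean).
        data = np.array([0.8, 1.2, 0.3, 1.9, 0.7, 1.1, 0.4, 1.5])
        t_obs = data.mean()

        # The "particular random number generator": i.i.d. standard normals with
        # mean zero. The p-value is the probability that this generator produces
        # a test statistic at least as large as the observed one.
        n_sim = 100_000
        t_null = rng.standard_normal((n_sim, data.size)).mean(axis=1)
        p_value = (t_null >= t_obs).mean()
        print(f"one-sided p-value under this null generator: {p_value:.4f}")

      Nothing in that calculation refers to any scientifically interesting alternative, which is the point being made above.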

      • Daniel,

        Thank you for your clarification of the Cox theorem.

        “The question really is whether it has anything to do with data collection in scientific enterprises.” That is THE question. I agree, of course, with the frequentist interpretation of probability and recognize that the p-value may be useful, but only in specific situations, for instance to check the properties of an estimator, a situation that pertains to the statistician, not the lay user of t-tests and chi-squares.
        But, while I understand and see the logic in the computation of a p-value in the framework of NHST, I do not understand the usefulness of “data more distant from the null than the observed data” in decisional problems. This is the specific context I wanted to talk about. From a practical (at least medical) point of view, I don’t see how they could be justified (hence my remark on the likelihood principle). It is as if a physician considered a body temperature of 40° when in fact the patient is at 39°. Non-observed data are not relevant to the process. Do you have a real, practical example of a situation in which using a p-value would be really useful?
        A ban on p-values may be too strong an answer, but I stand by the point that the message to the community of users must be very clear. Maybe the real or major problem is that the majority of consumers and producers of statistics are not statisticians themselves. As I always say to my colleagues: I do not do surgery in my kitchen as an amateur, so please do not do statistics on your own. They will go on using a recipe, and the recipe must be as clear as possible.

      • > No, this is not true, the Cox theorem says that the only generalization of predicate logic…

        Very off topic but Cox’s theorem doesn’t generalise predicate logic, it ‘generalises’ simple _propositional_ logic. Extending probabilistic logics to interact well with quantifiers is to some extent an unsettled question.

  21. I am clearly not as smart or as familiar with the issue as you guys, so please correct me if I misunderstand the point. From what I can tell, the argument is that it is not possible to adequately describe and evaluate a complex system. The crisis in the biomedical and social sciences comes from the fact that true empirical evaluation is impossible, and that creates a replication problem that can be ‘solved’ by ensuring that those who are smart enough to cut through the verbiage will understand that the results are not particularly meaningful in a general way. But that leads me to question the need for all of the supposed ‘scientific’ testing in the first place. Why not just say that the uncertainties are large, so anyone wishing to use a particular product for medical purposes is taking risks that are not easily quantifiable? That would mean we would not have to waste massive amounts of resources on bureaucratic approval processes that deny individuals access to possible cures, and manufacturers of dangerous products could not hide behind a faulty approval process.

    Can any of us imagine what someone like Richard Feynman would say if we talked about 95% confidence intervals being meaningful in any way?

    • “From what I can tell, the argument is that it is not possible to adequately describe and evaluate a complex system.”

      I think this is often the case, but I’m not sure it’s the crux of the problem. Even applied to relatively simple systems, significance testing combined with noisy measurements and small effects and publication bias and flexibility in analyses and hypothesizing will lead to the same problems – people using patterns in noise to fool themselves and others into thinking they’ve discovered underlying truths.
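      Here is a minimal simulation of that combination (small true effect, noisy measurement, and a “publish only if p < 0.05” filter); the numbers are made up and the filter is a caricature of publication bias:

        import numpy as np

        rng = np.random.default_rng(2)

        true_effect = 0.05   # small real effect
        se = 0.5             # noisy measurement
        n_studies = 10_000

        estimates = rng.normal(true_effect, se, size=n_studies)
        significant = np.abs(estimates / se) > 1.96   # two-sided p < 0.05
        published = estimates[significant]

        print(f"share reaching significance: {published.size / n_studies:.3f}")
        print(f"mean |published estimate|: {np.abs(published).mean():.2f} vs true effect {true_effect}")
        print(f"share of published estimates with the wrong sign: {(published < 0).mean():.2f}")

      Under these made-up numbers, the published subset exaggerates the magnitude of the effect many times over and gets the sign wrong a substantial fraction of the time, even though every individual study was analyzed “correctly.”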

      I do agree that significance testing can be especially dangerous in complex environments insofar as it leads people to think in simple terms and downplay uncertainty.

      • “I do agree that significance testing can be especially dangerous in complex environments insofar as it leads people to think in simple terms and downplay uncertainty.”

        +1

  22. Wish I’d seen this when it first appeared, when I was writing this opinion piece: https://www.americanscientist.org/blog/macroscope/dont-strengthen-statistical-significance-abolish-it

    The lexicographic gatekeeping function (or as I call it “decision criterion science”) is pernicious in just so many ways, and prevents a clear understanding of the true evidentiary value of each given result, and how it must be considered in combination with others.

    • Shlomo:

      You write, “Nothing magical happens when going from a p-value of 0.051 to one of 0.049.” But it’s much worse than that! Actually, nothing magical happens when going from a p-value of 0.005 (z=2.8) to one of 0.2 (z=1.3): the difference between these z-values is a mere 1.5, which is nothing remarkable at all, given that the difference between two independent random variables, each with standard error 1, will have standard error 1.4. This was the point that Hal Stern and I made in our paper, The difference between “significant” and “not significant” is not itself statistically significant. The problem is not just with an arbitrary threshold; it’s the more fundamental issue that the p-value is a very noisy statistic. Practitioners seem to think that p=0.005 is very strong evidence while p=0.2 is no evidence at all, yet it’s no surprise at all to see both of these from two different measures of the very same effect.
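      A quick simulation of that point, assuming (purely for illustration) a true standardized effect of 2, measured by two independent studies whose z-statistics equal the true value plus standard normal noise:

        import numpy as np

        rng = np.random.default_rng(3)

        true_z = 2.0
        n_pairs = 1_000_000
        z1 = rng.normal(true_z, 1, n_pairs)
        z2 = rng.normal(true_z, 1, n_pairs)

        # The difference of two independent z-values has standard deviation
        # sqrt(2) ~= 1.4, so a gap of 1.5 (z = 2.8 vs z = 1.3) is unremarkable.
        print(f"sd of z1 - z2: {np.std(z1 - z2):.2f}")

        # How often does one study look "highly significant" (z > 2.8, roughly
        # p < 0.005) while the other looks like nothing (|z| < 1.3, roughly p > 0.2)?
        split = ((z1 > 2.8) & (np.abs(z2) < 1.3)) | ((z2 > 2.8) & (np.abs(z1) < 1.3))
        print(f"share of study pairs split across the two verdicts: {split.mean():.2f}")

      Under these assumed numbers, roughly one pair in ten splits this way, which is the sense in which seeing p=0.005 in one study and p=0.2 in another is no surprise at all.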

      • Wow – a significant point indeed! This is something that clearly needs to be widely known. The “magical threshold” problem, of course, is much more general.

        (Would you mind writing this as a comment to my post as well, for that readership, or may I copy it there? Thanks!)
