Zoologist slams statistical significance

Valentin Amrhein writes that statistical significance and hypothesis testing are not really helpful when it comes to testing our hypotheses. I’m not quite sure I like the title of Amrhein’s post—“Inferential Statistics is not Inferential”—as I think of parameter estimation, model checking, forecasting, etc., all as forms of inference. But I agree with his general points, and it’s good to see these ideas being expressed all over, by researchers in many different fields.

47 thoughts on “Zoologist slams statistical significance”

  1. It depends. If you take Popper’s account to mean “all we can know is what isn’t so,” then improbable results, ill-fitting models and inaccurate predictions seem to be a pretty sensible way of inferring that a particular hypothesis hasn’t the advertised explanatory power. But if you’re trying to infer whether a hypothesis is true from a successful test of it (which I glean is what Amrhein thinks such tests are for), then you’re in for perpetual disappointment. As A.B. Hill said:

    “All scientific work is incomplete – whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. That does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time.

    ‘Who knows’, asked Robert Browning, ‘but the world may end tonight?’ True, but on available evidence most of us make ready to commute on the 8:30 the next day.”

    What’s needed then is a practical theory of evidence that includes at least (to make up words) testedness, disconfirmedness and generalizableness. And as far as I can tell, like it or not, each will lead right back to drawing inferences from statistics.

    The problem is that Nature gives directions like my dear Mom: “You know that church with the huge oak tree on the corner?” “Yes” “Well, don’t turn there. Keep going. And you know where Elm Street runs into Pine?” “Yes” “Don’t turn there either. You’ll turn right on a street like Hawthorn with a grocery store on the block but don’t turn on Hawthorn Street because it’s not the right one.” You’re never told where to go, only where not to go.

    • This should be in my other post.

      popper0:

      Popper0 is the imaginary author of a vulgarised version of Popperian philosophy of science, a phantom created by Ayer, Medawar, Nagel and others. I discuss him only because he is much more widely known than the more sophisticated Popper1 and Popper2.

      • I would have thought that around here the example of Bargh’s priming train wreck, a progressive research programme if ever there was one, would have dampened enthusiasm for Lakatos.

        • My impression is that “priming” and the like is largely done via NHST. Lakatos called that type of research something like (not looking it up, but my paraphrase): “intellectual pollution that is in danger of destroying our cultural environment before we get a chance to destroy the physical environment”.

      • I just wonder why we even need to cite Popper or other deceased philosophers as much as we do. I think we have to have more confidence in our own intellectual abilities to formulate theory and practice. It’s only a handful of philosophers of science that are routinely cited. Popper leads the pack.

        • Without philosophy of science you get people putting almost no thought or effort into why they are doing whatever it is they are doing. Then you get beliefs like “p < 0.05 therefore it is true”, “written in the textbook therefore it is true”, or “passed peer review therefore it is true”. Currently philosophy of science is minimized or lacking altogether, and these types of naive beliefs are widespread.

          Personally, I think the philosophy and history of science should have a much more prominent role in graduate training.

        • I agree that philosophy and history of science should have a much more prominent role in graduate training; part of the challenge is preventing it from becoming, as one philosopher put it, “asking childlike questions and arguing over possible answers like lawyers”. Its study needs to be made scientifically profitable for those doing graduate training in their chosen empirical areas. I think Richard McElreath has nicely incorporated some of this in his Bayes course.

          On the other hand, often when someone cites Popper they have read very little of him directly, if at all. It’s sort of like claiming “I know a little something about philosophy of science, and that’s adequate.”

        • There’s not a shred of evidence studying philosophy or philosophy of science makes people better scientists.

          Interesting; I wonder how “better scientist” would be assessed, and whether this definition may actually be the problem here. E.g., “better scientists are those who publish more papers. You can publish more papers if you don’t know what a p-value means”, etc. It could indeed turn out that “better scientists” are the most ignorant of the bunch.

          Do you have a link to whatever led you to think this?

        • I don’t think there needs to be any evidence aside from the accounts of those who claim it helped shape the way they thought about the problems they study. Philosophy is only one of many paths to improvement of our methods. We might arrive at the same place simply by focusing on the importance of measurement, or coming across a discarded theory, or recognizing a flaw in a widely accepted approach.

        • The amusing thing is that if you ask a Philosopher of Science for evidence that instruction in their opinions makes scientists better, they’ll sputter something like:

          “Philosophy of science teaches how to evaluate evidence, therefore taking my class makes scientists better”

          And they’ll say it with a straight face, apparently unaware of the heaping layers of irony involved.

        • Anonymous. I hate to say this. But anytime one asks a specific question of a philosopher of science, one can bet that something incomprehensible is about to be written or spoken. I wonder if it is just bad writing and speaking. You cracked me up this evening.

        • Sameera, ponder for a moment the weightiest word in the opening sentence from your link – “systematic”. This idea, that science, like widget manufacturing, can be reduced to a Metropolis (the movie)-like factory whereby the product, science, is produced incessantly by the turning of the proper valves in the proper sequence, is what led us into the dead-end in which we find ourselves. That real knowledge, like my recent observations of the very weird (and smelly) orange fungus emerging out of the roots of the tree-smothering vine I finally killed last Fall, rejects the square categories into which I think the dead roots of tree vines ought to be roundly pounded, is a feature of Nature, and hardly a bug. But it is, perhaps, a clue.

        • Thanatos,

          That example of the ‘fungus emerging out of the roots of the tree’ is so spot on. Trying to fit phenomena into neat categories and graphs is so inadequate. And relying on the same few philosophers & philosophers of science can anchor us to ideation that may even mislead us in research. No, I think we have to apply more of our own imagination to the current science-related challenges before us.

        • I gather that a good percentage of thought leaders in nearly every field have some background in philosophy and philosophy of science, at least the ones I have interacted with. I am only suggesting that in statistics and psychology only a small percentage of philosophers and philosophers of science are repeatedly cited. I understand their inclusion is also meant to enable publication in journals. Some citations seem superfluous and may even retard the prospect of new thinking. I think Rex Kline has made a similar point.

          History of science and philosophy can be useful. Don’t get me wrong. But whether it has improved statistics or psychology is an open question.

  2. If you take Popper’s account to mean “all we can know is what isn’t so”

    This is the so-called “popper0” [1].

    What’s needed then is a practical theory of evidence that includes at least (to make up words) testedness, disconfirmedness and generalizableness.

    We have this: it is determining how well you can predict the future. All those philosophical problems with evidence, etc. are due to people who can’t do this very well trying to come up with reasons for their speculations to seem more important than they are.

    [1] Lakatos, Imre. 1968. Criticism and the methodology of scientific research programmes. Proceedings of the Aristotelian Society 69: 149–186.

  3. I quite like this bit:

    “I will then interpret the confidence interval as a “compatibility interval,” showing alternative true effect sizes that could, perhaps, be compatible with the data (if every assumption is correct and my measuring device is not faulty).”

    Has “compatibility interval” as a replacement for “confidence interval” been suggested previously? I googled a bit and couldn’t find an example (it apparently sometimes gets used for other things but not this).

    • I think “compatibility interval” arises out of earlier work by Sander Greenland, some of it likely joint with Amrhein.

      For instance see Amrhein’s link to Greenland in his post.

      Now, I do think a better title for Amrhein’s post would be “Inferential Statistics [as currently practiced in many/most fields] is not Inferential,” despite its being uglier.

      • Thanks for asking – indeed I borrowed “compatibility interval” from Sander, from the abstract of a talk he will give next week in Basel:
        http://bit.ly/2FjrpXW
        https://pphs.unibas.ch/curriculum/ws-bayesian-and-penalized-regression-methods-for-nonexperimental-data-analysis/

        I think Sander will explain this in a forthcoming paper, but he just sent me the following lines: “The compatibility interval is a natural way to summarize compatibility P-values across the range of a parameter, holding other assumptions fixed. There is no coverage rate or posterior probability expressed or implied; one can only say the models with values in the interval have little data information against them given those assumptions.”

        • Here is a quote from our review (https://peerj.com/articles/3544) that might be helpful:

          “In fact, the null hypothesis cannot be confirmed nor strengthened, because very likely there are many better hypotheses: ‘Any p-value less than 1 implies that the test [null] hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger p-value would be even more compatible with the data’ (Greenland et al., 2016). This can be seen when looking at a ‘nonsignificant’ 95% confidence interval that encompasses not only zero but also many other null hypotheses that would be compatible with the data, or, in other words, that would not be rejected using a threshold of p = 0.05 (Fig. 1C; Tukey, 1991; Tryon, 2001; Hoekstra, Johnson & Kiers, 2012).”
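
          To make that concrete, here is a minimal sketch (mine, not from the review; the estimate, standard error, and candidate nulls are all made up) showing that with a “nonsignificant” result many null values besides zero have p > 0.05, and that no tested null is more compatible with the data than the point estimate itself, where p = 1:

            # Minimal sketch: a made-up "nonsignificant" normal estimate, tested against
            # several candidate null values. Any p < 1 means the tested null is not the
            # value most compatible with the data; that is the point estimate, where p = 1.
            import numpy as np
            from scipy.stats import norm

            estimate, se = 0.8, 0.5            # hypothetical estimate and standard error
            nulls = list(np.linspace(-1.0, 2.5, 8)) + [estimate]

            for h0 in nulls:
                z = (estimate - h0) / se
                p = 2 * norm.sf(abs(z))        # two-sided p-value for this candidate null
                verdict = "compatible at the 0.05 level" if p > 0.05 else "rejected at 0.05"
                print(f"H0 = {h0:5.2f}: p = {p:.3f} ({verdict})")

          The values with p > 0.05 are exactly the ones inside the conventional 95% interval, which is the “compatibility” reading discussed in this thread.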

        • I’m guessing you meant this (which I expanded to tie it into my reply to Daniel below):

          Any observed P value may stem from a model violation to which P is sensitive (e.g., nonrandom selection); that is why, unconditionally (which is to say, in reality), small P-values do not require and thus cannot imply violation of the null, and large P-values do not require and thus cannot imply truth of the null. A small P does however alert us that the data and the entire model (set of assumptions) used to compute P don’t look very compatible according to the refutational information measure s = -log_2(p).

          – A point hopefully worth repeating, anyway.
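
          As a side note for readers new to the S-value: the transform is trivial to compute, and the bits scale has a handy (if rough) coin-tossing reading: s bits of information against the model is about as surprising as seeing that many heads in a row from a coin you had assumed was fair. A tiny sketch with made-up p-values:

            # Tiny sketch of the surprisal (S-value) transform s = -log2(p).
            # The example p-values are made up. The bits add across independent results,
            # since -log2(p1) + -log2(p2) = -log2(p1 * p2).
            import math

            def s_value(p):
                """Shannon surprisal of a p-value, in bits."""
                return -math.log2(p)

            for p in [0.5, 0.05, 0.005, 0.0001]:
                print(f"p = {p:<7} -> s = {s_value(p):5.2f} bits")

            p1, p2 = 0.05, 0.10
            print(s_value(p1) + s_value(p2), s_value(p1 * p2))  # additivity: the same number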

        • I was just pointing to the idea of focusing on an interval of parameter values (assess the data’s compatibility with a range of parameters values rather than just the zero effect) without going into the bigger picture.

          Part of the challenge was not having a paper to reference when I made the comment.

      • That’s it, thanks! But it’s funny – the only place I can find the term “compatibility interval” is in a seminar abstract. For example, here’s a seminar by Sander Greenland coming up next month:

        https://www.unibz.it/en/events/127187-the-unconditional-information-in-p-values-and-its-refutational-interpretation-via-s-values

        The seminar abstract concludes:

        “I thus recommend that teaching and practice reinterpret P-values as compatibility measures and confidence intervals as compatibility intervals….”

        But when I checked the seminar paper via the link provided, it’s this paper:

        Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G., “Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations.” Online supplement to the ASA Statement on Statistical Significance and P-Values, The American Statistician, 70.

        And while the paper has a lot of discussion of p-values, confidence intervals and compatibility, the phrase “compatibility interval” doesn’t seem to appear.

        Same thing for the 2016 Eur J of Epidemiology paper by the same authors – the discussion is there but the term “compatibility interval” is not.

        Curious….

      • I’m fine with likelihood functions and extensions as appropriate, but:
        First, those aren’t always available (as in high-dimensional/sparse-data confounding-control problems; see the new article by Athey, Imbens and Wager in JRSS B for a nice summary of that literature).
        Second, pure likelihood as I’ve seen it is interpreted in terms of relative support (supporting evidence) measured by likelihood ratios or their logs, e.g., as in Edwards’ book; for that purpose, explicit alternative models are essential.
        This relative support is quite different from compatibility as used for P-values (which is quite old usage in epidemiology at least; the 1-alpha compatibility interval follows as the interval for which P>alpha): following Fisherian tradition P supplies strictly refutational (negative) information; being a tail probability of a discrepancy statistic that summarizes the model’s misfit of the data, it needs no alternative model to compute or even interpret refutationally (a lot of folks realized Newtonian mechanics was being refuted by data before a decent alternative came along). It seems no surprise then that those of Popperian bent seem to prefer P-value functions over likelihood functions, even when they deplore Neyman-Pearson alpha-level testing (e.g., see Poole AJPH 1987, cited in Greenland et al. TAS 2016). And P can be computed easily in many cases where likelihood can’t (or computing likelihood requires additional assumptions that may be dodgy).
        I think keeping the profound distinction of likelihood and P evidence in mind helps reconcile pure likelihood with Fisherian testing as complementary approaches to measuring the information the data supply about models. That skirts the implications of very regular cases in which the integrated likelihood function and P-value function coincide numerically (in which P becomes a limiting posterior probability as the prior goes flat); I like to imagine there is some profound truth lurking in that coincidence, but some say it’s just one more deceptive oversimplification spawned by normality.
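
        For readers who have not met a P-value function before, here is a bare-bones sketch (a single normal mean with a toy estimate and standard error of my own choosing) of the point above that the 1-alpha compatibility interval is just the set of parameter values with P > alpha:

          # Bare-bones sketch (toy numbers): a two-sided p-value function for a normal
          # mean with known standard error; the 95% compatibility interval is read off
          # as the set of parameter values with p > 0.05.
          import numpy as np
          from scipy.stats import norm

          estimate, se = 1.2, 0.6                # hypothetical estimate and standard error
          grid = np.linspace(-1.5, 4.0, 1101)    # candidate parameter values

          p_fun = 2 * norm.sf(np.abs((estimate - grid) / se))   # p-value function p(mu)

          compatible = grid[p_fun > 0.05]        # values not rejected at the 0.05 level
          print(f"95% compatibility interval ~ ({compatible.min():.2f}, {compatible.max():.2f})")
          print(f"Wald 95% interval          = ({estimate - 1.96*se:.2f}, {estimate + 1.96*se:.2f})")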

        • > I like to imagine there is some profound truth lurking in that coincidence, but some say it’s just one more deceptive oversimplification spawned by normality.
          Hope for the first, plan/act for the second.

          > And P can be computed easily in many cases where likelihood can’t
          My sense was the opposite analytically; is the P being computed by Monte Carlo?

        • I was thinking of semiparametric models where the “nuisance parameter” is an unknown continuous function. A paradigmatic example (at least for me) is the g-test of Robins for the survival-time scale (aging-acceleration) factor in continuous structural nested failure-time models with confounders, where even without censoring the likelihood function is inconsistent if no assumption is made about the baseline survival distribution to reduce it to finite dimensionality (Neyman-Scott revisited), and unlike the proportional-hazards case no partial likelihood is available; but a consistent P-value function is easily constructed using rescaled failure times. Then there are the examples in Robins & Ritov and Robins & Wasserman where likelihood-based inferences are inconsistent but again a consistent P-value function is available from the inverse allocation or sampling weights.

          I don’t see these examples as saying anything more or less than P-values and compatibility intervals (CI) are particularly valuable for analyses (like general badness-of-fit tests) in which one wants such weak specification of (constraints on) the underlying model family that likelihood-based (including Bayesian) analyses are no longer tractable or reliable. But even when the specification allows one to use them all, I think they visualize different aspects of the model-data relation, so I see them as complementary rather than competitive. Specifically, a P-value function displays an unconditional absolute relation for each model in the graphed family (showing the percentile location of the test statistic in a distribution derived from the model), and in the one-sided case becomes a limiting CDF. In contrast, a likelihood function displays changes in the relation across the model family, and so corresponds to a derivative (or limiting density, if normalized to integrate to 1). In these cases they are functions of each other (at least under typical approximations) and so some say contain the same information; while that may be true in some strict math sense, they can convey different information to a human viewer (at least to this one).

        • So here’s a thing you can do: take a normal distribution and chop out a part of the sample space, say, the interval between -1 and 3. Boom! The p-value function no longer coincides with the integrated likelihood. The former involves an integration over the sample space and the latter involves an integration over the parameter space, and while the parameter space is still the reals the sample space has a bite taken out of it. (In fact, I just blogged all about this.)
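
          A rough numerical sketch of that construction (with my own choices throughout: a single observation, sigma = 1, a flat prior, and crude grid integration; none of the numbers are from the linked post), comparing the one-sided p-value function with the flat-prior posterior tail probability once the interval (-1, 3) is removed from the sample space:

            # Rough sketch of an "excised" normal: X ~ Normal(mu, 1) with the interval
            # (A, B) = (-1, 3) removed from the sample space and the density renormalized.
            # For the ordinary normal the one-sided p-value function and the flat-prior
            # posterior tail probability coincide; excising part of the sample space
            # breaks that coincidence. All numeric choices here are illustrative.
            import numpy as np
            from scipy.stats import norm

            A, B = -1.0, 3.0  # the excised interval of the sample space

            def norm_const(mu):
                """Probability mass left after removing (A, B) from Normal(mu, 1)."""
                return 1.0 - (norm.cdf(B - mu) - norm.cdf(A - mu))

            def density(x, mu):
                """Renormalized sampling density at x (zero inside the excised interval)."""
                inside = (x > A) & (x < B)
                return np.where(inside, 0.0, norm.pdf(x - mu) / norm_const(mu))

            def p_lower(mu, x_obs):
                """One-sided (lower-tail) p-value function: P(X <= x_obs) under mu."""
                if x_obs <= A:
                    return norm.cdf(x_obs - mu) / norm_const(mu)
                # x_obs >= B: subtract the mass of the removed interval
                return (norm.cdf(x_obs - mu) - (norm.cdf(B - mu) - norm.cdf(A - mu))) / norm_const(mu)

            def posterior_tail(mu0, x_obs, lo=-30.0, hi=30.0, n=20001):
                """Flat-prior posterior P(mu <= mu0 | x_obs), by crude grid integration."""
                mus = np.linspace(lo, hi, n)
                like = density(x_obs, mus)     # likelihood as a function of mu
                post = like / like.sum()       # normalize on the grid
                return post[mus <= mu0].sum()

            x_obs = -1.0  # the observed value discussed in this subthread
            for mu0 in [0.0, 1.0, 2.0]:
                print(f"mu0 = {mu0}: lower-tail p = {p_lower(mu0, x_obs):.4f}, "
                      f"flat-prior P(mu <= mu0 | x) = {posterior_tail(mu0, x_obs):.4f}")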

        • That article was great, by the way. I’ve been sick a few days, so I didn’t get around to leaving comments on it. In my opinion the SEV-based inference made no sense: you see x = -1 and then decide that the mu parameter is almost certainly positive? Just an obviously wrong conclusion.

        • Thanks Corey, I think I see your point, but I’m left unclear about the meaning of the coincidence in the very regular cases in which it occurs – cases which are routinely assumed in practice (even if they don’t routinely hold in practice). It looks to me like the conversion of a refutational measure (the P-value) into a support measure (the integrated likelihood) follows from the restriction to these cases, and I was puzzling over what this says about these common models of refutation and support. Maybe it’s only a simple conditional conclusion, like: When you are assured of their equality as in these cases, you needn’t be too concerned about confusing the two concepts of evidence; but when you can’t you had better keep them distinct.

        • Sander: I think it goes deeper. In Corey’s example the SEV calculation shows that seeing x=-1 “refutes” (at 99%tile) the idea that mu is less than 0.

          That is clearly wrong.

          When p values are based on likelihood calculations, their “goodness” seems to come from the fact that they’re Bayesian calculations in disguise, and seems to be due to the parameter space and the sample space being identical, and hence you can “secretly” integrate over the parameter space without admitting it because integrating over the sample space is the same integral.

          However, when you break that symmetry, the frequentist inference falls apart, as shown in Corey’s example. It makes no sense to say that x = -1 makes you virtually sure that mu is greater than 0.

          Of course there are frequentist methods that don’t involve integrating likelihoods, but for those that do, it seems obvious that they work because they’re Bayesian, not the other way around.

        • Daniel, I want to be clear that I’m not saying “that is clearly wrong”. Like, I affirm that my opinion is that the situation is as you describe, but I’m not willing to take the step from “this is what my gut says and also my considered opinion” to “this ought to be obvious to everyone”.

        • Sorry Daniel, but I don’t see it that way at all:

          As a preliminary, I don’t subscribe to and hence did not mention the Mayo-Spanos severity interpretation, or for that matter any behavioral or decision interpretation, so I’m not sure why that was brought in.

          Instead, I use an unconditional version of the continuous Fisherian interpretation, in which the observed p is a transform of a particular directional discrepancy d from the tested model; specifically, 100p is the upper-percentile location of d in a distribution derived from the tested model. That model includes every assumption needed to make the random P uniform (maxent) on (0,1), whether recognized or not (e.g., no coding errors, no fraud, no hacking up or down). The Shannon, self-information, or surprisal transform s = -log_2(p) then is an unconditional measure of the information against this model, in units of bits.
          I emphasize that s measures only refutational information: small S-values cannot logically provide unconditional evidence for the model, because many other models will fit as well or maybe better according to the same d. But no alternative model is needed for this information measurement.

          This unconditional refutation logic (compatibility interpretation) for P-values is nothing like the support logic in likelihood or Bayesian accounts: The latter need to restrict the model family enough to allow specification of alternatives over which the likelihood function and prior and posterior distributions will be defined. That means they condition out uncertainty about unmodeled assumptions (like no-posterior-hacking) and so are prone to producing overconfident inferential statements. That overconfidence also plagues the usual conditional interpretations of P-values and CIs (e.g., null-rejection, confidence, and coverage claims) but is not intrinsic to the latter, and is removed by backing off to the unconditional compatibility interpretation given above (where the latter is formalized using the amount of information against models, as measured by S-values).

          Bottom line: Valid (uniform) P-values have (through S-values) a “goodness” sorely lacking from likelihood-based measures, in the form of an unconditional interpretation, and a simplicity they lack in terms of not needing a formal restriction of the model space. But then, a P value is addressing a different question from likelihood: It is measuring absolute badness of fit in the D-direction (in the usual stat tradition, woefully misnamed “goodness of fit”), whereas likelihood functions and posteriors are comparing models within an explicit domain according to a different criterion entirely. Given their profound foundational differences, it seems to me rather intriguing that the most commonly assumed limiting case in stat history brings them into some kind of numerical agreement.

        • Sander: I brought up SEV because it’s part of the calculation that Corey did, not because I was claiming that you subscribe to its validity or anything like that.

          I don’t have too much problem with p value as measure of discrepancy of a particular model. I’d just say that it’s scientifically meaningful when you have a *real* motivated model that predicts frequencies of occurrence. For example the frequency with which a certain atom emits photons of a certain wavelength under various excitation regimes is… or the frequency with which a randomly chosen person from the population will make more than X dollars per year is… or the frequency with which a randomly chosen photo of the surface of the earth will be desert is…

          p values do exactly what you are saying: they measure whether given data could plausibly have come from a particular random number generator. Unfortunately too many of the cases that occur assume this “random number generator” process when such an assumption is very poor scientifically indeed.

          > I want to be clear that I’m not saying “that is clearly wrong”

          SEV, like Frequentism in general, isn’t even in the category of things which could be wrong.

        • I guess I’d add that all your caveats about needing to make the p value uniformly distributed under the model… mean that I’m particularly ok with p values associated with quantiles of a large dataset. Because there, at least, you have real data and so a real empirical CDF that is shaped like whatever the stable frequency properties of your data are (assuming there are stable frequency properties).

          “Did value X come from whatever model generated dataset of 10000 observations D? Its p value is 0.0003 or 0.9997, so probably not” is a meaningful way to screen for “something unusual is happening here.” I have yet to find another meaningful way to use p values in real-world problems.
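
          That screening use is easy to sketch. A toy version (the lognormal reference data and the 0.001 cutoff are my own, purely for illustration) that places a new value in the empirical CDF of a large dataset and flags the extreme tails:

            # Toy sketch of the screening idea above: place a new value in the empirical
            # CDF of a large reference dataset and flag it if it lands in either extreme
            # tail. The reference data and the 0.001 cutoff are made up for illustration.
            import numpy as np

            rng = np.random.default_rng(0)
            reference = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)  # stand-in for real data

            def empirical_tail_p(x_new, data):
                """Two-sided empirical tail probability of x_new relative to data."""
                frac_below = np.mean(data <= x_new)
                return 2 * min(frac_below, 1 - frac_below)

            for x_new in [20.0, 55.0, 120.0]:
                p = empirical_tail_p(x_new, reference)
                verdict = "something unusual?" if p < 0.001 else "looks ordinary"
                print(f"x = {x_new:6.1f}: empirical two-sided p ~ {p:.4f} ({verdict})")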

        • Daniel: Seems I can’t see a respond button for your actual reply to my long comment, so I’ll reply here…

          OK, we don’t really disagree in theory and maybe not in practice: P is a check on an assumed data generator; I just like to emphasize that while being on (0,1) makes it seem intuitive, in reality it’s very poorly scaled, and transforming to s = -log(p) decompresses P-values to an equal-interval scale with additivity over independent statistics, both desirable in a measure.

          I also like to emphasize that the assumptions of that tested data generator (model) are far more extensive than usually listed, and P-values can be more sensitive to violations of some of those (like hacking) than to the explicitly targeted hypothesis that most books and papers obsess about. Same though for all stats including pure-likelihood and Bayes so why pick on P? Well because of hacking around 0.05. But I bet if likelihood ratios had become the standard there would have been hacking to get past whatever junk cutpoint for that got adopted as the publication criterion.

          Finally, it’s people, not P-values, that choose to start with bad models (by not preliminarily subjecting their models to contextual checking). And again they can and do make the same bad choices when using Bayesian or other methods. As often noted, P-values are just a first pass at problem detection. Unfortunately they are often the last pass as well; I find it hard to get researchers to look at the basic lack-of-fit tests available in their software, let alone more. Take away the P-value as some want, and we’d lose even that much of a check (as weak and misinterpreted as it is).

        • > When you are assured of their equality as in these cases, you needn’t be too concerned about confusing the two concepts of evidence; but when you can’t you had better keep them distinct.

          This is one of the ideas I’d like people — especially those inclined to ¯\_(ツ)_/¯ at Bayes vs. classical freq — to take away from my post.
