So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing

Steve Ziliak points me to this article by the always-excellent Carl Bialik, slamming hypothesis tests. I only wish Carl had talked with me before so hastily posting, though! I would’ve argued with some of the things in the article. In particular, he writes:

Reese and Brad Carlin . . . suggest that Bayesian statistics are a better alternative, because they tackle the probability that the hypothesis is true head-on, and incorporate prior knowledge about the variables involved.

Brad Carlin does great work in theory, methods, and applications, and I like the bit about the prior knowledge (although I might prefer the more general phrase “additional information”), but I hate that quote!

My quick response is that the hypothesis of zero effect is almost never true! The problem with the significance testing framework–Bayesian or otherwise–is in the obsession with the possibility of an exact zero effect. The real concern is not with zero, it’s with claiming a positive effect when the true effect is negative, or claiming a large effect when the true effect is small, or claiming a precise estimate of an effect when the true effect is highly variable, or . . . I’ve probably missed a few possibilities here but you get the idea.

In addition, none of Carl’s correspondents mentioned the “statistical significance filter”: the idea that, to make the cut of statistical significance, an estimate has to reach some threshold. As a result of this selection bias, statistically significant estimates tend to be overestimates–whether or not a Bayesian method is used, and whether or not there are any problems with fishing through the data.
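
To make the significance-filter point concrete, here is a minimal simulation (the true effect, standard error, and number of studies are made-up numbers, chosen only for illustration): among estimates that clear p < 0.05, the magnitudes are exaggerated and some even have the wrong sign.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_effect = 0.1    # small but nonzero true effect (made-up value)
se = 0.5             # standard error of each study's estimate (made-up value)
n_studies = 100_000

# Each "study" reports the true effect plus sampling noise.
estimates = rng.normal(true_effect, se, n_studies)
pvals = 2 * stats.norm.sf(np.abs(estimates / se))
significant = estimates[pvals < 0.05]

print(f"true effect:                         {true_effect:.2f}")
print(f"mean of all estimates:               {estimates.mean():.2f}")            # roughly unbiased
print(f"mean magnitude of significant ones:  {np.abs(significant).mean():.2f}")  # exaggerated
print(f"significant estimates w/ wrong sign: {(significant < 0).mean():.2%}")    # sign errors
```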

Bayesian inference is great–I’ve written a few books on the topic–but, y’know, garbage in, garbage out. If you start with a model of exactly zero effects, that’s what will pop out.

I completely agree with this quote from Susan Ellenberg, reported in the above article:

You have to make a lot of assumptions in order to do any statistical test, and all of those are questionable.

And being Bayesian doesn’t get around that problem. Not at all.

P.S. Steve Stigler is quoted as saying, “I don’t think in science we generally sanction the unequivocal acceptance of significance tests.” Unfortunately, I have no idea what he means here, given the two completely opposite meanings of the word “sanction” (see the P.S. here.)

P.P.S. Mark Liberman informs me that “sanction” is an example of an auto-antonym. The linked Wikipedia page gives several other examples, including “dust” and “oversight.” We could also add the expression that in polite company is called “effing A,” as it represents a strong emotion but can be strongly positive or strongly negative.

25 thoughts on “So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing”

  1. The nice thing about hypothesis tests OTHER than those based on p values, though (including non-Bayesian AIC, factor, and others) is that they are based on comparative logic. Comparing the relative fit of two models through likelihood can be informative (see the sketch at the end of this comment). We have to always keep in mind our assumptions, but this is true of any procedure, as you point out. Every model is "wrong", including those with nonzero effect parameters. When you examine the posterior distribution, you implicitly use your eye to do exactly the same comparisons you could do with a hypothesis test. You're comparing posterior values to one another. None of these values are "true" in any sense, but it is still useful to compare them.

    I think when you say that the real question is whether an effect is positive or negative, that's certainly often a valuable question, but does nothing to call into question the usefulness of Bayesian hypothesis tests. Bayes factors or posterior odds for effect sign are perfectly plausible and useful. In other cases, null models are useful to compare against.

    You're stating the case too strongly when you say Bayesian hypothesis tests are "just as bad". Their comparative logic makes them much more useful. P values, on the other hand, test the null in isolation, which is of dubious usefulness.
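
    A minimal sketch of the kind of comparative fit described above, with simulated data and plain Gaussian log-likelihoods/AIC (the data, the model pair, and all numbers are hypothetical, just to make the comparison concrete):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data with a modest true slope (all values are illustrative only).
    n = 50
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)

    def gaussian_aic(y, X):
        """AIC of a linear model fit by least squares with Gaussian errors."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = rss / len(y)                        # MLE of the error variance
        loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
        k = X.shape[1] + 1                           # mean parameters plus the variance
        return 2 * k - 2 * loglik

    X0 = np.ones((n, 1))                    # intercept-only ("null") model
    X1 = np.column_stack([np.ones(n), x])   # model that includes the predictor

    print(f"AIC, intercept only:  {gaussian_aic(y, X0):.1f}")
    print(f"AIC, with predictor:  {gaussian_aic(y, X1):.1f}  (lower = better penalized fit)")
    ```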

  2. I think it's clear from the way Bialik deployed the quote that Stigler means sanction in the sense of "approve."

    But yes: you and my mom are right.

  3. It is certainly possible to do BHT with an approximate point null, where the prior on the null is spread out to a degree that is consistent with other information that you have. And, as Berger and Delampady show in their 1987 paper, an exact point null is a good approximation to an approximate point null under some circumstances (specified in the paper); a small numerical sketch of this appears at the end of this comment.

    That said, the point of hypothesis testing is, presumably, to decide on what to do, what action to take. No reasonable person would do hypothesis testing, say "it looks like the null is false," and stop there. (A lot of unreasonable people do this, however.)

    That is, the proper way to frame this is in terms of decision theory, with losses as well as probabilities. p-values are completely useless and in fact quite wrong in this context. BHT with *appropriate* priors (see above) is just the first necessary step in applying decision theory, to be followed by evaluating the expected posterior loss for each action under consideration.

    So I don't agree that BHT is just as bad as p-values and classical significance tests. As the first step towards a proper application of decision theory, it is much, much better than either of them. Neither p-values nor classical significance tests are appropriate foundations for making decisions.
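
    A quick numerical check of the exact-versus-approximate point-null point above (an illustrative sketch with made-up numbers, not the Berger and Delampady derivation): for normal data, the Bayes factor computed with an exact spike at zero is close to the one computed with a narrow "no significant effect" prior, so long as that prior is tight relative to the standard error.

    ```python
    import numpy as np
    from scipy import stats

    # Illustrative numbers only: observed estimate, its standard error, and prior widths.
    xbar, se = 0.35, 0.20   # observed effect estimate and its standard error
    tau_alt = 1.0           # sd of the prior on the effect under the alternative
    eps = 0.02              # sd of a narrow "no significant effect" prior around zero

    def marginal(sd_prior):
        """Marginal density of xbar when the effect has a N(0, sd_prior^2) prior."""
        return stats.norm.pdf(xbar, 0.0, np.sqrt(se**2 + sd_prior**2))

    bf_exact = stats.norm.pdf(xbar, 0.0, se) / marginal(tau_alt)   # spike exactly at zero
    bf_approx = marginal(eps) / marginal(tau_alt)                  # narrow "approximate" null

    print(f"BF(null vs alt), exact point null:       {bf_exact:.3f}")
    print(f"BF(null vs alt), approximate point null: {bf_approx:.3f}")
    # With eps much smaller than se the two nearly coincide; as eps approaches se,
    # the exact-null approximation breaks down.
    ```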

  4. My quick response is that the hypothesis of zero effect is almost never true! The problem with the significance testing framework–Bayesian or otherwise–is in the obsession with the possibility of an exact zero effect.

    Ooh, I am so using that argument the next time I write about SSVS.

  5. Hey Andrew — tx for the heads up — lots to reply to, most of it not about statistics…

    First, I think you may be getting a teensy bit overwrought here again, "hating" on that one quote. Carl is a pop science writer and when he (or guys like him) call me, I feel very tenderly toward them because my own father was a journalist (spent his whole career in magazine and newspaper writing and editing). So I always talk to these guys and write back carefully worded emails (like Don Berry also apparently did)…. and inevitably the results are disappointing: I get at most one sentence in the final article. Journalists do not ask the experts that they interview for their stories to preview their work (it would be impractical, working on deadlines as they do) so unfortunately this just sort of goes with the territory. It happened to me again the other day, when I spent at least 30 minutes on the phone with a nontechnical sports writer for the Columbus Dispatch who was one of the many to call me before the NCAA tourney got underway. The final article,

    http://www.dispatch.com/live/content/local_news/s

    barely mentions me and does not include all the wonderful, intuitive, easy-to-understand math stuff I gave him. Sigh.

    *Anyway*, I share your concern about "the point null is *never* true", but I agree with the previous commenters that the Bayesian approach still scores over the frequentist approach here since the point null is *not* really required in a Bayesian setting, and offers a way of directly comparing evidence that is model-based, rather than design-based. I also think you're reading much more into Carl's remark than he intended; he's aiming at a mathematically more sophisticated audience than that reading Post Dispatch sports stories, but this is still a "pop science" article, IMHO. Moreover the FDA has decided that point nulls are here to stay, so like them or not, the thing for us to do as applied Bayesians is work within that system and improve the existing science as best we can. That's what the Berrys and I and Team Ibrahim and DJS and Beat Neuenschwander and a whole lot of Bayesians are trying to do right now. I think we're on the right track; CDRH (devices) is on board, CBER (safety) is coming along, and even CDER (drugs) is changing.

    PS So where's the April Fool's Day blog this year? I figured March 31 for you is like the day before the NCAA tourney starts for me. ;) I was gratified to see another journal editor (JCGS) copied my idea of using one of your earlier AFD blogs as the genesis of a discussion paper; I also think my journal got the better article ;)

  6. I think it is mistaken to suppose one doesn't want and need a way to evaluate the inferences warranted and unwarranted by data quite apart from decisions that might be taken. One first, and separately, needs to know what the data indicate about the state of affairs in order to decide what to do about it. Of course, any "inference" could be described as a decision: e.g., I decide that the data indicate radiated water spilling into the ocean, but that is merely a verbal point, and not what people mean when they say everything should be reduced to decision-making. I and others regard that conception as abrogating the fundamental purpose for employing evidence to find out about what is the case, rather than what is best to do—where things like utilities should enter. Even where an actual decision is being contemplated, it is undesirable to mix the evidential appraisal with criteria relevant for judging decisions, and any account that precludes doing so is, I would argue, inadequate for scientific inference.

  7. Brad: Believe Andrew has a point that applies to many technical Bayesian publications as well.

    OK "tackle[s] the probability that the hypothesis is true head-on" but rarely does this result in a highly/widely credible posterior – for anything.

    Bayesian approaches provide a lot that's pragmatic (purposeful), but the perhaps formally understandable "salesman's puffing" about the value of _the_ posterior (for everyone?) that one gets – well, maybe it's time to start losing that.

    (Thinking of using the reworked quote "Every time I hear the word posterior I want to reach for my pistol" in a talk promoting Bayesian approaches some day)

    I tried to clarify some of these concerns in "Two Cheers for Bayes" in Controlled Clinical Trials back in the '90s.

    Anyway, I might not be understanding what Andrew is getting at and I also was waiting for the April Fools' post.

    K?

  8. I'd also like to hear your thoughts on Hoijtink's work on informative hypothesis testing. I heard him give this talk last month and found it interesting.

  9. April, Frank:

    I glanced at Hoijtink's slides. The method could be useful but I don't quite see the point. I'd rather just fit a model directly and then get inferences about any quantities of interest and not worry about hypotheses.

  10. I thought that I made it clear: Sure, you can come to an opinion about what state of nature is likely to be true (specified by a posterior distribution), that's fine. But what are you going to DO about it? Publish a paper? Reject a drug? Accept a medical device under specific circumstances?

    Even if you are only interested in "what is the case," you are still going to take some action about it. Like, publishing a paper that says "that is the case," thus risking (or enhancing) your reputation if ultimately you turn out to be wrong (or right).

    There's always a decision involved, and it's one that has consequences.

  11. Bill:

    I agree that decision making is important, and we discuss it in BDA, second edition (chapter 22, I believe). But I don't think point hypotheses are needed to do Bayesian decision analysis.

    Let me state this another way: I can well believe that Bayesian inference with point hypotheses, in the hands of a skilled practitioner, can be a useful tool in decision analysis. But you can definitely do Bayesian decision analysis using straight BDA tools–no point hypotheses, no Bayes factors, just posterior probabilities and loss functions. The decision is the decision of what to do, not the decision of whether to accept or reject a null hypothesis.
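
    A minimal sketch of the "posterior plus loss function" workflow described above (the posterior, the two actions, and the loss numbers are all hypothetical): no point null, no Bayes factor, just expected loss under each action.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical posterior draws for a treatment effect (e.g. from some fitted model).
    posterior_draws = rng.normal(loc=0.8, scale=0.6, size=100_000)

    def loss(action, effect):
        """Made-up loss function: approving a harmful treatment is costly,
        rejecting a beneficial one forgoes its benefit."""
        if action == "approve":
            return np.where(effect < 0, 10.0 * (-effect), 0.0)  # penalty scales with harm
        return np.where(effect > 0, effect, 0.0)                # forgone benefit if rejected

    expected_loss = {a: loss(a, posterior_draws).mean() for a in ("approve", "reject")}
    print(expected_loss)
    print("decision with lowest expected loss:", min(expected_loss, key=expected_loss.get))
    ```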

  12. What you're missing is that if you have an account that interprets data by using so and so's loss function to go straight to a decision (maximizing utility or whatever), then there is NO distinct evaluation of evidence! You rob others from putting in their loss functions—who says yours is best? The drug company has its loss function, and I have mine, and the result is that evidence simply drops out of the picture! The debate that was thought to be about, say the existence or risk of a certain type, becomes a disagreement between personal losses. This is a highly popular view among personalists and all manner of radical relativists, post-modernists and the like. It's old as the hills, and dangerous.

  13. Mayo:

    I agree that it's good practice to separate data from inference and to separate inference from decision analysis. Each step should be transparent: the client should be able to see where the data came from, to see how the inferences were derived from data, and to see how the decision recommendations were derived from the inferences. Performing a decision analysis–exploring the implications of your inference under a hypothesized loss function–can be useful, and the client can and should always be able to go back and perform a different decision analysis if so desired.

    I never recommended otherwise, nor am I a personalist, radical relativist, post-modernist, etc. I do like to think of myself as old as the hills and dangerous, though!

  14. Andrew: I'm pretty much in agreement with your last comment; but I do see a role for point null hypotheses in decision theory *if* they are understood to be approximations to a state of nature "no significant effect", and compatible with the Berger/Delampady comment (in the sense that the actual prior we have on "no significant effect" can be approximated adequately by a point null), and if that state of nature is one that you need for the decision.

    In addition, I think that various analyses involving Bayesian point null testing do much to undermine the rationale for using classical two-sided tests, e.g., p-values, for measuring the evidence against the "no significant effect" null. I've always found Dennis Lindley's 1957 paper to be quite compelling.

    [Just discovered that the spell checker in this version of Firefox doesn't have 'analyses' in its database!]

  15. I'm so glad you say you agree with me: "that it's good practice to separate data from inference and to separate inference from decision analysis. Each step should be transparent…" I hope your readers take note, for it is at odds with what Jefferys seemed to be saying, and certainly at odds with Ziliak and McCloskey. I never intended to place you under the umbrella of "a personalist, radical relativist, post-modernist," but had in mind others, notably the "cultists" Z & M. What they fail to realize, ironically, among much else, is that if one takes seriously their idea that losses should be introduced even in interpreting what the data say, then the drug company found at fault was actually doing just what Z&M recommend and endorse! After all, if there's no fact of the matter, but only an "oomph" feeling, based on the losses of the data interpreter, then the charge of ignoring or downplaying the evidence goes by the wayside. We are not old, Andrew, but are very dangerous (as are all sham-busters).

  16. Unfortunately, a Bayesian two-sided test (e.g., in Normal iid testing) can be wrong with high or maximal probability: e.g., keep collecting data until the point null is excluded from the confidence interval. Berger and Wolpert concede this in their book on the Likelihood Principle. So it's not at all clear superiority has been shown over the frequentist error statistical (two-sided) test, where this could not happen.
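
    A minimal simulation of the "keep collecting data until the interval excludes the null" procedure described above (the sample-size caps, starting point, and number of replications are arbitrary choices): with a true effect of exactly zero, the running 95% interval still excludes zero somewhere along the way far more often than 5% of the time, and the rate keeps climbing as the cap grows.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    def null_excluded_at_some_n(max_n, z=1.96, min_n=10):
        """True if the running 95% CI for the mean ever excludes 0 (true mean 0, known sd 1)."""
        x = rng.normal(0.0, 1.0, max_n)        # data generated under the point null
        n = np.arange(1, max_n + 1)
        running_mean = np.cumsum(x) / n
        excluded = np.abs(running_mean) > z / np.sqrt(n)
        return excluded[min_n - 1:].any()

    for max_n in (100, 1000, 5000):
        rate = np.mean([null_excluded_at_some_n(max_n) for _ in range(2000)])
        print(f"stopping cap n={max_n:4d}: true null excluded at some point in {rate:.1%} of runs")
    ```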

  17. Mayo:

    I'm not talking about EXACT point nulls.

    I'm talking about APPROXIMATE point nulls.

    And I have specifically said that the APPROXIMATE point null has to be adequately approximated by the EXACT one, for calculation.

    If you keep taking data and taking data and taking data, then the EXACT point null will no longer be an adequate approximation to the APPROXIMATE point null at some point. So then you have to choose an appropriate "no significant effect" prior.

    But, I also mentioned "various" analyses of the two-sided frequentist tests. All you have to do is to put a point prior on the alternative exactly where the data happen to fall. That is about as supportive of the alternative hypothesis as you can get (and it is cheating). And even if you do this, the p-value (one- or two-sided) still significantly overstates the evidence against the (approximate) point null.

    Furthermore, the Lindley paradox still applies, even if you choose an approximate point null with a fixed "no significant effect" prior, whereas we know that the classical two-sided test of the point null (even if you let it be "fixed and approximate") is guaranteed to reject eventually if you take enough data (this will not happen with the Bayesian test with a fixed "no significant effect" prior).
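
    A minimal numerical illustration of the Lindley-type behavior discussed above (the prior scales and sample sizes are arbitrary choices): hold the z-statistic at 1.96, so the two-sided p-value stays near 0.05, and the Bayes factor for a point null against a N(0, tau^2) alternative swings toward the null as n grows; it also grows when the prior on the alternative is made more diffuse.

    ```python
    import numpy as np
    from scipy import stats

    z = 1.96                       # fixed test statistic, so the two-sided p-value stays ~0.05
    p = 2 * stats.norm.sf(z)

    def bf_null(n, tau):
        """BF for H0: theta = 0 vs H1: theta ~ N(0, tau^2); normal data, known sd = 1."""
        se = 1.0 / np.sqrt(n)
        xbar = z * se              # data chosen so the z-statistic is exactly 1.96
        m0 = stats.norm.pdf(xbar, 0.0, se)
        m1 = stats.norm.pdf(xbar, 0.0, np.sqrt(se**2 + tau**2))
        return m0 / m1

    print(f"two-sided p-value held fixed at {p:.3f}")
    for n in (10, 100, 10_000, 1_000_000):
        print(f"n={n:>9,}: BF(null/alt) = {bf_null(n, 1.0):7.1f}"
              f"   with a 10x wider prior: {bf_null(n, 10.0):7.1f}")
    ```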

  18. Hi Andrew & others,

    I agree that point nulls may be a priori improbable in *observational* studies. However, in *experimental* studies, where all factors are tightly controlled by design, point nulls do make sense. Also, many theories specifically state a point null to be true (a law, or something that is invariant across conditions). For instance, Bowers, Vigliocco, and Haan (1998) have proposed that priming depends on abstract letter identities. Hence, their account predicts that priming effects are equally large for words that look the same in lower- and uppercase (e.g., kiss/KISS) or that look different (e.g., edge/EDGE). This account does not predict that the experimental effect will be small; it predicts that it is completely absent. In fact, for theoretical purposes it often does not matter how large an effect is, as long as it is reliably detected. For instance, if priming effects were larger for words that look the same in lower- and uppercase (e.g., kiss/KISS) than for those that look different (e.g., edge/EDGE), this would undermine the hypothesis that letters are represented abstractly, no matter whether the effect size was 100 ms or 10 ms. Of course, it is much more difficult for a 10-ms effect to
    gain credence in the field, but this issue is orthogonal to the argument. Should the 10-ms effect be found repeatedly in different laboratories across the world, the effect would at some point be deemed reliable and considered strong evidence against any theoretical account that predicted its absence. [A more elaborate response to the issue of point nulls is on p. 175 of http://www.ejwagenmakers.com/2010/IversonEtAl2010]

    Cheers,
    E.J.
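
    A minimal sketch of how one might put a number on this kind of point-null prediction (the data are simulated and the BIC approximation to the Bayes factor is just one convenient choice, not anything from the paper linked above): compare a model in which same-case and different-case priming effects are equal against one in which they differ.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Simulated priming effects (ms) for the two item types: made-up numbers with no
    # true difference, standing in for the kiss/KISS vs edge/EDGE comparison.
    same_case = rng.normal(30, 20, 40)
    diff_case = rng.normal(30, 20, 40)
    y = np.concatenate([same_case, diff_case])
    group = np.concatenate([np.zeros(40), np.ones(40)])

    def bic(y, X):
        """BIC (up to a constant) of a Gaussian linear model fit by least squares."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        n, k = len(y), X.shape[1] + 1              # mean parameters plus the error variance
        return n * np.log(rss / n) + k * np.log(n)

    X0 = np.ones((len(y), 1))                       # H0: one common priming effect
    X1 = np.column_stack([np.ones(len(y)), group])  # H1: effect differs by case type

    bf01 = np.exp((bic(y, X1) - bic(y, X0)) / 2)    # BIC approximation to BF(H0 / H1)
    print(f"approximate BF in favor of 'no case-type difference': {bf01:.1f}")
    ```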

  19. Bill:

    To me, the Lindley paradox falls apart because of its noninformative prior distribution on the parameter of interest. If you really think there's a high probability the parameter is nearly exactly zero, I don't see the point of the model saying that you have no prior information at all on the parameter. In short: my criticism of so-called Bayesian hypothesis testing is that it's insufficiently Bayesian.

    EJ:

    Thanks for the comment. I want to write more on this, but just for now let me say that the issue is not just that effects are never zero but that effects vary. A tiny effect of size "epsilon" could well correspond to an effect of -epsilon for some people and +2*epsilon for others. In which case the idea of there being a single effect size is off track.

  20. Possibly a nice case study for this post.
    For unrelated reasons I read Sander's paper below and to _me_ it seemed to critically work through some of what is being raised here in the context of evaluating evidence for effects of vitamin E.
    Additionally he seems to have put the point I was trying to raise perhaps somewhat more clearly (quote below).
    Paper:
    S. Greenland, "Weaknesses of Bayesian model averaging for meta-analysis in the study of vitamin E and mortality," Clinical Trials, February 2009; 6: 42-46.

    Quote:
    "Bayesian analysis may mislead readers if it objectifies probability statements. Objectification invites deceptively unconditional claims about probabilities of hypotheses, such as BWN’s claim that ‘Vitamin E intake is unlikely to affect mortality regardless of dose.’ This statement sounds as if the subjective conclusion is a discovered biological fact, when it is only a psychological fact, a posterior belief about the null hypothesis based on priors that others (such as myself) find objectionable."

    K?

  21. Andrew: I agree that the prior on the alternative hypothesis is critical (and everyone who seriously thinks about this knows this).

    But still, the essence of Lindley's paper and of this issue is how we compare a (relatively) sharp hypothesis (regarded as a "state of nature" that will affect our decisions) to a (relatively) vague hypothesis (ditto).

    Yes, you have to think seriously about the prior on the alternative hypothesis. Everyone knows this.

    But your original objection was that exact point null hypotheses are implausible. We know this. We know how to evaluate whether they are a reasonable approximation to a plausible (tight but not exact) null.

    So, I am not sure exactly what you are arguing here. Are you arguing that exact point null hypotheses are almost always wrong? I agree. Are you arguing that you have to be careful about how you assign a prior on the alternative hypothesis? I agree also.

    But the Lindley paradox goes further. It says, assign your priors however you wish. You don't get to change them. Then take data and take data and take data… There will be times when the Bayesian test will give the null a posterior probability of at least (1-alpha), where you chose alpha very small in advance, and at the same time the classical test will reject at a significance level alpha.

    This will not happen, regardless of priors, for the Bayesian test. The essence of the Lindley paradox is that "sampling to a foregone conclusion" happens in the frequentist world, but not in the Bayesian world.

    So, I don't understand why you say, "my criticism of so-called Bayesian hypothesis testing is that it's insufficiently Bayesian."

    What would make it more "Bayesian" to you? What is lacking in my analysis above?

  22. Bill:

    As I wrote in my discussion of Efron's paper, I suspect some of the differences have to do with what sorts of problems one is studying. I see the virtue of sharp hypotheses in astronomy–either that smudge on the image is a planet, or it's not–and in genetics, and in some other fields. In the problems where I work, in social and environmental sciences, sharp hypotheses of this sort don't come up at all.

    I'm not questioning the mathematics of Jeffreys and Lindley; I'm questioning the relevance of the problem they are trying to solve.

  23. "I completely agree with this quote from Susan Ellenberg, reported in the above article:

    'You have to make a lot of assumptions in order to do any statistical test, and all of those are questionable.'

    And being Bayesian doesn't get around that problem. Not at all."

    Would nonparametric statistics, mixed with frequentist or Bayesian techniques, get around that problem more?

    Justin
