
About a zillion people pointed me to yesterday’s xkcd cartoon.

I have the same problem with Bayes factors, for example this:

[screenshot: a table of verbal labels attached to Bayes factor values]

and this:

[screenshot: another table of verbal labels for Bayes factors, from Wikipedia]

(which I copied from Wikipedia, except that, unlike you-know-who, I didn’t change the n’s to d’s and remove the superscripting).

Either way, I don’t buy the numbers, and I certainly don’t buy the words that go with them.

I do admit, though, to using the phrase “statistically significant.” It doesn’t mean so much, but, within statistics, everyone knows what it means, so it’s convenient jargon.

P.S. Kruschke had a similar reaction.


  1. EJ Wagenmakers says:

    Andrew, what classification scheme do you propose? (assuming you set out to construct one, which you probably don’t want to do :-))
    I think the scale is somewhat arbitrary, but I do like the notion that BFs < 3 are not worth much; if this isn't made clear through the verbal label, people will make a big fuss about a BF of 2, which is the evidence that a single white ball gives for urn A (all white balls) versus urn B (50% white balls, 50% black balls).
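EJ’s urn arithmetic can be checked directly; a minimal sketch, with the urn compositions as stated above:

```python
# Bayes factor for one observed white ball:
# urn A holds only white balls, urn B is half white, half black.
p_white_A = 1.0   # P(white | urn A)
p_white_B = 0.5   # P(white | urn B)
bf = p_white_A / p_white_B
print(bf)  # 2.0, the "not worth much" level of evidence
```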


    • Andrew says:


      I think it depends on the example. My problem with the Bayes factor is that in practice it just about always depends crucially on conventional aspects of the prior distribution, for example weak priors on location parameters or whatever. In settings where the Bayes factor represents real probabilities, I’d just report the probabilities as is, without trying to give them words.

      For example, in what settings is 100:1 odds “strong” and in what settings is it “very strong”? It will depend on context. I just don’t see the words as adding anything to the numbers.
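The sensitivity Andrew describes can be sketched with a toy normal model (the observation, prior form, and scales here are all illustrative, not from the post): as the prior on the location parameter gets vaguer, the Bayes factor in favor of the alternative shrinks toward zero, whatever the data say.

```python
import numpy as np

# H0: mu = 0 versus H1: mu ~ N(0, tau^2), one observation x ~ N(mu, 1).
# Under H1 the marginal density of x is N(0, 1 + tau^2).
x = 2.0                                      # illustrative observation
m0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # marginal likelihood under H0

bfs = {}
for tau in (1.0, 10.0, 100.0, 1000.0):
    v = 1 + tau**2                           # marginal variance under H1
    m1 = np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    bfs[tau] = m1 / m0                       # BF in favor of H1

print(bfs)  # shrinks toward 0 as tau grows, favoring H0 regardless of x
```

The point: the verbal label attached to the resulting number would flip from "positive evidence" to "strong evidence against" purely by changing a conventional prior scale.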

      • Martha says:

        Yes — so much depends on context. Using one-size-fits-all rules avoids the responsibility to consider context.

        • Rahul says:

          Can we have field-specific guidelines? E.g. take the word “heavy”: 1 kilogram is heavy for a cellphone but light for a bicycle.

          Can there be some similar heuristic describing Bayes factors in, say, ESP studies versus cancer trials versus particle physics?

          • Martha says:

            I think even field-specific guidelines are too one-size-fits-all. One needs to take into account consequences of deciding one way or the other, the measure used, the study design, how many other inferences are done using the same data, etc. No ethical way to avoid case-by-case decisions, accompanied by the reasoning behind the decision.

            • Rahul says:

              I’m worried about topics where we shy away from formulating guidelines. If we say that each and every case must be decided manually using the expert’s discretion, it seems more of an art than a science.

              Context is fine, but in a scientific scenario one must ideally find a way to codify the variables in the context, right? Heuristics are what we’re trying for in these tables.

              • Martha says:

                “Context is fine, but in a scientific scenario one must ideally find a way to codify the variables in the context, right?”

                I say, “Wrong”! Good science is all about context. (See also my comment below about the Gigerenzer article.)

              • Rahul says:


                Here’s what I meant: e.g. take Apgar scores from medicine. Yes, there’s a lot of context & complexity in assessing the health of a newborn baby & sure the Apgar score cannot capture all of it. And I bet there are plenty of cases where the Apgar scores are wrong.

                Yet, there is still utility in formulating a context-specific heuristic metric like that. Medicine seems to have a lot of similar heuristics e.g. Glasgow Coma Scale, Bishop Score etc.

                I’m not trying to deny context. I’m trying to emphasize the utility of heuristics & quantitative guidelines.

              • Elin says:

                I think one important thing to consider is that something like Apgar is not testing a hypothesis or being used in the context of research. Remember, though, this is why it’s important that nurses (and doctors) understand statistics :).

              • Christian Hennig says:

                I think that both Rahul and Martha are right in some sense, and that this illustrates a basic paradox in science. We need both flexibility and some kind of standardisation. Science needs the point of view and input of the creative expert, but it aims at producing knowledge that is generally acceptable and reproducible, and it therefore needs to reflect on general rules and guidelines in order to be accessible for (pretty much) everyone.

              • Rahul says:

                @Christian Hennig:

                Excellently summarized!

              • Martha says:


                The Apgar score was developed in a specific context for a specific purpose, namely, to have a quick way of deciding when a newborn needs quick special medical attention. The development of guidelines was the goal.

                But not all scientific scenarios have the goal of developing guidelines to accomplish a specific aim in a specific context. The types of guidelines being discussed in this thread are of a more general type, not dependent on context, and not with regard to a goal that only makes sense in a specific context.

  2. Anonymous says:

    Imagine the Chairman of the Joint Chiefs of Staff reporting to the President during a nuclear standoff with the Soviets:

    Version 1: “Sir, if we repeated this nuclear standoff a million times then in only 4% of those confrontations would the Russians be as mad, or madder, at us than they are now. This represents statistically significant evidence that they are hopping mad.”

    Version 2: “Sir, odds are 24 to 1 we’re about to accidentally start a nuclear war.”
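The arithmetic behind Version 2, as a sketch: it takes the 4% at face value as a probability and converts it to odds, which is precisely the slippage the joke trades on (a p-value is not a posterior probability).

```python
# Converting a probability to odds against (illustrative; it treats
# the 4% tail probability as if it were a posterior probability).
p = 0.04
odds = (1 - p) / p
print(round(odds))  # 24, i.e. "24 to 1"
```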

    • Bill Jefferys says:

      Decision theory anyone? What’s the loss function for starting a nuclear war?

      • Chris G says:


        I will add that when it comes to fire-control decisions you really want to get the weighting of Type I and Type II errors right.

        • Anonymous says:

          When it comes to anything that matters you really want to get the weighting of type I and type II errors right. Why wait for a nuclear war?

          • Andrew says:


            Quantifying gains and losses, yes. See the decision analysis chapter of BDA for some examples. Type 1 and type 2 errors, no.

            • Anonymous says:

              Andrew – maybe I’m missing something but I’ve always thought of losses as just errors + cost. Maybe you’re getting at magnitude/sign errors as being more important? I would think it comes down to whether the problem is classification or estimation.

              • Andrew says:


                When I do decision analysis, the losses and gains are defined in the context of the problem. I’m not talking about things like squared error loss, I’m talking about losses such as dollars and lives. Again, I refer you to the decision analysis chapter of BDA.

              • Bill Jefferys says:

                I agree with Andrew’s comment (no further comments allowed to his comment so I have to comment on what Andrew wrote).

                I was explicitly thinking of the losses that Andrew mentioned. Not artificial “squared error” losses, but real losses that affect real people.

                Andrew: +1

          • Chris G says:

            > Why wait for a nuclear war?


      • Anonymous says:

        Just poking fun at the absurd suggestion that “Bayes Factors” are as bad as “statistical significance” for clear communication. You can have a lot of fun along these lines. For example:

        A girlfriend talks to her boyfriend about a potential pregnancy:

        Version 1: “sweetheart, if you had 100,000 girlfriends who were as late as I am, then 6% of the time they wouldn’t be pregnant. We cannot reject the hypothesis that I’m not with child. So don’t worry.”

        Version 2: “odds are better than 15 to 1 I’m pregnant”

  3. Matt Levinson says:

    I fully agree with the issues with these kinds of universal scales for statistics such as p values and Bayes factors. But surely you can’t mean categorically you don’t buy the numbers? In the hands of a knowledgeable analyst, and with all the myriad caveats and qualifications, a summary statistic such as a p value of 0.001 or a Bayes factor of 60 surely means something different than, say a p value of 0.35 or a Bayes factor of 2, respectively. Of course it doesn’t mean you’ve decisively proven the hypothesis you’re hoping is true, but it’s evidence for something. And that something could possibly be your hypothesis.

    • Andrew says:


      Sure, all numbers in statistics are model-based, and we almost never believe the model. One of my problems with p-values and Bayes factors is that they’re so indirect. The p-value, as we’ve discussed, is problematic as a data summary because it depends on what we would’ve done had the data been different. The Bayes factor is problematic because it is used to answer a question—What is the probability that model 1 or model 2 is true—that is typically meaningless.

      I have a lot less problem with numbers that are more direct answers to scientific or engineering questions, for example predictions or estimated averages or differences or parameters in a model.

      • Anonymous says:

        “What is the probability that model 1 or model 2 is true—that is typically meaningless.”

        This is simply bunk of high order.

        Your preferred method for evaluating a model, post-posterior checks, goes something like this. Take some “Data” and “Model” and get a posterior for mu. Call it P(mu|data, model). Then use observed values mu* to see if P(mu*|data, model) is small (or equivalently if simulations from P(mu|data, model) “look like” mu*). If they’re small then move on to a better Model.

        The thing is, it’s a trivial consequence of Bayes theorem that P(mu*|data, Model) is proportional to P(Model|mu*, data). In other words, all your preferred method does is reject models which have small Bayes factors conditional on ALL data.

        So the probability of a model is at least as meaningful as your post-posterior checks are.

        Once again you make the classic Statistician’s Fallacy: “if I can’t see it’s meaning, it must be meaningless”

        • Andrew says:


          As I’ve written in my articles on the philosophy of statistics, I recognize that my own philosophy and methods are incomplete.

          • Anonymous says:

            From Abraham Wald ( Bayes Solutions of Sequential Decision Problems):

            “In most applications, however, not even the existence of an a priori distribution can be postulated”

            What is it with Statisticians thinking “If I can’t do it, it can’t be done”?

          • Anonymous says:

            (diff anon) Andrew – are you allowing that posterior checks and bayes factors are mathematically (if not interpretationally) equivalent?

            • Andrew says:


              No, posterior predictive checks and Bayes factors are not mathematically equivalent. They have many differences. For one thing, PPC is based on a single model; Bayes factors requires 2 models. For another, PPC is not sensitive to certain aspects of the prior distribution that Bayes factors are sensitive to. PPC and Bayes factors are completely different.
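For readers following the exchange, a minimal posterior predictive check might look like the sketch below (a toy conjugate normal model; every number is illustrative). Note that, as Andrew says, it involves a single model and no alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)   # "observed" data, purely illustrative

# Single model: y ~ N(mu, 1), prior mu ~ N(0, 10^2). Conjugate posterior:
n, sigma, mu0, tau0 = len(y), 1.0, 0.0, 10.0
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + y.sum() / sigma**2)

# Posterior predictive check: simulate replicated datasets and compare a
# test statistic (here the sample maximum) with its observed value.
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)
y_rep = rng.normal(mu_draws[:, None], sigma, size=(4000, n))
ppp = (y_rep.max(axis=1) >= y.max()).mean()  # posterior predictive p-value
print(ppp)  # values near 0 or 1 would flag misfit for this statistic
```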

              • Anonymous says:

                original anon,

                Andrew you can say whatever you want. It’s a mathematical fact that the posterior P(mu*|Model, data), low values of which are being used to weed out bad Models, is proportional to P(Model|mu*, data).

              • Anonymous says:

                (diff anon again)

                @original anon, that may be so, but isn’t that like saying p-values are (often) proportional to the posterior probability of the null?

                In fact isn’t it analogous?

              • Anonymous says:

                Well there is a difference between posterior checks and the probability of the model conditional on ALL data. The difference arises when the data used to fit the model strongly confirms the model. In that case the model can be relatively weak at predicting mu* and still be ok.

                The more I look at this it’s clear that Bayes is doing the right thing here while the simple post-posterior check is not. Essentially what’s happening is that Bayes theorem is treating all data on an equal footing. You may think of Y as the dependent variable and x_1, x_2,… as the independent variables, but there’s nothing in principle stopping you from interchanging their roles. In essence Bayes picks a model which is good no matter what you choose as the independent and dependent variables.

                Something very similar happens when you look at the Bayesian version of Cross Validation. How you subset the data is inherently arbitrary. Intuitively, people recognize that leaving one data point out and using the rest to predict it isn’t a very good check by itself. In practice people usually cycle through and leave each data point out in turn and get an overall predictive error rate which includes the result of all those trials. In essence all data is on an equal footing. The Bayesian version again inherently treats all data points on the same footing and automatically does essentially the same thing.

                The bottom line is that Bayes is the answer. If it doesn’t appear to be the answer, then that’s just because you haven’t understood it well enough. You should get busy understanding it better. Anyone advocating ad-hoc solutions, which inevitably work some of the time but fail miserably in others, is just setting Statistics back in the long run versus what would be achieved by developing real Bayes further.

              • Nick Cox says:

                Here and elsewhere different people writing as Anonymous have to clarify who they aren’t. It would be easier for you, and us, if one of you changed your non-name! The handle “entsophy” appears to be available.

  4. Roger H says:

    Surprised you use the phrase “statistically significant.” It was all but banned in my former department, and it still makes me flinch inwardly whenever I read or hear it.

  5. Anonymous says:

    Hey Andrew, whenever you give a lecture and communicate your disdain for Bayes factors, you should show that clip from The Empire Strikes Back:

    C3P0: “The possibility of successfully navigating an asteroid field is approximately 3,720 to 1”

    Han Solo: “Never tell me the odds!”

  6. dmk38 says:

    I agree it is really unhelpful to use verbal “strength of attitude” renderings, which encourage the mistake of viewing statistics as an alternative to thinking rather than a tool for thinking.

    But I’m having trouble getting why you don’t like Bayes Factor.

    Do you feel this way about likelihood ratios in general? If a medical screening test has a “true positive” rate of 95%, you don’t think it’s useful to think of this as being 19x more consistent w/ the hypothesis that the individual tested has the condition? Does that information “depend crucially on conventional aspects of the prior distribution”? I assume not; the whole point of 19x here (of likelihood ratios anywhere) is that we are characterizing the evidentiary weight of the result in a manner that cordons it off from priors. Precisely b/c that’s what the test result is, one can’t report “the probabilities as is” that someone has the disease. Indeed, my sense is that if convention were to report results like this as a likelihood ratio — this result is 19x more consistent with your having the disease than not …– people would be much *less* likely to make the mistake of thinking that the probability someone has the condition in this case is “95%.” The concept of “likelihood ratio” really is critical for teaching people to think logically about weight of evidence!

    If you agree w/ me so far, then why the animus about Bayes Factor? It is trying to get people to think about the weight of the evidence in LR terms rather than try to see some study result as “conclusively” establishing something. Indeed, it is making crystal clear that one can “accept” the study w/o even having to think the hypothesis that it most supports is more likely than not to be true, etc. (b/c that *does* depend on priors).

    Likely I missed a critical previous episode — refer me to it if it answers the sort of questions I’m posing.
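The 19x figure in the screening example appears to assume a 5% false-positive rate alongside the 95% true-positive rate. A sketch of how such a likelihood ratio then combines with a prior (the 1% base rate below is an assumption for illustration, and is exactly the piece the LR deliberately leaves out):

```python
# Likelihood ratio for a positive screening test. Illustrative numbers:
# 95% true-positive rate and an implied 5% false-positive rate.
sens, fpr = 0.95, 0.05
lr = sens / fpr                        # about 19

# The LR multiplies prior odds to give posterior odds; the prior enters
# only at this step. A 1% base rate is assumed purely for illustration.
prior = 0.01
post_odds = lr * prior / (1 - prior)
post_prob = post_odds / (1 + post_odds)
print(round(lr), round(post_prob, 3))  # 19 0.161
```

So a 19x likelihood ratio against a 1% base rate still leaves the probability of disease at roughly 16%, nowhere near the naive "95%" reading.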

    • Andrew says:


      In a setting with true discrete alternatives (such as presence or absence of a disease), I think the likelihood ratio (or, equivalently, Bayes factor) is useful (even though I don’t find those “strength of evidence” phrases helpful). My problem with Bayes factor is the same as my problem with so-called type 1 and type 2 errors: I think these ideas are typically applied to problems where there are not clear discrete alternatives, and where the null/alternative and false positive/negative dichotomies are artificial.

      • dmk38 says:

        Okay, thx. That helps– b/c it reminds me that this is bound up w/ disagreement w/ the idea (Raftery’s in particular) that there should be some automated procedure or index for model selection ….

        But I’m wondering how to understand the point about importance of “discrete alternatives” & LRs.

        To ground this, consider the interesting discussion & data on the recent announcement of “2014 being warmest yr on record ….”

        Actually, it’s pretty clear 2014 is less likely than not to have been the warmest yr on record. But it is also more likely to have been the warmest yr than any other.

        Indeed, as Revkin notes, based on NOAA’s data in this table, 2014 is about 2.5x more likely than 2010 to have been warmest & 10x more likely than 1998.

        If one looks at the NASA data in same slide, then 2014 is only about 1.5x more likely than 2010.

        I find that really helpful & interesting; much more interesting than if someone had concluded, “sorry we can’t reject at p < 0.05 the null hypothesis that '2014 was no warmer' than any of the yrs for which there is temperature data …" We can't. But we *can* still say a lot of pretty interesting things about the data. And we can characterize the "weight of the evidence" that 2014 was warmest — in a manner that reflects how willing one should be to bet on the proposition.

        Then on the basis of that, whoever is using the data for whatever purpose can do whatever the heck he or she wants! But the information that was really in the data was delivered — thanks to using LR to get at it rather than just using some knucklehead type-1 threshold to determine whether one is "allowed" to make a (misleading) claim based on data.

        Another way to do this would be to juxtapose probability density distributions & then eyeball the overlap. That would be much more informative than, say, providing point estimates & confidence intervals.

        But what one would be “seeing” in that case is that if one were repeatedly to draw “yrs” from the combined pool, one would get 1.5 2014s for every 2010. Still LR in spirit.

        Would you say here the alternatives were “discrete”? I might be misunderstanding you, but I don’t think that the data here were discrete or had to be. If we had some unlimited number of previous yrs of temperature data, I’m surmising that a smart statistician could still have come up w/ some tractable way of saying how much more consistent the available evidence was with 2014 being “warmest” overall or “warmer” than some specified set of yrs…
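The comparison dmk38 describes is just a ratio of per-year probabilities. A sketch with made-up numbers chosen only to match the ratios quoted above (these are not the actual NOAA or NASA figures):

```python
# Hypothetical per-year probabilities of "was the warmest year on
# record", chosen to reproduce the quoted ratios, not the real data.
p = {"2014": 0.50, "2010": 0.20, "1998": 0.05}

lr_2010 = p["2014"] / p["2010"]
lr_1998 = p["2014"] / p["1998"]
print(round(lr_2010, 1), round(lr_1998, 1))  # 2.5 10.0

# Note 2014 can be the single most likely warmest year while still
# being "less likely than not": here P(2014 warmest) is only 0.50.
```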

    • Bill Jefferys says:

      If I were faced with a decision about what to do after getting a medical screening test that came up positive, I’d have to consider all sorts of things…how bad is it if there is a false/true positive and I elect to treat/not to treat? Even if I know the probability that I actually had the condition, the cure could be worse than the disease. That’s what loss functions are for.

      At my age, for example, the PSA test is no longer recommended, just because the probability of my dying of prostate cancer (should I be screened and biopsied positive) is such that I’d probably die of something else before it became an issue. Even the biopsy after an initial positive on the PSA test is problematic, it’s invasive and can result in serious side-effects in a small number of cases.

      In questions like this, the NNT (number needed to treat) is much more relevant even when considering whether to have the screening test, as is the NNH (number needed to harm). Not to mention the actual harms and benefits (as measured by the patient’s loss function, even if informally elicited).

      NNT and NNH are considerably more useful statistics for this sort of test, in my view, as well as being easier for non-statisticians to understand (when presented well, as in this recent article):

      Personally I have instructed my physician NOT to do the PSA test (which was never designed to be a screening test by its inventor, Richard Ablin, who has spoken out strongly against its use in this way). My decision might have been different had there been a history of aggressive prostate cancer in my family since that would affect my prior. But for most men with no special risks, the current recommendation not to have this test seems correct.

  7. You could at least have at the back of your mind the fact (I think first pointed out by Jack Good) that the expected Bayes factor against a true hypothesis is one. Since you then have the expectation of a non-negative quantity, Markov’s inequality gives that the probability under the null hypothesis of observing a Bayes factor of F or more is at most 1/F. In fact, I wouldn’t be surprised to find that you could set up model situations where no extra information was available and this fact described the evidence available pretty accurately.
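Good’s bound is easy to see in simulation; a sketch with a simple-vs-simple normal comparison (the hypotheses and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# H0: x ~ N(0, 1) versus H1: x ~ N(0.5, 1), one observation at a time.
# With data drawn under H0, the Bayes factor BF = p1(x)/p0(x) has
# expectation 1 (its mean is the integral of p1), so by Markov's
# inequality P(BF >= F | H0) <= 1/F for any threshold F.
x = rng.normal(0.0, 1.0, size=200_000)   # data generated under H0
bf = np.exp(0.5 * x - 0.125)             # p1(x)/p0(x) in closed form

print(round(bf.mean(), 2))               # close to 1
for F in (2, 5, 10):
    print(F, (bf >= F).mean(), "<=", 1 / F)
```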

  8. Sherman Dorn says:

    My personal table (back to p-values):

    < .001 Get straight
    < .01 Go forward
    < .05 Move ahead
    = .10 Whip it good

  9. I said similar things about Bayes factors in my blog post on the 26th, but added that a focus on posterior credible interval and ROPE (region of practical equivalence around null) focuses users on actual magnitude of effect and its uncertainty instead of on black/white “significance.” Post is here:
