It is not necessary that Bayesian methods conform to the likelihood principle

Bayesian inference, conditional on the model and data, conforms to the likelihood principle. But there is more to Bayesian methods than Bayesian inference. See chapters 6 and 7 of Bayesian Data Analysis for much discussion of this point.

It saddens me to see that people are still confused on this issue.

10 thoughts on “It is not necessary that Bayesian methods conform to the likelihood principle”

  1. Thank you for the comment! I recently started writing that “Bayesian conditionalization obeys the Likelihood Principle” rather than that “Bayesian methods” do so when I discuss this issue. I must have written this post before I realized that I was being sloppy on this point. I can certainly see why it is frustrating for you to see me identify Bayesian methods with Bayesian inference, given the work you have done to advance Bayesian methods in the broader sense. Mea culpa!

  2. I fail to see who remains confused by this issue, or rather, I’m not sure which issue you think people are confused on. Presumably, rereading Greg’s note, the confusion is on whether Bayesian methods = Bayes’ Rule. But that is not confusion regarding the LP. Just wanted to get that straight. That said, I’m a little surprised you accept the LP, even given the model, since reference Bayesians do not.

    • Mayo:

      Bayesian inference is step 2 of the 3-step process of Bayesian data analysis: (1) model building, (2) inference conditional on the model, (3) model checking. It saddens me that some people still associate Bayesian methods with step 2 alone.

      • Bill, a “reference Bayesian” is an “objective Bayesian” who uses reference priors. The reference prior algorithm uses the form of the sampling distribution to derive the form of the prior; the reference argument involves asymptotics and, at least in that sense, considers hypothetical data in a way reminiscent of frequentist methods that violate the LP. I’m not so sure the reference argument formally violates Bayesian principles. But it does leave one uncomfortable (though perhaps the discomfort should be with the idea of a default “uninformative” prior).

        E.g., the reference prior for the probability parameter in a binomial distribution is different from that for the probability parameter in a negative binomial distribution. I.e., the reference prior for the probability of heads when analyzing the outcome of seeing n heads in N trials is different if you fixed N in advance (n is “random,” binomial case), or if instead you decided to flip until you saw n heads (N is “random,” neg. binomial case).

        Look for a 1998 “catalog of noninformative priors” (hate that phrase) by Yang and Berger (your sometimes collaborator!) for more examples. The theory behind reference priors has matured a lot since then; the definitive reference is a recent paper by Bernardo (inventor of the idea) and Berger, “The formal definition of reference priors.”
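The binomial versus negative binomial example above can be checked numerically. What follows is a minimal sketch, not anything taken from the thread: it assumes that in these one-parameter models the reference prior coincides with the Jeffreys prior, i.e. Beta(1/2, 1/2) for the binomial and a prior proportional to theta^(-1) (1 - theta)^(-1/2) for the negative binomial. The function names and the numbers n = 3, N = 12 are made up for illustration.

```python
from math import comb

def binomial_likelihood(theta, n, N):
    """P(n heads | N flips, with N fixed in advance)."""
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

def negative_binomial_likelihood(theta, n, N):
    """P(N flips needed | flip until the n-th head); the last flip is a head."""
    return comb(N - 1, n - 1) * theta**n * (1 - theta)**(N - n)

def posterior_mean_binomial_jeffreys(n, N):
    # Assumed prior Beta(1/2, 1/2) gives posterior Beta(n + 1/2, N - n + 1/2).
    return (n + 0.5) / (N + 1.0)

def posterior_mean_negbin_jeffreys(n, N):
    # Assumed prior proportional to theta^-1 (1 - theta)^-1/2
    # gives posterior Beta(n, N - n + 1/2), which is proper for n >= 1.
    return n / (N + 0.5)

n, N = 3, 12  # hypothetical data: 3 heads in 12 flips
thetas = [0.1, 0.25, 0.5, 0.75, 0.9]

# The two likelihoods differ only by a constant factor that is free of theta.
ratios = [binomial_likelihood(t, n, N) / negative_binomial_likelihood(t, n, N)
          for t in thetas]
print(ratios)

# Yet the posterior means differ, because the priors differ.
print(posterior_mean_binomial_jeffreys(n, N))
print(posterior_mean_negbin_jeffreys(n, N))
```

The likelihood ratio is the same at every theta, so the observed data enter both analyses through proportional likelihoods; any difference in the posteriors comes entirely from the prior, which is where the stopping rule has left its mark.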

        • Tom,

          In my formulation above ((1) model building, (2) inference conditional on the model, (3) model checking), this suggests that the data model (not just the likelihood but also any data-dependent stopping rule, etc.) can influence step 1 of Bayesian data analysis. I still think that the likelihood principle trivially holds for step 2, but as Mayo would say, such a statement is so trivial that it cannot be what is meant when people talk about the likelihood principle.

          I am happy to say that steps 1 and 3 are important parts of Bayesian data analysis, hence the entire data model matters for me, not just the likelihood.

        • Andrew, I agree with your viewpoint, which is partly why I wrote that I don’t think reference priors are necessarily in trouble from a Bayesian point of view for this dependence on the sampling distribution. After all, the “C” (say, for “context;” I suppose “I,” for “information,” is more conventional) in p(H_i|C) is the same “C” in p(D_obs|H_i,C), with the product of these forming the numerator in the usual form of Bayes’s theorem. That is, the contextual/background information being used to assign the likelihood function (via the sampling distribution for D) is also available for assigning the prior. I still can sympathize with those who intuitively feel some parts of C should not be affecting the prior in particular cases (e.g., a supposedly “noninformative” setting). But the discomfort doesn’t mean there is a problem in principle. The problem might in fact be with the intuition.

          “… such a statement is so trivial that it cannot be what is meant when people talk about the likelihood principle.” The Berger & Wolpert *LP* book played a role in motivating me to adopt Bayesian methods for astronomy problems back when I was a grad student. That said, there’s something a little uncomfortable about the principle, or at least about summaries of it that imply that it’s bad to consider data that you might have seen but did not in your inferences. p(D_obs|…) appears in Bayes’s theorem, not just a bare likelihood kernel (for lack of a better term). It’s a probability distribution over the sample space, so you have to at least define the sample space (and thus think about other possible values for the data) in order to formulate a Bayesian inference problem.

          I dimly recall Jaynes struggling over this someplace, saying there should be a way of assigning p(D_obs|…) without having to be explicit about other possible values for the data, but to my knowledge he never made any headway with this.

          PS: The perspective on BDA you are emphasizing here seems to me to have much in common with Roderick Little’s “calibrated Bayes” viewpoint. I’d love to see you take that up in a future post and discuss where your viewpoints agree and/or disagree. (Vested interest: I briefly “advertised” calibrated Bayes to astronomers, along with some of your perspectives on statistical pluralism and multiple testing, in a recent proceedings paper, arXiv:1208.3035.)

        • Hi Tom, thanks, that is what I thought it meant, I just had not run across that term before. I’m aware of the fact that the reference prior on binomial data differs from the one on negative binomial data.

          I don’t think that use of a reference prior is in conflict with the LP. The LP just tells us how the observed data (should) enter into the problem:

          “In making inferences or decisions about the parameters
          $\theta$ of a problem after observing data x, all relevant
          information about $\theta$ is contained in the likelihood
          function for the observed x. Furthermore, two
          likelihood functions contain the same information
          about $\theta$ if they are proportional to each other (as
          functions of $\theta$).” (J.O. Berger, Statistical Decision
          Theory and Bayesian Analysis, Second Edition, p. 28)

          Obviously we have to have a prior, and the prior has (by definition) prior information about $\theta$. If we devise a prior by using the sampling distribution, that is not using the observed data x in any way since we are going to integrate over x to produce that prior. Any prior that we use is going to influence the posterior on $\theta$. So it remains true that the way the observed data x enter into the inference is still solely through the likelihood function.

          I’m fully in agreement with what Andrew says here about the importance of steps 1 and 3.
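The point that a prior built from the sampling distribution integrates over x, rather than using the observed x, can be illustrated with a small sketch. It assumes the prior is the Jeffreys prior (the square root of the Fisher information), which is an assumption made here for illustration and not something stated in the thread. The function names are hypothetical, and the negative binomial sum is truncated at an arbitrary `n_max`.

```python
from math import comb

def fisher_information_binomial(N, theta):
    """Expected squared score, summed over every possible count x in 0..N."""
    total = 0.0
    for x in range(N + 1):
        pmf = comb(N, x) * theta**x * (1 - theta)**(N - x)
        score = x / theta - (N - x) / (1 - theta)  # d/dtheta of the log-likelihood
        total += pmf * score**2
    return total

def fisher_information_negbin(n, theta, n_max=500):
    """Same expectation, but over the number of flips N needed to reach n heads.

    The infinite sum over N is truncated at n_max (an arbitrary cutoff).
    """
    total = 0.0
    for N in range(n, n_max + 1):
        pmf = comb(N - 1, n - 1) * theta**n * (1 - theta)**(N - n)
        score = n / theta - (N - n) / (1 - theta)
        total += pmf * score**2
    return total

theta = 0.3
# Closed forms: N / (theta (1 - theta)) and n / (theta^2 (1 - theta)).
print(fisher_information_binomial(12, theta), 12 / (theta * (1 - theta)))
print(fisher_information_negbin(3, theta), 3 / (theta**2 * (1 - theta)))
```

Neither function takes an observed data value as an argument: each depends only on the design (N fixed, or n fixed) and on theta. The two designs give different functions of theta, and hence different priors, without the observed x ever being consulted.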
