Deborah Mayo sent me this quote from Jim Berger:

Too often I see people pretending to be subjectivists, and then using “weakly informative” priors that the objective Bayesian community knows are terrible and will give ridiculous answers; subjectivism is then being used as a shield to hide ignorance. . . . In my own more provocative moments, I claim that the only true subjectivists are the objective Bayesians, because they refuse to use subjectivism as a shield against criticism of sloppy pseudo-Bayesian practice.

This caught my attention because I’ve become more and more convinced that weakly informative priors are the right way to go in many different situations. I don’t think Berger was talking about *me*, though, as the above quote came from a publication in 2006, at which time I’d only started writing about weakly informative priors.

Going back to Berger’s article, I see that his “weakly informative priors” remark was aimed at this article by Anthony O’Hagan, who wrote:

When prior information is weak, and the evidence from the data is relatively much stronger, then the data will dominate and . . . a weakly informative prior can be expected to give essentially the same posterior distribution as a more carefully considered prior distribution. The role of weakly informative priors is thus to provide approximations to a more meticulous Bayesian analysis.

This role is important, for two reasons. First, fully thought-through Bayesian analysis is a demanding task, so a quick and simple approximation is always welcome . . . The second reason why this is important is that the situation of weak prior information is one where it is particularly difficult to formulate a genuine prior distribution carefully. . . .

For this reason, I [O’Hagan] use weakly informative priors liberally in my own Bayesian analyses. . . . But let me emphasise that I would never give to such analyses any of the interpretations of objectivity that Berger would apparently wish them to have. They are approximations to the analyses that I might be able to perform given more time and resources. . . . Everything we do in practice is an approximation in exactly this sense: there is nothing special about using weakly informative priors in this way.

I pretty much agree with O’Hagan here except that I’d go even further and say that in many cases it’s not clear what the correct fully informative model would be. Given the information available in any given problem, I think I would in many cases prefer a weakly informative prior to a full subjective prior even if I were able to construct such a thing.

In any case, Mayo asked for my comments on Berger’s paragraph, and here’s what I have to say:

The statistics literature is big enough that I assume there really is some bad stuff out there that Berger is reacting to, but I think that when he’s talking about weakly informative priors, Berger is not referring to the work in this area that I like, as I think of weakly informative priors as specifically being designed to give answers that are *not* “ridiculous.”

Keeping things unridiculous is what regularization’s all about, and one challenge of regularization (as compared to pure subjective priors) is that the answer to the question, What is a good regularizing prior?, will depend on the likelihood. There’s a lot of interesting theory and practice relating to weakly informative priors for regularization, a lot out there that goes beyond the idea of noninformativity.

To put it another way: We all know that there’s no such thing as a purely noninformative prior: any model conveys some information. But, more and more, I’m coming across applied problems where I wouldn’t want to be noninformative even if I could, problems where some weak prior information regularizes my inferences and keeps them sane and under control.

Finally, I think subjectivity and objectivity both are necessary parts of research. Science is objective in that it aims for reproducible findings that exist independent of the observer, and it’s subjective in that the process of science involves many individual choices. And I think the statistics I do (mostly, but not always, using Bayesian methods) is both objective and subjective in that way. That said, I think I see where Berger is coming from: objectivity is a goal we are aiming for, whereas subjectivity is an unavoidable weakness that we try to minimize. I think weakly informative priors are, or can be, as objective as many other statistical choices, such as assumptions of additivity, linearity, and symmetry, choices of functional forms such as in logistic regression, and so forth. I see no particular purity in fitting a model with unconstrained parameter space: to me, it is just as scientifically objective, if not more so, to restrict the space to reasonable values. It often turns out that soft constraints work better than hard constraints, hence the value of continuous and proper priors. I agree with Berger that objectivity is a desirable goal, and I think we can get closer to that goal by stating our assumptions clearly enough that they can be defended or contradicted by scientific theory and data—a position to which I expect Deborah Mayo would agree as well.

(More from Mayo and others at her blog.)

Glad you have posted this.

I am still curious about the significance of Berger’s claim, made in his “more provocative moments”, that using the recommended conventional priors makes them (the default Bayesians who do it right, presumably) more like “true subjectivists”. It would seem that any time a Bayesian used priors that correctly reflected her beliefs (call these priors “really informed by subjective opinions”, or riso), and that satisfied the Bayesian formal coherency requirements, that would be defensible for a subjective Bayesian. But Berger says that in actuality many Bayesians (the pseudo-Bayesians) do not use riso priors. Rather, they use various priors (the origin of which they’re unsure of) as if these really reflected their subjective judgments. In doing so, the pseudo-Bayesian (thinks that she) doesn’t have to justify them: she claims that they reflect subjective judgments (and who can argue with that?).

According to Berger here, the Bayesian community (except for the pseudo-Bayesians?) knows that they’re terrible (Gelman calls them “ridiculous”), according to a shared criterion (is it non-Bayesian? Frequentist?). But I wonder: if, as far as the agent knows, these priors really do reflect the person’s beliefs, then would they still be terrible? It seems not. Or, if they still would be terrible, doesn’t that suggest a distinct criterion other than using “really informed” (as far as the agent knows) opinions or beliefs?

So what’s the criterion for “terrible/non-terrible”, “ridiculous/non-ridiculous”?

Mayo:

In the context of what I wrote above, “ridiculous” means “clearly in contrast with some knowledge that we have.” My point was that a weakly informative prior regularizes and keeps estimates away from the bad zone of nonsensical parameter values.

If there is genuine knowledge, say, about the range of a parameter, then that would seem to be something to be taken care of in a proper model of the case—at least for a non-Bayesian.

Mayo:

I agree. But, in common practice, Bayesians and non-Bayesians ignore lots of prior information. For example, it’s standard practice to run unconstrained logistic regressions, even though in real life we typically are pretty sure ahead of time that coefficients will be less than 5 in absolute value. This gets back to the benefit of weakly informative priors, or regularization. I present these ideas in a Bayesian context but would be happy for non-Bayesians to use them too. One of my frustrations with much of classical statistics is the reliance on conventional procedures such as maximum likelihood, 5% type 1 error rates, etc.
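Since the thread keeps returning to what “regularizes” actually means, here is a minimal numerical sketch of the idea. The data, the normal prior, and the scale 2.5 are my own illustrative choices, not from the post: with completely separated data, the maximum-likelihood slope in a logistic regression is infinite, while a weakly informative prior pulls the estimate back to a sane value.

```python
import numpy as np

# Illustrative 1-D data with complete separation: every x < 0 has y = 0 and
# every x > 0 has y = 1, so the unpenalized logistic-regression MLE for the
# slope is infinite (the fit keeps improving as the slope grows).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def neg_log_posterior(beta, prior_sd):
    """Negative log-posterior for slope beta with a normal(0, prior_sd) prior."""
    z = beta * x
    # log sigmoid(z) = -log(1 + exp(-z)), computed stably via logaddexp
    log_lik = np.sum(-y * np.logaddexp(0.0, -z) - (1.0 - y) * np.logaddexp(0.0, z))
    log_prior = -0.5 * (beta / prior_sd) ** 2
    return -(log_lik + log_prior)

grid = np.linspace(-20.0, 20.0, 8001)  # crude grid search keeps the sketch simple

def map_estimate(prior_sd):
    return grid[np.argmin([neg_log_posterior(b, prior_sd) for b in grid])]

print(map_estimate(2.5))    # weakly informative prior: a finite, moderate slope
print(map_estimate(100.0))  # near-flat prior: the estimate drifts far out
```

The same soft-constraint behavior appears here as in the post: the prior doesn’t forbid large slopes, it just makes the fit prefer reasonable ones unless the data insist otherwise.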

I still don’t see an advantage to using priors to take account of known backgrounds in modeling or interpreting results. How do they regularize? What does this mean? Anyway, I do not advocate “classical statistics”.

Mayo:

1. The term “regularization” does not originate with me.

2. I recognize that you don’t see an advantage to using priors to take account of known backgrounds in modeling or interpreting results. You’re not alone here. A lot of statisticians don’t like Bayesian methods. That’s ok; I just don’t like them telling me and others not to use Bayes. I got enough of that in my 6 years at Berkeley. I will solve my applied problems using methods that work for me (Bayesian methods are part of that portfolio), and you can use other methods if you’d like.

3. For details of the regularization properties of weakly informative priors, see my 2008 published paper with Jakulin et al. and my unpublished paper with Chung et al.

I’m not telling you what to do; I understand your relativistic standpoint: what’s right for me may not be right for you, etc. But there are objective properties that one can point to as a result of using one or another method (in taking account of this type of information). Certainly in many applied fields, information about restricting parameter values is not well-captured by priors over the parameter space; in fact, in many fields this kind of substantive information enters by restricting the parameter space itself. What one wants to do is test those assumed restrictions before imposing them. Even when the restrictions are validated by the data, imposing the restrictions can lead to less precise inferences. My colleague Aris Spanos is the one who can give actual examples.

I’m not getting this at all. Suppose you’re dealing with a set of prior probabilities on the xy plane. So you’re considering prior probabilities of the form P(x,y). Now consider two cases:

(1) Restrict the prior such that P(x,y)=0 unless x=y

(2) Restrict the prior such that P(x,y)=0 unless y=0

The second case you seem to be interpreting as “restricting the parameter space itself” or “selecting a proper model” which is a legitimate “error statistics” way to take the prior information.

While the first case you seem to be claiming is an illegitimate example of Bayesians using priors information to restrict parameter values.

Most people looking at (1) and (2) would conclude that they are fundamentally the same and if it’s ok to use prior information to do (2), as all statisticians do, then it’s ok to use prior information to do (1) as well.

Well, I’m not getting at all what you’re saying (Joseph), sorry. Since we’re talking about prior probabilities on parameters, I assume you mean x and y to be parameters? But, for starters, your claim that “all statisticians” assign priors as in (2) isn’t true (or else I don’t understand the meaning or relevance of (2)). Second, one does not restrict a parameter space by means of priors, in error statistics. Even if it was known that a parameter couldn’t take on a value, say, and suppose I grant your assigning that impossible value a 0 probability, this doesn’t tell you what non-zero degrees of prior probability to assign to all the other possible parameter values. So I think I’m really missing your point; that being the case, I cannot respond.

It’s unclear that testing restrictions before imposing them is “what one wants to do” – in actual practice this process has problems other than occasional lack of precision:

* There are at best a handful of directions of deviations from most models in which one will have useful power to detect model mis-specification. Furthermore, these directions are usually unknown to the researcher. Combined, these phenomena make it very hard to believe that one’s “restrictions are validated by the data”.

* The act of testing restrictions, if not properly accounted for in subsequent inference, also leads to problems; confidence intervals and tests do not perform as they should, they can be seriously misleading.

There are obviously statisticians out there with enough application-area expertise and statistical intuition to, in practice, largely avoid these problems. Andrew is a good example. But producing good analyses is hard work, and done under conditions far from the idealized ones that produce “objective properties that one can point to”.

Mayo,

You don’t seem to realize that “restricting the parameter space itself” is identical to a Bayesian “restricting parameter values”. You might claim that it isn’t, but it is a simple mathematical fact that they are the same, which is going to leave everyone scratching their heads wondering why you think one is good and the other bad.

To make case 2 more concrete, suppose x,y are coefficients in a standard regression. So you get the following:

Error Statistician: I’m going to use a model where the factor associated with y is left out.

Bayesian Statistician: I’m going to use a prior P(x,y) which is zero if y is not equal to 0.

These are mathematically identical. Again, the Error Statistician can claim he isn’t doing what the Bayesian is doing and vice versa, but it makes no difference to the mathematics. And all statisticians do this every time: in any real regression, the vast majority of possible factors in the Universe are left out.

To really drive home the point that cases (1) and (2) aren’t fundamentally different, consider what happens when x and y have the same units. You can then re-label the parameter space using a 45-degree rotation (combined with a rescaling). Use the following transformation:

u=(x+y)/2

v=(x-y)/2

Now case (1) becomes:

(1) Restrict the prior such that P(u,v)=0 unless v=0.

So the actual statistical problem and data haven’t changed at all, but case (1), which was originally interpreted as an illegitimate Bayesian “restriction of parameter values,” has now become the perfectly legitimate Error Statistician move of “leaving the factor associated with v out of the regression model.”

These are identical operations mathematically. So if you don’t see that, independent of anyone’s philosophy, these are the same, then that ends the conversation. If you do see the point, though, then please explain why case (1) is wrong if you use labels x,y but all of a sudden becomes legitimate if you use labels u,v.
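For what it’s worth, the commenter’s change of variables is easy to check numerically. A tiny sketch (the points are arbitrary, chosen only for illustration):

```python
import numpy as np

# A few arbitrary points in the (x, y) parameter plane: two on the line x = y,
# one off it. These are illustrative values, not from the discussion.
points = np.array([[1.0, 1.0], [-3.0, -3.0], [2.0, 5.0]])

# The commenter's change of variables: u = (x + y)/2, v = (x - y)/2,
# a 45-degree rotation combined with a 1/sqrt(2) rescaling.
u = (points[:, 0] + points[:, 1]) / 2
v = (points[:, 0] - points[:, 1]) / 2

on_line = np.isclose(v, 0.0)  # v = 0 exactly when x = y
print(on_line)  # [ True  True False]
```

So the constraint x = y in the original labels is literally the constraint v = 0 in the rotated labels, which is the point being argued: the same restriction can be described either as a degenerate prior or as dropping a parameter from the model.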

Andrew:

I might be wrong, but I think Berger’s comment was not aimed at Tony’s work here, but at the all-too-common practice in the applied (non-academic statistical) literature of thoughtlessly using default priors in software packages while also claiming to have explicitly modelled prior background knowledge, so that relevant probabilities of unknowns of interest have been successfully arrived at.

These claims are often strongly defended until the authors with the background expertise are asked to clarify how the particular prior probabilities used (perhaps specified by a couple parameter values) are connected to this extensive background knowledge.

And perhaps it was aimed directly at the makers of statistical software, who did not appear to be particularly careful or well-read on the challenges involved.

(Also, there is a preprint of his on using more informative priors in HIV vaccine trials that might provide a more recent view of things.)

Also, I noticed your Oct 2001 slides – I liked all the plots.

Delightful exchange. It is always interesting to see the tension between the limits of probability-matching priors, which may correspond to impossible risos, and quasi-risos that happen to behave well according to some external criteria.
