A couple days ago we discussed some remarks by Tony O’Hagan and Jim Berger on weakly informative priors. Jim followed up on Deborah Mayo’s blog with this:

Objective Bayesian priors are often improper (i.e., have infinite total mass), but this is not a problem when they are developed correctly. But not every improper prior is satisfactory. For instance, the constant prior is known to be unsatisfactory in many situations. The ‘solution’ pseudo-Bayesians often use is to choose a constant prior over a large but bounded set (a ‘weakly informative’ prior), saying it is now proper and so all is well. This is not true; if the constant prior on the whole parameter space is bad, so will be the constant prior over the bounded set. The problem is, in part, that some people confuse proper priors with subjective priors and, having learned that true subjective priors are fine, incorrectly presume that weakly informative proper priors are fine.

I have a few reactions to this:

1. I agree with Berger that improper priors can sometimes work OK. You can’t evaluate just part of a model: the prior distribution, data model, and actual data all fit together.

2. I’m not sure who Berger’s “pseudo-Bayesians” are, but I agree that it’s not a good idea to simply use a flat but bounded prior distribution. As I wrote the other day:

I see no particular purity in fitting a model with unconstrained parameter space: to me, it is just as scientifically objective, if not more so, to restrict the space to reasonable values. It often turns out that soft constraints work better than hard constraints, hence the value of continuous and proper priors.

I find the term “weakly informative priors” to be very useful, and I’d like to use this space to plead with Jim Berger to use the term *not* for bad ideas such as constant densities over bounded spaces but rather for priors that use some general information for the purposes of regularization (“keeping things unridiculous”) in sparse-data settings.

I don’t know if this helps, but when I do a Google Scholar search on weakly informative priors, the top two hits are my own papers, where we indeed get more reasonable and stable estimates than would be obtained using priors that are traditionally considered noninformative.

3. Berger writes that some people have “learned that true subjective priors are fine.” I *don’t* think that “true subjective priors” are necessarily fine! If this distribution is based on bad information, it might be pretty horrible. Or, more to the point, in any case, how do you know that a subjective prior is actually “true”?

I often do find it helpful or even necessary to include prior information in an analysis, but it would be too much to ask me to supply anything like a “true subjective prior.” **The idea of weakly informative priors is to get some (but not all) of the benefit of the prior information while mitigating some (but not all) of the risk of including information that’s not really there**.
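To make the regularization idea concrete, here is a minimal sketch (all numbers hypothetical; a conjugate normal-normal update, not the method of any particular paper): a weakly informative prior pulls a noisy estimate partway toward a plausible center, without requiring anything like a full subjective prior.

```python
def normal_posterior(ybar, se, mu0, tau):
    """Posterior mean and sd for theta, with estimate ybar ~ N(theta, se^2)
    and weakly informative prior theta ~ N(mu0, tau^2)."""
    prec = 1 / tau**2 + 1 / se**2            # posterior precision
    mean = (mu0 / tau**2 + ybar / se**2) / prec
    return mean, prec ** -0.5

# A sparse-data estimate of 10 with standard error 2.5, and a weakly
# informative prior centered at 0 with sd 5 (wide, but not infinite):
mean, sd = normal_posterior(ybar=10.0, se=2.5, mu0=0.0, tau=5.0)
```

The posterior mean lands between the prior center and the raw estimate, and the posterior sd drops below the raw standard error: some, but not all, of the benefit of the prior information.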

To ground the discussion slightly, consider pp. 312-313 of my article with David Weakliem on sex ratios. We took real prior information from the scientific literature and we showed how it could be used in a Bayesian or non-Bayesian fashion (the latter using retrospective design analysis to show that, in the problem at hand, any plausible signal would be overwhelmed by noise). For certain purposes of decision analysis, it might have been appropriate for us to come up with our best subjective prior—really, it would be more of a model than a prior distribution, as it would need to account for our understanding of the measurement of beauty as well as the underlying parameter describing the relation of beauty to sex ratio—but for our purpose of understanding the limitations of the data in this particular problem, it was enough to use weak prior information.

**In summary . . .**

I see two issues here. One is a simple question of terminology. O’Hagan uses the term “weakly informative” for what are commonly considered reference or noninformative priors. Berger considers weakly informative priors to be a particularly bad choice involving flat distributions over a constrained space. I define weakly informative priors as containing real prior information, just less than one might have for any particular problem. Since I actually use these priors in my own applied work, I think I should get to decide the name! In any case, though, it’s probably good to point out the multiple definitions.

Beyond this, I think there’s a statistical point that Berger is missing, which is that there’s something between attempts at noninformativeness (on one hand) and a fully informative prior (on the other). To me, that’s the key idea of weak prior information: as O’Hagan writes, it can often be “difficult to formulate a genuine prior distribution carefully” (or, as I might put it, difficult to set up a full probability model). But at that point we don’t need to retreat to noninformativity; we can take a halfway point and set up a weakly informative prior that includes some, but not all, of the substantive information that is available.

I think this is an important concept in Bayesian statistics, which is why I’m speaking on it and writing long blog posts such as this one.

Perhaps a last comment of the year, from me?

First, in other situations rather than exactly here, those striving thoughtfully and hard to solve important problems often don’t read others’ work as carefully as one might hope. It appears that they see others’ work as neither particularly helpful nor threatening to what they themselves are currently deeply involved in.

Jim does seem to be clear in his objective [from the link below]:

“idea of an objective prior which should be maximally dominated by the data”

And his effort to enable others to reproduce his results:

“Present a simple constructive formula for a reference prior”

Though to me it is not clear why this would be most pragmatic [in the sense of best suited for one’s purpose rather than practical*], and it would seem to involve considerations such as those set out informally in Greenland and Gustafson’s evaluation of _best suited for one’s purpose_ under repeated draws from sets of wrong priors, or, more formally, Mike Evans’s work on relative surprise inference.

Anyway, now I have some careful reading to do in the New Year starting with

http://www.isds.duke.edu/~berger/papers/formal-def.pdf

CSP – and the very last word from him: after giving up hope of reclaiming any input into the accepted definition of pragmatic [as something practical, as if anyone could ever know what was meant by practical], he chose “pragmaticism” as it was so ugly no one would likely steal it.

[Apparently greatly expanded upon here http://en.wikipedia.org/wiki/Pragmaticism ]

Keith:

I think reference priors can be useful—we have them all over the place in Bayesian Data Analysis—but one of my problems with the theory of reference priors (as in the Berger et al. article you link to) is that it privileges the likelihood. Why be so sure that your data model is perfect, while also wanting to include no prior information at all? This doesn’t seem realistic to me. In real applications the data model and the prior are both full of arbitrary assumptions.

We can test the assumptions of the “data model”. One of the big problems in even thinking about testing priors is the unclarity as to what they are intending to measure. The data may be accepted without assigning priors to them as well. The very idea that all the claims we make should have probabilities attached seems to me at odds with ordinary thinking (certainly in my case).

Mayo:

It is possible to do all of statistics without probability models. They do some of this in machine learning. I don’t know that all the claims we make “should” have probabilities attached, but it is a useful way of setting up soft constraints, as in my papers with Jakulin et al. and Chung et al.

sorry, we couldn’t.

I once bought a dining room table and wanted to know how long it was. I measured it with a measuring tape, but the measurements had some error in them. So I took several measurements and used the age-old standard normal error model to get a more accurate estimate of the true length of my table.

To carry out a Bayesian Analysis however, I needed a prior for the length of the table. I decided the prior had to have the form

P(length) = 0 if length > 20 ft

I tested this prior by seeing if the table would fit in my dining room, which was 20 feet long. And it did! So the table had to be shorter.

I guess it isn’t such a “big problem to even think about testing a prior” after all.

It’s completely absurd to say the Data Model is testable while the prior isn’t.

The simplest and most common application of statistics to science is in dealing with measurement errors. When measuring lengths, for example, one uses a model

Measurement_i = length + error_i, for i = 1, 2, …

where the error_i are assumed to be iid normally distributed.
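The model above can be sketched in a few lines (a hedged illustration; the function name and the sample readings are mine, not from the comment):

```python
import math

def estimate_length(measurements):
    """Sample mean and its standard error under the iid normal error model."""
    n = len(measurements)
    mean = sum(measurements) / n
    # unbiased sample variance -> standard error of the mean
    var = sum((m - mean) ** 2 for m in measurements) / (n - 1)
    return mean, math.sqrt(var / n)

# e.g. three tape-measure readings of the same table, in feet:
mean, se = estimate_length([6.02, 5.98, 6.00])
```

Repeating the measurement drives the standard error down by a factor of sqrt(n), which is the whole point of taking several readings.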

This is probably done millions of times a day by students and scientists. And the “iid normal” data model is almost never tested. I’ve never once seen it tested in real life as a scientist (or carpenter), and I don’t know anybody who has. It’s simply assumed without further thought. In special instances when people do try to test the “iid normal” error model, it’s usually found wanting. (According to Jaynes, this led Feller to denounce the use of the “iid normal” model for measurement errors.)

On the other hand, prior information about length is vastly more tested and testable. When I say something like “my dining room table is between 2ft and 10ft in length” that’s far more realistic prior information than anything assumed in the data model.

So the central issue for any philosophy of statistics is not the testability of the data model, and certainly not the testability of the prior. The real issue is: “why can we use an untested and probably wrong data model millions of times a day and not run into a problem?”

Any philosophy of statistics which isn’t general enough or advanced enough to handle this simplest of applications to science simply isn’t worth spending any time on.

I am surprised you’d mention an example like this where it’s so clear that no use of a “prior” is needed or part of our ordinary everyday thinking. I want to know if it is going to fit or not, and how far off it’s likely to be from the measured results. So I need a way to determine how good a job a given measuring procedure has done in this task, and once I do, I’m done. If it’s reliable, I can say things like “there’s no way it won’t fit.” I do this quite a lot these days with respect to my new condo, because there are very often aspects of the elevator that fail to be considered by the furniture people. Now I know what other aspects to look for, e.g., the height of the entry door to the elevator, bits of moulding that can get in the way, etc. I’m interested in delineating the errors that might not have been taken into account in telling me “no problem, it will fit,” and not in anyone’s prior beliefs.

Not quite sure why I can’t reply to Mayo’s comment below so I’m replying here.

Looking for errors along the lines of “you neglected to take into account the molding around the door” is akin to looking for errors in the data model (the likelihood), but certainly we also have prior beliefs, such as for example what makes the scene from the movie Spinal Tap so funny when the 16 inch stonehenge drops down onto the stage.

Anyone who saw a drawing for a stage prop that had a measurement of 16 ” (double tick mark) would surely think to themselves “did they really want only sixteen inches high, or is this a case of a simple typographical error?”. The fact that they manufacture and deliver and use a 16 inch stonehenge, compensating for it by hiring a few under-sized dancers is what makes the scene so funny.

So a measurement procedure which produces results like a 16 inch stonehenge requires a lot of repeat confirmation before being accepted over our prior belief that stonehenges are generally more like 16 feet tall.

We can’t test all the assumptions of the data model. Google “untestable assumptions” to see aspects of models that have to be based on something other than the data at hand.

I had the immense pleasure of taking a course on Bayesian statistics from Dennis Lindley in the 1980s. Lindley, the arch-Bayesian’s Bayesian, demonstrated very clearly the “incoherence” of frequentist statistics, in that the latter violated a set of indisputably reasonable axioms. But the axioms also indisputably ruled out improper priors. Yet in one particularly vexing calculation Lindley was confronted with a stubborn nuisance parameter “u” that had to be subdued. The solution? “We’ll just multiply by an integral sign and toss in a du and make it go away.” In addition to a good laugh, I gained the insight that the utility of coherence, although high, is not infinite (unlike the integral).

What’s really the practical difference between an improper prior and, say, a Cauchy prior with an interquartile range 1 billion times larger than anything anyone could reasonably expect? That is, provided you have at least a few data points so the likelihood has some effect.

Daniel:

I recommend a weakly informative prior distribution, for example an interquartile range of 5, not 1 billion.
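For what it’s worth, here is a hedged numerical sketch of the comparison (hypothetical data, simple grid integration; for a Cauchy distribution the scale is half the interquartile range): with even a few data points, a Cauchy prior with an astronomically large scale is numerically indistinguishable from a flat prior, while a moderate scale shrinks the estimate a little.

```python
import math

def posterior_mean(y, sigma, log_prior, lo=-30.0, hi=30.0, steps=6000):
    """Posterior mean of theta by midpoint-rule grid integration,
    with y_i ~ N(theta, sigma^2) and an unnormalized log prior."""
    total = weighted = 0.0
    for i in range(steps):
        th = lo + (hi - lo) * (i + 0.5) / steps
        log_lik = -sum((v - th) ** 2 for v in y) / (2 * sigma ** 2)
        w = math.exp(log_lik + log_prior(th))
        total += w
        weighted += w * th
    return weighted / total

def log_cauchy(scale):
    # unnormalized log density of Cauchy(0, scale)
    return lambda t: -math.log(1.0 + (t / scale) ** 2)

y = [1.0, 2.0, 3.0]                              # three hypothetical data points
flat = posterior_mean(y, 1.0, lambda t: 0.0)     # flat (improper-style) prior
huge = posterior_mean(y, 1.0, log_cauchy(5e8))   # interquartile range ~ 1 billion
weak = posterior_mean(y, 1.0, log_cauchy(2.5))   # interquartile range ~ 5
```

The flat and huge-scale posteriors agree to many decimal places, while the weakly informative scale pulls the estimate slightly toward zero.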

Daniel: I think Robert’s (Lindley’s) point is that even though you can make stubborn nuisance parameters go away, they can and often will come back to haunt you.

As someone in an earlier comment pointed out regarding the omnipresence of Simpson’s paradox (that neglected dimension not being properly evaded), you might find the Bayesian version helpful to think about.

http://pubs.amstat.org/doi/abs/10.1198/tast.2010.10006