Data-driven Vague Prior Distributions

I’m not one to go around having philosophical arguments about whether the parameters in statistical models are fixed constants or random variables. I tend to do Bayesian rather than frequentist analyses for practical reasons: It’s often much easier to fit complicated models using Bayesian methods than using frequentist methods. This was the case with a model I recently used as part of an analysis for a clinical trial. The details aren’t really important, but basically I was fitting a hierarchical, nonlinear regression model that would be used to impute missing blood measurements for people who dropped out of the trial. Because the analysis was for an FDA submission, it might have been preferable to do a frequentist analysis; however, this was one of those cases where fitting the model was much easier to do Bayesianly. The compromise was to fit a Bayesian model with a vague prior distribution.

Sounded easy enough, until I noticed that making small changes in the parameters of what I thought (read: hoped) was a vague prior distribution produced substantial changes in the posterior distribution. When using proper prior distributions (and there are all kinds of good reasons to use them), even if the prior variance is really large there’s a chance that the prior density is decreasing exponentially in a region of high likelihood, resulting in parameter estimates based more on the prior distribution than on the data. Our attempt to fix this potential problem (it’s not necessarily a problem if you really believe your prior distribution, but sometimes you don’t) is to perform preliminary analyses to estimate where the mass of the likelihood is. A vague prior distribution is then one that is centered near the likelihood but with much larger spread.

We estimate the location and spread of the likelihood by capitalizing on the fact that the posterior mean and variance are a combination of the prior mean and variance and the “likelihood” mean and variance. Consider the model for multivariate normal data with known covariance matrix, and a multivariate normal prior distribution on the mean vector:

$y \mid \mu, \Sigma \sim N(\mu, \Sigma)$
$\mu \sim N(\mu_0, \Delta_0)$.

The posterior distribution of $\mu$ (where $n$ is the number of observations and $\bar{y}$ is the sample mean) is:

$\mu \mid y, \Sigma \sim N(\mu_n, \Delta_n)$, where

$\mu_n = (\Delta_0^{-1} + n\Sigma^{-1})^{-1} (\Delta_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y})$

$\Delta_n^{-1} = \Delta_0^{-1} + n\Sigma^{-1}$.
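
To make the conjugate update concrete, here is a minimal numpy sketch; the function and variable names (posterior_moments, mu0, Delta0, ybar) are my own, chosen just for illustration:

```python
import numpy as np

def posterior_moments(mu0, Delta0, Sigma, ybar, n):
    """Conjugate update for a multivariate normal mean with known covariance Sigma.

    mu0, Delta0 : prior mean and covariance of mu
    ybar, n     : sample mean and number of observations
    Returns the posterior mean mu_n and covariance Delta_n.
    """
    prior_prec = np.linalg.inv(Delta0)
    lik_prec = n * np.linalg.inv(Sigma)
    Delta_n = np.linalg.inv(prior_prec + lik_prec)         # Delta_n^{-1} = Delta_0^{-1} + n Sigma^{-1}
    mu_n = Delta_n @ (prior_prec @ mu0 + lik_prec @ ybar)  # precision-weighted average of mu_0 and ybar
    return mu_n, Delta_n
```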

Here $\bar{y}$ and $\Sigma/n$ represent what I’m calling the likelihood mean and variance of $\mu$. If we could not calculate them directly, we could recover them by solving the above two equations for $\bar{y}$ and $\Sigma/n$, obtaining

$\Sigma/n = (\Delta_n^{-1} - \Delta_0^{-1})^{-1}$

$\bar{y} = (\Sigma/n)\,\Delta_0^{-1}(\mu_n - \mu_0) + \mu_n$.
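
The back-solving step is just as short in code; a sketch under the same assumptions, reusing the names from the previous snippet:

```python
def likelihood_moments(mu0, Delta0, mu_n, Delta_n):
    """Back out the 'likelihood' mean (ybar) and variance (Sigma/n) of mu
    from the prior and posterior means and covariances."""
    prior_prec = np.linalg.inv(Delta0)
    lik_var = np.linalg.inv(np.linalg.inv(Delta_n) - prior_prec)  # Sigma/n
    lik_mean = lik_var @ prior_prec @ (mu_n - mu0) + mu_n         # ybar
    return lik_mean, lik_var
```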

A vague prior distribution for $\mu$ could then be something like $N(\bar{y}, \Sigma)$ or $N(\bar{y}, 20\Sigma/n)$. For more complicated models you could do the same thing. Let $(\mu_0, \Delta_0)$ and $(\mu_n, \Delta_n)$ represent the prior and posterior mean vector and covariance matrix of the model hyperparameters. First fit the model (with a multivariate normal prior distribution on the hyperparameters) for any convenient choice of $(\mu_0, \Delta_0)$, then use the equations above to estimate the location and spread of the likelihood for these parameters. This approximation relies on approximate normality of the hyperparameters. In large samples this should hold; in smaller samples, transformations of the parameters can make the normal approximation more accurate.
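
For the more complicated case, here is one hypothetical way to drive the back-solving from MCMC output; the helper name and the draws array are my own illustration, not part of the original analysis:

```python
def likelihood_moments_from_draws(draws, mu0, Delta0):
    """Estimate the likelihood mean/variance of the hyperparameters from posterior samples.

    draws : (iterations x parameters) array of posterior draws of the (possibly transformed)
            hyperparameters, obtained by fitting the model with a convenient N(mu0, Delta0) prior.
    """
    mu_n = draws.mean(axis=0)              # approximate posterior mean
    Delta_n = np.cov(draws, rowvar=False)  # approximate posterior covariance
    return likelihood_moments(mu0, Delta0, mu_n, Delta_n)
```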

It’s also possible to check the accuracy of the likelihood approximation: fit the model again using the estimated likelihood mean and variance as the prior mean and variance. If the likelihood approximation is good, the resulting posterior mean should approximately equal the prior mean, and the posterior variance should be about half the prior variance (with the prior matching the likelihood, the posterior precision is the sum of two equal precisions). If not, the process can be iterated: fit the model, estimate the likelihood mean and variance, use these as the prior mean and variance, fit the model again, and compare the prior and posterior means and variances. Repeat until the prior and posterior means are approximately equal and the posterior variance is about half the prior variance. From there, a vague prior can be obtained by setting the prior mean to the estimated likelihood mean and the prior variance to the estimated likelihood variance scaled by some large constant.
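
One possible way to organize that iteration, again only a sketch: fit_model below is a hypothetical stand-in for refitting the full model with a given normal prior on the hyperparameters and returning the posterior mean and covariance (e.g., computed from MCMC draws as above), and the tolerances are arbitrary:

```python
def iterate_vague_prior(fit_model, mu0, Delta0, scale=20.0, max_iters=10):
    """Iterate until the prior matches the estimated likelihood, then inflate the spread."""
    for _ in range(max_iters):
        mu_n, Delta_n = fit_model(mu0, Delta0)
        # stop when posterior mean ~ prior mean and posterior variance ~ half the prior variance
        if np.allclose(mu_n, mu0, rtol=0.05) and np.allclose(Delta_n, 0.5 * Delta0, rtol=0.1):
            break
        # otherwise use the estimated likelihood moments as the next prior and refit
        mu0, Delta0 = likelihood_moments(mu0, Delta0, mu_n, Delta_n)
    return mu0, scale * Delta0  # vague prior: centered at the likelihood, with inflated spread
```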

I’ve tried this method in some simulations and it seems to work, in the sense that after iterating the above procedure a few times you do obtain an estimated likelihood mean and variance that, when used as the prior mean and variance, lead to a posterior distribution with the same mean and half the variance. With simple or well-understood models, there are surely better ways than this to come up with a vague prior distribution, but in complex models this method could be a helpful last resort.
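
For what it’s worth, here is a self-contained toy version of such a simulation (my own made-up numbers, reusing the sketches above), in which the exact conjugate update stands in for the model fit:

```python
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
y = rng.multivariate_normal(mean=[5.0, -2.0], cov=Sigma, size=50)
ybar, n = y.mean(axis=0), len(y)

def fit(mu0, Delta0):
    # the exact conjugate update plays the role of refitting the model
    return posterior_moments(mu0, Delta0, Sigma, ybar, n)

# start from a deliberately poorly centered prior and iterate
vague_mean, vague_cov = iterate_vague_prior(fit, np.zeros(2), 100.0 * np.eye(2))
print(vague_mean)  # close to ybar
print(vague_cov)   # close to 20 * Sigma / n
```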

5 thoughts on “Data-driven Vague Prior Distributions”

  1. First, a minor comment: we are trained to speak of effects of causes, not causes of effects. But that being said, I would think that a big reason why you, like me, tend to prefer Bayesian analyses is that you were trained that way, are expert in them, and from your experience and training you understand them better than other statistical methods.

    Now, on to your main method: I suspect that, as with other improper Bayes methods, it can be interpreted as an approximation to some hierarchical model. (For example, a model in which the "prior parameters" are estimated from data can be done hierarchically, as in chapter 5 of our book. For another example, a robust method in which outlying observations are downweighted can be fit using a t model, which can be viewed as a hierarchical normal model in which observations have different variances, and the "robustification" procedure is a crude way of estimating and using these data-level variances.)

    For your problem, perhaps one way to start is to abandon the concept of the "likelihood" and, instead, for the first step, use the posterior density under some weakly-informative prior distribution. Perhaps this will take you toward the goal of setting it up hierarchically, which I suspect will make the method easier to understand.

  2. Is there something wrong with the dirt-simple procedure of drawing an envelope around the data and then making the variance six times the size of the envelope or something?

  3. Dsquared,

    Sam can answer this one better than I can–but I think the challenge is in generalizing your "dirt-simple" procedure to the case of indirect data. To start with, in a regression context, your prior distribution for "beta" isn't necessarily scaled to the variance of the data. Then you can get to logistic regressions, nonlinear predictors, etc. I think the idea is to come up with a general procedure that allows this sort of weakly-informative model.

  4. This method sounds reasonable, but I am afraid that from a Bayesian perspective, specifying a prior distribution after the model fitting is seen as cheating. Prior distributions are in principle specified before observing the data. I wonder if this ML-Bayes combination is not some sort of "empirical Bayes", but I would agree that it might prove useful in many cases, provided the data convey substantial information about the parameter. The idea of considering much larger prior variances is appealing. One would also need to make sure that the likelihood surface is well-behaved (for example, no multimodality). But still, one needs to make one's paradigm clear: frequentist or Bayesian. Come on, nobody can pretend to be "not going around having philosophical arguments"…that would simply make science as boring as politics.

    Coming back to your problem, if, as you report, small changes in the prior parameters yield substantial changes in the resulting posterior distributions, then the problem might be that some parameters are only poorly covered by the likelihood. This problem is commonly known as weak identifiability. Weakly identifiable parameters are characterised by inflated posterior variances. In general such a problem results from the structure of the data at hand. If that is the case, your likelihood "short-cut" would not help anyway, since there are simply no suitable data to inform the parameters. A sensible way of dealing with weak identifiability is to introduce suitable informative priors for the poorly informed parameters. Otherwise, one requires suitable data. In the absence of suitable data and suitable prior information, I am afraid that no miracle can be done.

  5. Thank you all for the comments. I am stuck in a similar situation, where the sample size is small and the posterior densities improve in shape (become more Normal) following small modifications of the prior (e.g. when the posterior density is skewed, I interpret it as indicating that the prior should be moved toward the point where there is more mass). I am not sure if this "snooping" attitude is welcome at all.

    Now, I am faced with a more challenging situation and have no clue how to resolve it. My posterior density turns out to be uniform or flat, meaning every parameter value in the interval is equally likely. What can I do to ensure that I get a Normal-looking posterior density? So far, tuning the prior has not helped an inch.

    Thanks in advance for any suggestions.

Comments are closed.