Nick Firoozye writes:
While I am absolutely sympathetic to the Bayesian agenda, I am often troubled by the requirement of having priors. We must have priors on the parameters of an infinite number of models we have never seen before, and I find this troubling. There is a similarly troubling problem in the economics of utility theory. Utility is defined on consumables. To be complete, a consumer must assign utility to all sorts of things they would never have encountered. More recent versions of utility theory instead treat consumption goods as portfolios of attributes: a Cadillac is x units of luxury, y units of transport, and so on. We can then automatically have personal utilities for all these attributes.
I don’t ever see parameters. Some models have few and some have hundreds. Instead, I see data. So I don’t know how to have an opinion on parameters themselves. Rather, I think it far more natural to have opinions on the behavior of models. The prior predictive density is a good and sensible notion. So is the prior conditional density, if we have conditional densities for VARs: you have opinions about how variables interact and about the forecast of some subset conditional on the remainder. That this may or may not give enough information to ascribe a proper prior in parameter space is all the better. To the extent it does not, we must arbitrarily pick one (e.g., a reference prior or a maxent prior subject to the data/model prior constraints). Without reference to actual data I do not see much point in trying to have any opinion at all.
My reply: I do have some thoughts on the topic, especially after seeing Larry’s remark (which I agree with) that “noninformative priors are a lost cause.”
As I wrote in response to Larry, in some specific cases, noninformative priors can improve our estimates (see here, for example), but in general I’ve found that it’s a good idea to include prior information. Even weak prior information can make a big difference (see here, for example).
And, yes, we can formulate informative priors in high dimensions, for example by assigning priors to lower-dimensional projections that we understand. A reasonable goal, I think, is for us to set up a prior distribution that is informative without hoping that it will include all our prior information. Which is the way we typically think about statistical models in general. We still have a ways to go, though, in developing intuition and experience with high-dimensional models such as splines and Gaussian processes.
I will illustrate some of the simpler (but hardly trivial) issues with prior distributions with two small examples.
Example 1: Consider an experiment comparing two medical treatments, with an estimated effect of 1 (on some scale) and a standard error of 1. Such a result is, of course, completely consistent with a zero effect. The usual Bayes inference (with noninformative uniform prior) gives a N(1,1) posterior distribution for the effect, thus implying an 84% probability that the effect is positive.
This seems wrong, the idea that something recognizable as pure noise can lead to 5:1 posterior odds.
The problem is coming from the prior distribution. We can see this in two ways. First, just directly, effects near zero are more common than large effects. In our 2008 paper, Aleks and I argued that logistic regression coefficients are usually less than 1. So let’s try combining N(1,1) data with a Cauchy(0,1) prior. It’s easy enough to do in Stan.
First the model (which I’ll save in a file “normal.stan”):
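data { real y; }
parameters { real theta; }
model { theta ~ cauchy(0, 1);  // prior: effects near zero are more common
        y ~ normal(theta, 1); }  // likelihood: estimate y with standard error 1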
Then the R script:
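library(rstan)
y <- 1  # the observed estimate; its standard error of 1 is fixed in the model
fit1 <- stan(file = "normal.stan", data = "y", iter = 1000, chains = 4)
sim1 <- extract(fit1, permuted = TRUE)
print(mean(sim1$theta > 0))  # share of posterior draws with theta > 0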
The result is 0.77, that is, roughly 3:1 posterior odds that the effect is positive.
Just to check that I’m not missing anything, let me re-run using the flat prior. New Stan model:
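data { real y; } parameters { real theta; }
model { y ~ normal(theta, 1); }  // no prior statement, so theta gets a flat prior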
and then I rerun with the same R code. This time, indeed, 84% of my posterior simulations of theta are greater than 0.
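As a cross-check that does not need Stan, the same two posterior probabilities can be approximated by integrating prior times likelihood on a grid. This is only a sketch in Python using the standard library. The function names, the grid limits of -30 to 30 (which stand in for the unbounded flat prior), and the number of grid cells are my own choices for illustration, not part of the analysis above.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def cauchy_pdf(x, loc, scale):
    """Density of Cauchy(loc, scale) at x."""
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def prob_positive(prior_pdf, y=1.0, se=1.0, lo=-30.0, hi=30.0, n=60000):
    """Approximate Pr(theta > 0 | y) with a midpoint rule on [lo, hi].

    The unnormalized posterior is prior_pdf(theta) * N(y | theta, se^2).
    The grid limits are an assumption: the likelihood is negligible
    outside them for y = 1 and se = 1.
    """
    h = (hi - lo) / n
    total = 0.0
    positive = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * h  # midpoint of grid cell i
        w = prior_pdf(theta) * normal_pdf(y, theta, se)
        total += w
        if theta > 0:
            positive += w
    return positive / total

# Flat prior: any constant works, since it cancels in the ratio.
p_flat = prob_positive(lambda theta: 1.0)
# Unit Cauchy prior centered at zero.
p_cauchy = prob_positive(lambda theta: cauchy_pdf(theta, 0.0, 1.0))

print("flat prior:  ", round(p_flat, 3))
print("Cauchy prior:", round(p_cauchy, 3))
```

The flat-prior number should come out at about 0.84, and the Cauchy-prior number in the mid-0.70s, which agrees with the simulation results up to Monte Carlo error.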
So far so good. Although one might argue that the posterior probability of 0.77 (from the inference given the unit Cauchy prior) is still too high. Perhaps we want a stronger prior? This sort of discussion is just fine. If you look at your posterior inference and it doesn’t make sense to you, this “doesn’t make sense” corresponds to additional prior information you haven’t included in your analysis.
OK, so that’s one way to consider the unreasonableness of a noninformative prior in this setting. It’s not so reasonable to believe that effects are equally likely to be any size. They’re generally more likely to be near zero.
The other way to see what’s going on with this example is to take that flat prior seriously. Suppose theta really could be just about anything—or, to keep things finite, suppose you wanted to assign theta a uniform prior distribution on [-1000,1000], and then you gather enough data to estimate theta with a standard deviation of 1. Then, a priori, you’re nearly certain to gather very very strong information about the sign of theta. To start with, there’s a 0.998 chance that your estimate will be more than 2 standard errors away from zero so that your posterior certainty about the sign of theta will be at least 20:1. And there’s a 0.995 chance that your estimate will be more than 5 standard errors away from zero.
So, in your prior distribution, this particular event—that y is so close to zero that there is uncertainty about theta’s sign—is extremely unlikely. And it would be irrelevant that y is not statistically significantly different from 0.
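The arithmetic behind those two probabilities is short enough to write out. Here is a sketch in Python (standard library only). The function names are mine, and the calculation uses the fact that, away from the edges of the interval, the prior predictive density of the estimate is essentially flat at 1/2000.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

HALF_WIDTH = 1000.0  # theta ~ Uniform(-1000, 1000)
SE = 1.0             # standard error of the estimate

def prob_estimate_beyond(k):
    """Prior probability that the estimate is more than k standard errors from zero.

    Approximation: the estimate's prior predictive density is flat at
    1 / (2 * HALF_WIDTH) near zero, so the band [-k*SE, k*SE] has
    probability 2*k*SE / (2*HALF_WIDTH).
    """
    return 1.0 - (k * SE) / HALF_WIDTH

def sign_odds(k):
    """Flat-prior posterior odds on the sign of theta when the estimate is k standard errors from zero."""
    p = normal_cdf(k)
    return p / (1.0 - p)

p_beyond_2 = prob_estimate_beyond(2)
p_beyond_5 = prob_estimate_beyond(5)
odds_at_2 = sign_odds(2)

print(p_beyond_2, p_beyond_5)
print(round(odds_at_2, 1))
```

An estimate exactly 2 standard errors from zero already gives odds of more than 40:1 on the sign under the flat prior, so "at least 20:1" is a conservative statement.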
Example 2: The basic mathematics above is, in fact, relevant in many real-life situations. Consider one of my favorite examples, the study that found that more attractive parents were more likely to have girls. The result from the data, after running the most natural (to me) regression analysis, was an estimate of 4.7% (that is, in the data at hand, more beautiful parents were 4.7 percentage points more likely, on average, to have girls than less beautiful parents) with a standard error of 4.3%. The published analysis (which isolated the largest observed difference in a multiple comparisons setting) was a difference of 8% with a standard error of about 3.5%. In either case, the flat-prior analysis gives you a high posterior probability that the difference is positive in the general population, and a high posterior probability that this difference is large (more than 1 percentage point, say).
Why do I say that a difference of more than 1 percentage point would be large? Because, in the published literature on sex ratios, most differences (as estimated from large populations) are much less than 1%. For example, African-American babies are something like 0.5% more likely to be girls, compared to European-American babies. The only really large effects in the literature come from big things like famines.
Based on the literature and on the difficulty of measuring attractiveness, I’d say that a reasonable weak prior distribution for the difference in probability of girl birth, comparing beautiful and ugly parents in the general population, is N(0,0.003^2), that is, normal centered at 0 with standard deviation 0.3 percentage points. This is equivalent to data from approximately 166,000 people. (Consider a survey with n parents. Compare the sex ratio of the prettiest n/3 to that of the ugliest n/3; the s.e. is sqrt(0.5^2/(n/3) + 0.5^2/(n/3)) = 0.5 sqrt(6/n). Equivalent information: 0.003 = 0.5 sqrt(6/n). Solve for n and you get about 166,000.)
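The sample-size calculation in the parenthetical can be checked in a few lines. This is just the algebra written out in Python; the variable and function names are mine.

```python
import math

PRIOR_SD = 0.003  # prior standard deviation: 0.3 percentage points

def se_difference(n):
    """Standard error of the difference in proportions between the top and bottom thirds of n parents.

    Each proportion has variance 0.5^2 / (n/3), and the two groups are independent.
    """
    return math.sqrt(0.5**2 / (n / 3) + 0.5**2 / (n / 3))

# Solve PRIOR_SD = 0.5 * sqrt(6 / n) for n.
n_equivalent = 6.0 * (0.5 / PRIOR_SD) ** 2

print(round(n_equivalent))
print(se_difference(n_equivalent))
```

The solution is n of about 166,667, which is the approximate figure quoted above, and plugging it back into the standard-error formula returns 0.003.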
The data analysis that started all this was based on a survey of about 3000 people. So it’s hopeless. The prior is much much stronger than the data.
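To see how lopsided this is, here is a sketch of the standard conjugate normal update, combining the N(0,0.003^2) prior with each of the two data summaries above. It treats each estimate as normally distributed around the true difference with the stated standard error, which is an approximation on my part. The function and variable names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Posterior mean and sd for a normal prior and a normal likelihood.

    Precisions (inverse variances) add, and the posterior mean is the
    precision-weighted average of the prior mean and the estimate.
    """
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)

# (estimate, standard error) on the probability scale, from the text above.
summaries = {
    "regression estimate": (0.047, 0.043),
    "published comparison": (0.08, 0.035),
}

results = {}
for label, (estimate, se) in summaries.items():
    post_mean, post_sd = normal_posterior(0.0, 0.003, estimate, se)
    prob_positive = normal_cdf(post_mean / post_sd)
    results[label] = (post_mean, post_sd, prob_positive)
    print(label, round(100 * post_mean, 3), round(100 * post_sd, 3), round(prob_positive, 2))
```

Under these assumptions the posterior mean in both cases is well under a tenth of a percentage point, the posterior standard deviation is barely smaller than the prior's 0.3 percentage points, and the posterior probability of a positive difference is not far above one half.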
The traditional way of presenting such examples in a Bayesian statistics book would be to use a flat prior or weak prior, perhaps trying to demonstrate a lack of sensitivity to the prior. But in this case such a strategy would be a mistake.
And I think lots of studies have this pattern: we’re studying small effects with small samples and using inefficient between-subject designs (not that there are any alternatives in the sex-ratio example).
To get back to the general question about priors: yes, modeling can be difficult. In some settings the data are strong and prior information is weak, and it’s not really worth the effort to think seriously about what external knowledge we have about the system being studied. More often than not, though, I think we do know a lot, and we’re interested in various questions where data are sparse, and I think we should be putting more effort into quantifying our prior distribution.
Upsetting situations (for example, an estimate of 1 +/- 1 that leads to a seemingly too-strong claim of 5:1 odds in favor of a positive effect) are helpful in that they can reveal prior information that we have not yet included in our models.