I’ve had a couple of email conversations in the past couple days on dependence in multivariate prior distributions.
Modeling the degrees of freedom and scale parameters in the t distribution
First, in our Stan group we’ve been discussing the choice of priors for the degrees-of-freedom parameter in the t distribution. I wrote that there’s also the question of parameterization. It does not necessarily make sense to have independent priors on the df and scale parameters. In some sense, the meaning of the scale parameter changes with the df.
Prior dependence between correlation and scale parameters in the scaled inverse-Wishart model
The second case of parameterization in prior distribution arose from an email I received from Chris Chatham pointing me to this exploration by Matt Simpson of the scaled inverse-Wishart prior distribution for hierarchical covariance matrices. Simpson writes:
A popular prior for Σ is the inverse-Wishart distribution [not the same as the scaled-inverse Wishart model; see discussion below], but there are some problems . . . using the standard “noninformative” version of the inverse-Wishart prior, which makes the marginal distribution of the correlations uniform, large standard deviations are associated with large absolute correlations. This isn’t exactly noninformative . . .
I agree with half of the above statement. As I wrote in the book with Jennifer, the inverse-Wishart does not seem flexible enough as a prior, but I think the key problem is not the prior dependence between the correlation and scale parameters but rather the restricted prior range of the scales. If you are roughly noninformative on the correlations, the scales get constrained quite a bit. We wanted a prior that allowed us to be less informative on the scale parameters while still expressing ignorance about the correlations. (Later on, Simpson gets to this point; I just think it’s the most important concern with the inverse-Wishart prior and would mention the problem right away.)
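To see both properties at once, here’s a quick simulation sketch (mine, purely for illustration): draw from the “noninformative” inverse-Wishart IW(ν = d + 1 = 3, I) in two dimensions using scipy, and look at the implied correlations and scales.

```python
import numpy as np
from scipy.stats import invwishart

# Sketch: draw covariance matrices from the "noninformative" inverse-Wishart
# IW(nu = d + 1 = 3, identity scale) in two dimensions.
rng = np.random.default_rng(0)
Sigma = invwishart(df=3, scale=np.eye(2)).rvs(size=20000, random_state=rng)

sd1 = np.sqrt(Sigma[:, 0, 0])
sd2 = np.sqrt(Sigma[:, 1, 1])
rho = Sigma[:, 0, 1] / (sd1 * sd2)

# The marginal distribution of the correlation is uniform on [-1, 1] ...
print("mean |rho|:", np.abs(rho).mean())   # near 0.5, as for a uniform rho

# ... but the scales are constrained: the bulk of the prior mass for the
# standard deviation sits in a band determined by the scale matrix.
print("sd quantiles (5%, 50%, 95%):", np.quantile(sd1, [0.05, 0.5, 0.95]))

# And large scales go with large |rho|: the dependence Simpson describes.
dep = np.corrcoef(np.abs(rho), np.log(sd1))[0, 1]
print("corr(|rho|, log sd):", dep)
```

In the 2-by-2 case with identity scale matrix this dependence can even be derived in closed form: writing Σ = W⁻¹ with W ~ Wishart(3, I), the marginal variance is 1/(w₁₁(1 − r²)), which grows as the correlation moves toward ±1.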
An alternative strategy based on the separation strategy comes from a paper (pdf) by O’Malley and Zaslavsky . . . instead of modeling Ω as a correlation matrix, only constrain it to be positive semi-definite so that Δ and Ω jointly determine the standard deviations, but Ω still determines the correlations alone. . . . the scaled inverse-Wishart is much easier to work with, but theoretically it still allows for some dependence between the correlations and the variances in Σ.
Simpson then performs some simulations from various prior distributions and makes some graphs, after which he concludes:
The [scaled inverse-Wishart] prior . . . exhibits . . . disturbing dependence . . . High variances are associated with more extreme correlations . . . It doesn’t look great for the scaled inverse-Wishart.
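For readers who want to poke at this themselves, here is a rough sketch of the kind of simulation involved (my code, not Simpson’s, and the lognormal scale factors are an illustrative choice): build the scaled inverse-Wishart as Σ = ΔΩΔ and check the correlation-scale dependence directly.

```python
import numpy as np
from scipy.stats import invwishart

# Sketch of the scaled inverse-Wishart: Sigma = Delta @ Omega @ Delta, where
# Omega ~ IW(3, I) is only required to be positive definite and Delta is a
# diagonal matrix of independent scale factors (lognormal here, as an example).
rng = np.random.default_rng(1)
n = 20000
Omega = invwishart(df=3, scale=np.eye(2)).rvs(size=n, random_state=rng)
delta = rng.lognormal(mean=0.0, sigma=1.0, size=(n, 2))

# The correlations of Sigma are exactly the correlations of Omega;
# the standard deviations mix delta with the diagonal of Omega.
rho = Omega[:, 0, 1] / np.sqrt(Omega[:, 0, 0] * Omega[:, 1, 1])
sd1 = delta[:, 0] * np.sqrt(Omega[:, 0, 0])

# The free scale factors loosen the prior on the scales, but the dependence
# inherited from Omega remains: high |rho| still goes with high sd.
dep = np.corrcoef(np.abs(rho), np.log(sd1))[0, 1]
print("corr(|rho|, log sd):", dep)
```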
Simon Barthelmé elaborates:
End users of a statistical method don’t expect the method to mix up two things they think of as independent. We think of correlation and scale as being two different things, and we can come up with reasonable constraints on what they should be. Losing this independence is a fairly significant sacrifice to computational convenience. . . . Frequentists have a point when they criticise Bayesians who argue that priors are fantastic because they allow you to express useful prior knowledge, and then turn around and use a conjugate prior because that’s what’s convenient. . . .
First, let me respond to the point immediately above. I agree that there is typically a tradeoff between accuracy and convenience. I don’t think “frequentists” are so uniformly great about this. Lots of classical statistical analysis is based on assuming the binomial distribution, the Poisson distribution, the normal distribution, the logistic transformation, etc. Assumptions in the likelihood part of the model can have much more effect than assumptions in the prior. Many researchers are aware of this—consider the vast literature on robust statistics—but all the awareness in the world won’t get you around the fact that (a) choices in the data model can make a difference in your inferences, and (b) classical statistics textbooks typically recommend default model choices. So I wouldn’t at all single out Bayesians here. The prior is one more part of the model, yes, but using maximum likelihood (say) is itself a choice (or, one might say, an assumption).
Now, back to the main point. Simpson and Barthelmé are unhappy with the scaled-inverse Wishart prior because they feel that the correlation and standard deviations should be a priori independent. But why? I don’t see that it’s so important to have prior independence of these parameters when ρ is close to ±1. What does it really mean when ρ is close to ±1? In that case, the model isn’t really doing what’s expected anyhow. Prior independence can be convenient but it’s all dependent on parameterization anyway.
To put this more technically, consider the following three parameterizations in the two-dimensional case. For simplicity I’ll set the means to 0 and the variances to be equal:
(1) (x,y) ~ N ((0,0), ((σ^2, ρσ^2), (ρσ^2, σ^2)))
(2) x|y ~ N (ρy, (1-ρ^2)σ^2); y|x ~ N (ρx, (1-ρ^2)σ^2)
(3) x|y ~ N (ρy, σ^2); y|x ~ N (ρx, σ^2).
Models (1) and (2) are identical, and in both cases ρ is the correlation and σ is the marginal standard deviation (just as in Simpson and Barthelmé’s parameterizations).
Model (3) is the same as (1) and (2) but with a transformation. Now σ is the conditional standard deviation; the marginal standard deviation is σ/√(1-ρ^2). If you set independent priors on ρ and σ in model (3), this will induce dependence between ρ and the marginal standard deviation. When ρ is close to 1 in absolute value, the marginal standard deviation will be higher on average. Sound familiar? Indeed, that’s the behavior that Simpson found with his simulations.
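The induced dependence in model (3) takes a few lines to simulate. The particular priors below (flat on ρ, lognormal on the conditional σ) are illustrative choices of mine, not anyone’s recommended defaults:

```python
import numpy as np

# Sketch of model (3): independent priors on rho and the *conditional* sd sigma.
rng = np.random.default_rng(2)
n = 20000
rho = rng.uniform(-0.95, 0.95, size=n)    # flat prior on the correlation
sigma = rng.lognormal(0.0, 0.5, size=n)   # independent prior on conditional sd

# The implied *marginal* sd is sigma / sqrt(1 - rho^2), so even though rho and
# sigma are independent, rho and the marginal sd are not.
marginal_sd = sigma / np.sqrt(1 - rho**2)

dep = np.corrcoef(np.abs(rho), np.log(marginal_sd))[0, 1]
print("corr(|rho|, log marginal sd):", dep)   # clearly positive
```

So a prior that is “independent” in one parameterization is dependent in another; the dependence Simpson observed is exactly of this kind.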
In practice . . .
The above is not to say that the scaled-inverse Wishart model is best, or that prior independence of correlations and conditional variances is better (or worse) in general than prior independence of correlations and marginal variances. I’m just pointing out that the concept of independence in a probability distribution is not as simple as it might seem. A statement such as, “We think of correlation and scale as being two different things” does not imply that we should have prior independence in some particular parameterization.
That said, in Stan we’ve actually been working with a prior distribution in which the correlation and marginal variance parameters are independent. For HMC, the practical constraints on computation with covariance matrices do not involve conjugacy but rather have to do with transforming the correlation matrix into a set of unconstrained variables.
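Here is a minimal sketch of that separation idea in the two-dimensional case (my illustration with toy priors, flat on ρ and lognormal on the marginal sds; it is not Stan’s actual implementation, which works with transformed, unconstrained parameters):

```python
import numpy as np

# Sketch of the separation strategy with *marginal* scales: Sigma is built as
# diag(sd) @ Omega @ diag(sd), where Omega is a bona fide correlation matrix.
# Priors here are illustrative (flat rho, lognormal sds), not Stan's defaults.
rng = np.random.default_rng(3)
n = 20000
rho = rng.uniform(-1, 1, size=n)
sd = rng.lognormal(0.0, 1.0, size=(n, 2))

# In two dimensions the correlation matrix is [[1, rho], [rho, 1]], so the
# entries of Sigma follow directly:
var1 = sd[:, 0] ** 2
cov12 = rho * sd[:, 0] * sd[:, 1]

# By construction the marginal sds are the sd draws themselves, independent of
# rho: no correlation-scale dependence, in contrast to the (scaled) IW.
dep = np.corrcoef(np.abs(rho), np.log(sd[:, 0]))[0, 1]
print("corr(|rho|, log sd):", dep)   # about 0
```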
Models vs. priors
The other thing to remember is that a prior distribution exists in relation to the likelihood. Consider the simple case of a two-dimensional covariance matrix with a single correlation parameter ρ. The uniform prior distribution on [-1,1] seems reasonable. But is this what we really expect to see? Probably not. We try to parameterize our problems so that parameters are close to independent. I’m guessing that values of ρ near 0 are much more probable, a priori, than values near ±1. But maybe a uniform prior will lead to reasonable default inferences.
P.S. When I write these long discussions, I always wonder whether to put them into my books. On one hand, a clear understanding of transformations can be important when working with prior distributions. On the other, if I were to include all the conversations of this sort, it would double the length of our books—and one thing non-Burdzy readers like about these books is their practicality, that we get right to the point without getting lost in argument.