## Multimodality in hierarchical models

Jim Hodges posted a note to the Bugs mailing list that I thought could be of more general interest:

Is multi-modality a common experience? I [Hodges] think the answer is “nobody knows in any generality”. Here are some examples of bimodality that certainly do *not* involve the kind of labeling problems that arise in mixture models.

The only systematic study of multimodality I know of is

Liu J, Hodges JS (2003). Posterior bimodality in the balanced one-way random effects model. *Journal of the Royal Statistical Society, Series B*, **65**:247–255.

The surprise of this paper is that in the simplest possible hierarchical model (analyzed using the standard inverse-gamma priors for the two variances), bimodality occurs quite readily, although it is much less common to have two modes that are big enough so that you’d actually get a noticeable fraction of MCMC draws from both of them. Because the restricted likelihood (= the marginal posterior for the two variances, if you’ve put flat priors on them) is necessarily unimodal in this model, the bimodality must arise from conflict between the prior and likelihood, but as this paper shows, the conflict that produces bimodality is extremely complex.
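To make this concrete, here is a minimal sketch (not the paper's code) that evaluates the marginal posterior of the two variances on a grid and counts interior local maxima. It assumes the balanced model y_ij = mu + theta_i + e_ij with q groups of n observations, theta_i ~ N(0, tau^2), e_ij ~ N(0, sigma^2), a flat prior on mu, and inverse-gamma(a, b) priors on the variances. The names `sse` and `ssb` (within-group and between-group sums of squares), the hyperparameter values, and the summary statistics are all hypothetical, chosen only for illustration. Whether a second mode shows up depends on the data and the prior, and modes are counted on the variance scale, which is also an assumption.

```python
import numpy as np

def log_posterior(sig2, tau2, sse, ssb, q, n, a_e, b_e, a_t, b_t):
    """Unnormalized log marginal posterior of (sigma^2, tau^2).

    mu is integrated out under a flat prior; the variances get
    inverse-gamma(a, b) priors, p(v) proportional to v^-(a+1) * exp(-b / v).
    """
    lam = sig2 + n * tau2
    # Restricted log-likelihood for the balanced one-way layout
    log_rl = (-0.5 * q * (n - 1) * np.log(sig2) - sse / (2 * sig2)
              - 0.5 * (q - 1) * np.log(lam) - ssb / (2 * lam))
    log_prior = (-(a_e + 1) * np.log(sig2) - b_e / sig2
                 - (a_t + 1) * np.log(tau2) - b_t / tau2)
    return log_rl + log_prior

def count_local_maxima(z):
    """Count interior grid cells strictly greater than all 8 neighbours."""
    c = z[1:-1, 1:-1]
    is_max = np.ones_like(c, dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            nb = z[1 + di : z.shape[0] - 1 + di, 1 + dj : z.shape[1] - 1 + dj]
            is_max &= c > nb
    return int(is_max.sum())

# Hypothetical summary statistics and hyperparameters, for illustration only
q, n = 5, 4
sse, ssb = 12.0, 3.0
sig2 = np.linspace(0.05, 5.0, 200)
tau2 = np.linspace(0.05, 5.0, 200)
S, T = np.meshgrid(sig2, tau2, indexing="ij")
lp = log_posterior(S, T, sse, ssb, q, n, 0.001, 0.001, 0.001, 0.001)
print("interior local maxima on the grid:", count_local_maxima(lp))
```

A grid check like this only sees modes inside the grid, so a mode sitting at or beyond the grid edge (for example, tau^2 near zero) will be missed. It is a quick diagnostic, not a substitute for the analysis in the paper.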

See also Jon Wakefield’s discussion of this paper:

Hodges JS (1998). Some algebra and geometry for hierarchical models, applied to diagnostics (with discussion). *Journal of the Royal Statistical Society, Series B*, **60**:497–536.

Here a simple, harmless-looking two-level model with normal errors and random effect had a bimodal posterior. I don’t know what features of the data, model, and priors produced this.

My former student Brian Reich also got bimodal posteriors fitting the models and data described in this paper:

Reich BJ, Hodges JS, Carlin BP (2007). Spatial analysis of periodontal data using conditionally autoregressive priors having two types of neighbor relations. *Journal of the American Statistical Association*, **102**:44–55.

However, those fits don’t appear in this paper (long story).

### 4 Comments

1. K? O'Rourke says:

Ran into multimodal likelihoods in my thesis
http://andrewgelman.com/movabletype/mlm/ThesisReprint.pdf

And this was one of the few papers I found that dealt with it -

Vangel, M. G., and Rukhin, A. L. Maximum likelihood analysis for heteroscedastic one-way random effects ANOVA in interlaboratory studies. Biometrics 55, 1 (1999), 129-136.

It will be missed if the within-lab/study variances are estimated and then treated as known.

> I don’t know what features of the data, model, and priors produced this.
Not sure what Jim means by this – surely plotting the right marginal (maybe dim > 1) would display what is leading to the multimodality. (By features he probably means conditions.)

Thanks for posting.

2. Iain says:

What’s the force law of the sun (mass and gravitational exponent) given a snapshot of the positions and velocities of the major planets around it?
The answer to that toy problem is beautifully bimodal: http://iopscience.iop.org/0004-637X/711/2/1157/ or http://arxiv.org/abs/0903.5308 (Also, no sensible inferences are made without a hierarchical model.)

3. TGGP says:

I don’t know if anyone still cares about Satoshi Kanazawa here, but his university has banned him from publishing in any non-peer-reviewed venue for a year.

4. Jason Eisner says:

Multimodality is certainly a “common experience” — the usual experience — in models with latent variables. For example, if you try using EM to find the maximum-likelihood context-free grammar for natural language text (marginalizing over the unknown derivation tree that produced the text), trying 100 different random starting points will unfortunately get you 100 different local maxima.

Noah Smith’s dissertation (2006) has plenty of pictures illustrating the problem. These aren’t mere symmetries (what Hodges calls “the kind of labeling problems that arise in mixture models”), since they all have different values of the maximization objective.

The same issues arise when EM or MAP-EM is applied in simpler settings such as mixture models, HMMs, and MRFs.

Maybe I misunderstood? (The question seemed odd, since non-convex minimization must be the most pervasive computational obstacle in machine learning, rivaled only by intractable summation.)
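The random-restart experiment described above can be sketched on a much smaller problem than grammar induction. The following is a minimal illustration, not anything from the dissertation: plain EM for a one-dimensional Gaussian mixture, run from several random starts, with the final log-likelihoods collected for comparison. The function name `em_gmm_1d`, the simulated data, and the variance floor are all assumptions made for the example. On easy data like this most starts may well agree; the point is only the procedure for checking.

```python
import numpy as np

def em_gmm_1d(x, k, rng, n_iter=200):
    """Run EM for a 1-D Gaussian mixture from a random start.

    Returns the log-likelihood evaluated at the start of each iteration.
    """
    mu = rng.choice(x, size=k, replace=False)   # random data points as initial means
    var = np.full(k, x.var())                   # broad initial variances
    w = np.full(k, 1.0 / k)                     # equal initial weights
    trace = []
    for _ in range(n_iter):
        # E-step: log of w_k * N(x_i | mu_k, var_k), normalized stably per row
        log_p = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        trace.append(float(log_norm.sum()))
        r = np.exp(log_p - log_norm)            # responsibilities
        # M-step: weighted updates of weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-6)             # floor (an assumption) to avoid collapse
    return trace

# Simulated data, for illustration only
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 100),
                    rng.normal(0, 1, 100),
                    rng.normal(5, 1.5, 100)])

finals = [em_gmm_1d(x, 3, rng)[-1] for _ in range(20)]
print(sorted({round(f, 3) for f in finals}))    # distinct converged log-likelihoods
```

If the printed set has more than one value, the runs ended at different objective values and so cannot be relabelings of the same solution, which is the distinction drawn above between genuine local maxima and label-switching symmetries.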