Larry Wasserman refers to finite mixture models as “beasts” and
jokes that they “should be avoided at all costs.”
I’ve thought a lot about mixture models ever since using them in an analysis of voting patterns that was published in 1990. First off, I’d like to say that our model was useful, so I’d prefer not to pay the cost of avoiding it. (For a quick description of our mixture model and its context, see pp. 379-380 of my article in the Jim Berger volume.) Actually, our case was particularly difficult because we were not even fitting a mixture model to data; we were fitting it to latent data and using the model to perform partial pooling. My difficulties in trying to fit this model inspired our discussion of mixture models in Bayesian Data Analysis (page 109 in the second edition, in the section on “Counterexamples to the theorems”).
I agree with Larry that if you’re fitting a mixture model, it’s good to be aware of the problems that arise if you try to estimate its parameters using maximum likelihood or Bayes with flat priors.
So what did we do? We used a weakly informative prior distribution. I think this is the right thing to do. The trouble with mixture models, in some sense, is that the natural mathematical formulation is broader than what we typically want to fit. Some prior constraints, particularly on the ratio of the mixture variances, control the estimates and also make sense: in any application I’ve seen, I have some idea of a reasonable range for these variances.
What’s confusing, I think, is that we have developed some complacent intuitions based on various simple models with which we are familiar. If you fit a normal or binomial or Poisson model to direct data, you’ll usually get a simple, reasonable answer (except for some known tough cases, such as estimating a rate when the number of events in the data is zero). So we start to assume that this is the way it should always be: that we can write down a mathematically convenient class of models and go fit it to data. In general, though, this won’t work. We’ve seen this for logistic regression with complete separation, and it happens for mixture models too. The class of mixture models is general enough that we always have the equivalent of complete separation, and we need to constrain the parameter space to ensure reasonable estimates.
In summary, yes, a mixture model can be a “beast” (as Larry puts it), but this beast can be tamed with a good prior distribution. More generally, I think prior distributions for mixture models can be expressed hierarchically, which connects my sort of old-fashioned models to more advanced mixture models that have potentially infinite dimension.