## How I think about mixture models

Larry Wasserman refers to finite mixture models as “beasts” and jokes that they “should be avoided at all costs.”

I’ve thought a lot about mixture models, ever since using them in an analysis of voting patterns that was published in 1990. First off, I’d like to say that our model was useful, so I’d prefer not to pay the cost of avoiding it. (For a quick description of our mixture model and its context, see pp. 379-380 of my article in the Jim Berger volume.) Actually, our case was particularly difficult because we were not even fitting a mixture model to data; we were fitting it to latent data and using the model to perform partial pooling. My difficulties in trying to fit this model inspired our discussion of mixture models in Bayesian Data Analysis (page 109 in the second edition, in the section on “Counterexamples to the theorems”).

I agree with Larry that if you’re fitting a mixture model, it’s good to be aware of the problems that arise if you try to estimate its parameters using maximum likelihood or Bayes with flat priors.

So what did we do? We used a weakly informative prior distribution. I think this is the right thing to do. The trouble with mixture models, in some sense, is that the natural mathematical formulation is broader than what we typically want to fit. Some prior constraints, particularly on the ratio of the mixture variances, control the estimates and also make sense: in any application I’ve seen, I have some idea about a reasonable range of these variances.
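To make the problem with maximum likelihood concrete, here is a minimal sketch, not the model from the 1990 analysis. It assumes a two-component normal mixture, made-up data, and a hypothetical penalty that treats the log of the ratio of the two standard deviations as normal with mean 0 and scale 1. Centering one component on a single data point and shrinking its standard deviation makes the log-likelihood grow without bound, while the penalized version stays bounded.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(data, weight, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of a two-component normal mixture."""
    return sum(
        math.log(
            weight * normal_pdf(y, mu1, sigma1)
            + (1.0 - weight) * normal_pdf(y, mu2, sigma2)
        )
        for y in data
    )


def log_ratio_penalty(sigma1, sigma2, scale=1.0):
    """Hypothetical weakly informative term: log(sigma1 / sigma2) ~ N(0, scale^2).

    This is an illustrative assumption, not the prior used in the post.
    """
    r = math.log(sigma1 / sigma2)
    return -0.5 * (r / scale) ** 2 - math.log(scale * math.sqrt(2.0 * math.pi))


data = [-1.3, -0.4, 0.2, 0.9, 1.7]  # made-up observations

# Put component 1 exactly on the first observation and let its scale collapse.
for sigma1 in (1.0, 1e-2, 1e-4, 1e-6):
    ll = mixture_loglik(data, 0.5, data[0], sigma1, 0.0, 1.0)
    penalized = ll + log_ratio_penalty(sigma1, 1.0)
    print(f"sigma1={sigma1:g}  loglik={ll:.2f}  penalized={penalized:.2f}")
```

The penalty is quadratic in the log ratio, so it eventually outweighs the logarithmic gain from collapsing one component onto a point.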

What’s confusing, I think, is that we have developed some complacent intuitions based on various simple models with which we are familiar. If you fit a normal or binomial or Poisson model with direct data, you’ll usually get a simple reasonable answer (except for some known tough cases such as estimating a rate when the number of events in the data is zero). So we just start to assume that this is the way it should always be, that we can write down a mathematically convenient class of models and fit it to data. In general, though, this won’t work. We’ve seen this for logistic regression with complete separation, and it happens for mixture models too. The class of mixture models is general enough that we always have the equivalent of complete separation, and we need to constrain the parameter space to ensure reasonable estimates.

In summary, yes, a mixture model can be a “beast” (as Larry puts it), but this beast can be tamed with a good prior distribution. More generally, I think prior distributions for mixture models can be expressed hierarchically, which connects my sort of old-fashioned models to more advanced mixture models that have potentially infinite dimension.

1. I agree with this. Certainly, if you try to do a mixture model with naive priors, say a broad prior on the central position of every mixture component, then you can run into trouble (computationally as well as conceptually). But the key is to go hierarchical like you suggest – tie the mixture components together so they are “close” in some way.

I certainly disagree with the idea of not using mixture models. IMO they’re probably the closest to representing our actual prior beliefs in many many problems. Single component parametric models can be too simple, but the free-form models are overkill a lot of the time. Mixture models are in the happy zone in between.

2. Larry Wasserman says:

I hope it was clear that my comment that
“they should be avoided at all costs” was a joke.

–Larry

• K? O'Rourke says:

Always a risk that jokes can be taken as serious – but your post was very clear and concise.

This switch from singleton points in a parameter space generating random observations to random draws of unobserved points from a (hyper) parameter space generating random observations has always seemed to me, conceptually, to be a 1000-fold increase in complexity (in addition to the technical challenges you point out).

(Maybe why there is almost nothing about mixture models/random effects/multi-level models/hierarchical models/latent data models/etc. in Fisher’s writings. But I believe he somehow knew to avoid Tequila.)

• jimmy says:

tone can be hard to convey. prefacing that my knowledge of stats is limited, i could not tell that it was a joke.

3. Corey says:

Warning: the following comment is mostly pointless — it just describes some thoughts that have been rattling around my noggin recently and seem apposite to the OP.

There’s some tension in AG’s position on probabilities for elements of discrete sets of models (as in Bayesian model selection) and his position on the usefulness of (finite) mixture models. To fit a finite mixture model, at some stage one generally has to compute the posterior probability that a given datum belongs to a particular mixture component. This computation is a special case of the one that gives probabilities for Bayesian model selection.

The tension isn’t an outright contradiction; it’s not even close. AG’s position on discrete model selection is about inference, not about computation. And AG has already noted that there is an inconsistency in his position because there isn’t a bright line between probabilities for discrete models and discrete approximations of probability densities for continuous parameters. (My Google-fu is failing me; I can’t find the post right now.)
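The per-datum computation mentioned above can be sketched as follows. This is only an illustration under assumed normal components with known parameters; the function name `responsibilities` and the numbers are hypothetical. Structurally it is Bayes' rule over a discrete set, with mixture weights playing the role of prior model probabilities and component densities playing the role of likelihoods.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def responsibilities(y, weights, means, sigmas):
    """Posterior probability that datum y came from each mixture component.

    P(component k | y) is proportional to weight_k * N(y | mean_k, sigma_k^2),
    the same form as posterior model probabilities over a discrete model set.
    """
    joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]


# Made-up two-component example: a datum at 0.8 is more likely from the
# component centered at 1 than from the one centered at -1.
probs = responsibilities(0.8, weights=[0.5, 0.5], means=[-1.0, 1.0], sigmas=[1.0, 1.0])
print(probs)
```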

• Andrew says:

Corey:

Yup. See, for example, the discussion on p.76 of this article:

I admit, however, that there is a philosophical incoherence in my approach! Consider a simple model with independent data y1, y2, ..., y5 ∼ N(θ,σ^2), with a prior distribution θ ∼ N(0,10^2) and σ known and taking on some value of approximately 10. Inference about θ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewness.

But now suppose we consider θ as a random variable defined on the integers. Thus θ=0 or 1 or 2 or 3 or … or -1 or -2 or -3 or …, and with a discrete prior distribution formed by the discrete approximation to the N(0,10^2) distribution. In practice, with the sample size and parameters as defined above, the inferences are essentially unchanged from the continuous case, as we have defined θ on a suitably tight grid.

But from the philosophical position argued in the present article, the discrete model is completely different: I have already written that I do not like to choose or average over a discrete set of models. This is a silly example but it illustrates a hole in my philosophical foundations: when am I allowed to do normal Bayesian inference about a parameter θ in a model, and when do I consider θ to be indexing a class of models, in which case I consider posterior inference about θ to be an illegitimate bit of induction? I understand the distinction in extreme cases—they correspond to the difference between normal science and potential scientific revolutions—but the demarcation does not cleanly align with whether a model is discrete or continuous.
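The claim that the integer-grid inference is essentially unchanged can be checked numerically. The sketch below assumes σ = 10 exactly, uses five made-up observations, and truncates the integer grid at ±100 (a hypothetical cutoff far into the tails). It compares the conjugate normal posterior for continuous θ with the posterior over integer values of θ.

```python
import math


def continuous_posterior(data, prior_sd=10.0, sigma=10.0):
    """Conjugate normal posterior (mean, variance) for theta with known sigma."""
    n = len(data)
    ybar = sum(data) / n
    precision = 1.0 / prior_sd**2 + n / sigma**2
    mean = (n * ybar / sigma**2) / precision
    return mean, 1.0 / precision


def integer_grid_posterior(data, prior_sd=10.0, sigma=10.0, half_width=100):
    """Posterior (mean, variance) when theta is restricted to the integers.

    The prior weights are proportional to the N(0, prior_sd^2) density at each
    integer; half_width is an assumed truncation point for the grid.
    """
    grid = range(-half_width, half_width + 1)
    log_w = [
        -0.5 * (t / prior_sd) ** 2
        - 0.5 * sum(((y - t) / sigma) ** 2 for y in data)
        for t in grid
    ]
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / total
    return mean, var


data = [12.0, -3.0, 7.5, 1.0, 4.5]  # made-up observations
print("continuous:", continuous_posterior(data))
print("integer grid:", integer_grid_posterior(data))
```

With a posterior standard deviation of about 4, a grid spacing of 1 is fine enough that the two sets of summaries agree to many decimal places.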

4. Enes says:

An alternative way of fitting mixture models in the Bayesian framework is with the Minimum Message Length (MML) principle. The basic idea is that the ‘best’ model for the data is the one that results in the best compression of a two-part message comprising a model and the data. A short summary of the MML principle and the application of MML to mixture modelling can be found in [1]. For those wanting to learn more about MML, I would highly recommend the book [2].

Currently, there exists an MML software package (called SNOB) for fitting mixture models but it is not very user friendly. We are working on a new software package for MML clustering that will run in MATLAB and this new package will be made freely available for download in the next few weeks.

Cheers,
Enes

References:
[1] Wallace, C. S. & Dowe, D. L. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, Vol. 10, pp. 73-83, 2000.

[2] Wallace, C. S. Statistical and Inductive Inference by Minimum Message Length, Springer, 2005.