Ben Bolker wrote:

It would be fantastic if you could suggest one or two starting points for the idea that/explanation why BIC should naturally fail to identify the number of clusters correctly in the cluster-analysis context.

Bob Carpenter elaborated:

Ben is finding that using BIC to select number of mixture components is selecting too many components given the biological knowledge of what’s going on. These seem to be reasonable mixture models like HMMs for bison movement with states corresponding to transiting and foraging and resting, with the data (distance moved and turning angle) being clearly multimodal.

First (this is more to Christian): Is this to be expected if the models are misspecified and the data’s relatively small?

Second (this one more to Andrew): What do you recommend doing in terms of modeling? The ecologists are already on your page w.r.t. adding predictors (climate, time of day or year) and general hierarchical models over individuals in a population.

Number of components isn’t something we can put a prior on in Stan other than by having something like many mixture components with asymmetric priors or by faking up a Dirichlet process a la some of the BUGS examples. I’ve seen some work on mixtures of mixtures which looks cool, and gets to Andrew’s model expansion inclinations, but it’d be highly compute intensive.

X replied:

Gilles Celeux has been working for many years on the comparison between AIC, BIC and other-ICs for mixtures and other latent class models. Here is one talk he gave on the topic, with the message that BIC works reasonably well for density estimation but not for estimating the number of clusters. Here is also his most popular paper on such information criteria, including ICL.

I am rather agnostic on the use of such information criteria as they fail to account for prior information or prior opinion on what’s making two components distinct rather than identical. In that sense I feel like the problem is non-identifiable if components are not distinguishable in some respect. And as a density estimation problem, the main drawback in having many components is an increased variability. This is not a Bayesian/frequentist debate, unless prior inputs can make components make sense. And prior modelling fights against over-fitting by picking priors on the weights near zero (in the Rousseau-Mengersen 2012 sense).

And then I wrote:

I think BIC is fundamentally different from AIC, WAIC, LOO, etc, in that all those other things are estimates of out-of-sample prediction error, while BIC is some weird thing that under certain ridiculous special cases corresponds to an approximation to the log marginal probability.
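For reference, here are the standard textbook definitions behind that contrast. The notation is introduced here and does not appear in the emails: \hat\theta is the maximum likelihood estimate, k the number of free parameters, and n the number of data points.

```latex
\begin{aligned}
\mathrm{AIC} &= -2 \log p(y \mid \hat\theta) + 2k, \\
\mathrm{BIC} &= -2 \log p(y \mid \hat\theta) + k \log n, \\
\log p(y) &\approx \log p(y \mid \hat\theta) - \tfrac{k}{2} \log n = -\tfrac{1}{2}\,\mathrm{BIC}.
\end{aligned}
```

The last line is a large-sample approximation that depends on regularity conditions on the model. Those conditions are generally understood to break down for mixtures with superfluous components, which is one way to read the "special cases" remark.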

Just to continue along these lines: I think it makes more sense to speak of “choosing” the number of clusters or “setting” the number of clusters, not “estimating” the number of clusters, because the number of clusters is not in general a Platonic parameter that it would make sense to speak of estimating. I think this comment is similar to what X is saying, just in slightly different language (although both in English, pas en français).

To put it another way, what does it mean to say “too many components given the biological knowledge of what’s going on”? This depends on how “component” is defined. I don’t mean this as a picky comment: I think this is fundamental to the question. To move the discussion to an area I know more about: suppose we want to characterize voters. We could have 4 categories: Dem, Rep, Ind, Other. We could break this down more, place voters on multiple dimensions, maybe identify 12 or 15 different sorts of voters. Ultimately, though, we’re each individuals, so we could define 300 million clusters, one for each American. It seems to me that the statement “too many components” has to be defined with respect to what you will be doing with the components. To put it another way: what’s the cost to including “too many” components? Is the cost that estimates will be too noisy? If so, there is some interaction between #components and the prior being used on the parameters: one might have a prior that works well for 4 or 5 components but not so well when there are 20 or 25 components.

Actually, I can see some merit to the argument that there can just about never be more than 4 or 5 clusters, ever. My argument goes like this: if you’re talking “clusters” you’re talking about a fundamentally discrete process. But once you have more than 4 or 5, you can’t really have real discreteness; instead things slide into a continuous model.

OK, back to the practical question. Here I like the idea of using LOO (or WAIC) in that I understand what it’s doing: it’s an estimate of out-of-sample prediction error, and I can take that for what it is.

To get to the modeling question: if Ben is comfortable with a model with, say, between 3 and 6 clusters, then I think he should just fit a model with 6 clusters. Just include all 6 and let some of them be superfluous if that’s what the model and data want. One way to keep the fitting under control is to regularize a bit by putting strong priors on the mixture weights, so that mixture components 1, 2, etc, are large in expectation, and later components are smaller. You can do this with an informative Dirichlet prior on the vector of lambda parameters. I’ve never tried this but it seems to me like it could work.
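A minimal sketch of what such an asymmetric Dirichlet prior might look like, using only the Python standard library. The concentration values (starting at 8 and halving for each later component) and the helper names are illustrative assumptions, not something proposed in the emails; the point is only that decreasing concentrations give decreasing expected weights.

```python
import random

def decreasing_alpha(k, first=8.0, ratio=0.5):
    """Hypothetical choice: concentration parameters that shrink geometrically."""
    return [first * ratio ** j for j in range(k)]

def dirichlet_mean(alpha):
    """Prior expectation of each mixture weight: alpha_j / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

def sample_dirichlet(alpha, rng):
    """Draw one weight vector by normalizing independent Gamma(alpha_j, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

alpha = decreasing_alpha(6)        # [8.0, 4.0, 2.0, 1.0, 0.5, 0.25]
prior_mean = dirichlet_mean(alpha)  # expected weights, largest first

# Monte Carlo check that prior draws match the analytic expectation.
rng = random.Random(1)
draws = [sample_dirichlet(alpha, rng) for _ in range(2000)]
mc_mean = [sum(d[j] for d in draws) / len(draws) for j in range(len(alpha))]

for j, (m, mc) in enumerate(zip(prior_mean, mc_mean), start=1):
    print(f"component {j}: prior mean {m:.3f}, simulated mean {mc:.3f}")
```

In a Stan program the analogous statement would be a `dirichlet` prior on the simplex of weights with a fixed vector of concentrations. How strong to make the prior (the overall sum of the concentrations) is a modeling decision that this sketch does not settle.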

Also, and I assume this is being done already but I’ll mention it just in case, don’t forget to put informative priors on the parameters in each mixture component. I don’t know the details of this particular model, but, just for example, if we are fitting a mixture of normals, it’s important to constrain the variances of the normals because the model will blow up with infinite likelihood at points where any variance equals zero. The constraint can be “soft,” for example lognormal priors on scale parameters, or a hierarchical prior on the scale parameters with a proper prior on how much they vary. The same principle applies to other sorts of mixture models.
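The blow-up is easy to see numerically. The sketch below, with made-up data and hypothetical function names, fixes a two-component normal mixture in which one component is centered exactly on a single data point, then shrinks that component's scale. The log likelihood keeps growing without bound as the scale goes to zero.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and scale sigma."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def mixture_loglik(data, weights, mus, sigmas):
    """Log likelihood of a normal mixture, using log-sum-exp for stability."""
    total = 0.0
    for x in data:
        terms = [math.log(w) + normal_logpdf(x, m, s)
                 for w, m, s in zip(weights, mus, sigmas)]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

# Illustrative data, invented for this example.
data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]

def loglik_with_spike(sigma_spike):
    """Component 1 is broad; component 2 sits exactly on data[0]."""
    return mixture_loglik(data, [0.5, 0.5], [0.0, data[0]], [1.0, sigma_spike])

spike_scales = (1.0, 1e-2, 1e-4, 1e-8)
spike_logliks = [loglik_with_spike(s) for s in spike_scales]

for s, ll in zip(spike_scales, spike_logliks):
    print(f"spike scale {s:g}: log likelihood {ll:.2f}")
```

The broad component keeps every other data point at a finite log density, so nothing offsets the gain from the collapsing component. A proper prior on the scales that puts vanishing density near zero, such as the lognormal mentioned above, adds a penalty that removes this degenerate mode from the posterior.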

And Aki added:

If the whole distribution is multimodal it is easier to identify the number of modes and say that these correspond to clusters. Even if we have “true” clusters, but they are overlapping so that there are no separate modes, the number of clusters is not well identified *unless* we have a lot of information about the shape of each cluster. Example: using a mixture of Gaussians to fit Student t data -> when n->infty, the number of components (clusters) goes to infty. Depending on the amount of model misspecification and separability of clusters we may not be able to identify the number of clusters no matter which criterion we use. In simulated examples with a true small number of clusters, use of criteria which favor a small number of clusters is likely to perform well (LOO (or WAIC) is likely to favor more clusters than marginal likelihood, BIC or WBIC). In Andrew’s voters example, and in many medical examples I’ve seen, there are no clear clusters as the variation between individuals is mostly continuous or discrete in high dimensions.