Yes, despite what you may have heard, you can easily fit hierarchical mixture models in Stan

There was some confusion on the Stan list that I wanted to clear up, having to do with fitting mixture models. Someone quoted this from John Kruschke’s book, Doing Bayesian Data Analysis:

The lack of discrete parameters in Stan means that we cannot do model comparison as a hierarchical model with an indexical parameter at the top level. There might be ways to work around this restriction by using clever programming contrivances, but presently there is nothing as straight forward as the model specification in JAGS.

Actually it’s easy to write hierarchical mixture models in Stan. Here’s the code (supplied by Bob Carpenter) for the specific model from Kruschke:

```data {
int N;
int y[N];
}
parameters {
real<lower=0, upper=1> theta;
}
model {
y ~ bernoulli(theta);
target += log_mix(0.5, beta_lpdf(theta | 2.75, 8.25),
beta_lpdf(theta | 8.25, 2.75));
}
```

Here’s the fit in R:

```N = 9
z = 6
y = c(rep(0, N - z), rep(1, z))
fit = stan("kruschke-db.stan")
print(fit)

Inference for Stan model: kruschke-db.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

mean se_mean   sd  2.5%  25%   50%   75% 97.5% n_eff Rhat
theta  0.67    0.00 0.14  0.32  0.6  0.70  0.77  0.87  1409    1
lp__  -7.69    0.02 0.77 -9.69 -8.0 -7.38 -7.14 -7.09  2089    1
```

By comparison, here’s Krushke’s model in Jags:

```model {
for (i in 1:N) {
y[i] ~ dbern(theta)
}
theta ~ dbeta(omega[m]*(kappa-2)+1 , (1-omega[m])*(kappa-2)+1)
omega[1] <- .25
omega[2] <- .75
kappa <- 12
m ~ dcat(mPriorProb[])
mPriorProb[1] <- .5
mPriorProb[2] <- .5
}
```

The Stan discussion thread is here.

Speaking more generally on mixture models, Bob adds:

There are huge advantages in efficiency from marginalization, and in some cases it allows us to do inference in models where discrete sampling gets stuck. In other cases, discrete sampling is challenging if not impossible, as in Ising models or the combinatorially similar variable selection models.

It's come up with mixtures, the Cormack-Jolly-Seber model in ecology, and change-point models, all of which are explained in the manual. All of Kery and Straub's book has been translated and also Wagenmakers and Lee, both of which contain quite a few BUGS/JAGS models with discrete parameters.

Bob also points out that what's "very easy" for one person is a "contrivance" for another. And this goes in both directions. If you're already familiar with some software, whether it's Excel or Emacs or Mathematica or whatever, and it does the job for you, then you might as well continue using it. But if the software doesn't do everything you need, then it's time to learn something new.

The point of this post is that if Stan is working for you, or it could be working for you, but you heard that "we cannot do model comparison as a hierarchical model with an indexical parameter at the top level" then, no, this is not actually something to worry about. You actually can fit these models in Stan, it's completely straightforward. If you don't want to use Stan, that's fine too. We wrote Stan because existing programs couldn't fit the models we want to fit. Also we are planning to write a mixture model function in Stan that does exactly what's shown in the example above but doesn't have the "lpdf" things in case that upsets people.

(28 October: Stan model code fixed by Aki to show the constraint for theta)

1. David Rohde says:

Is there any plan to allow people to mix the Stan/Nuts sampler with their own MCMC kernels. For example combine Gibbs steps on discrete variables with Nuts on continuous?

• Andrew says:

David:

We’ve talked about such things, and with enough effort this sort of thing could be programmed and put into Stan. One reason we haven’t done it yet is that most of the applications for this idea can already be solved with mixture models. Indeed, the mixture model implementation of the model above should be much faster than any discrete latent-variable implementation. So just on computational grounds alone we prefer how Stan does this.

• David Rohde says:

Up front, I want to say how impressive the way this blog and Stan reach beyond narrow statistical/machine learning circles and is really changing the world. The quality of Stan really lifts the bar in terms of academic software, and I hope will inspire us all to lift our game there!

For these reasons though I would like to put my case that this type of capability would be really great to have. The model I was thinking of is this one: http://people.ee.duke.edu/~lcarin/PoissonProcess.pdf

which is trans-dimensional and certainly can’t be implemented by analytically marginalising the latent variables. (FWIW I expect the sampler they propose would be a little problematic in general due to the unbounded size of the matrix inversions that would occasionally come up..)

I imagine this as being an extension to (say) rstan where fit is replaced with a function that runs a single sample and returns the full state of the chain and additional kernels can be applied that maintain ergodicity.

I see your team has a NIPS paper this year on adding variational Bayes to Stan. The idea that this could be a template for future algorithm researchers is I think exciting. I hope this development continues both within and beyond your team. Giving algorithm developers access to both a modelling language and vector calculus routines has a lot of potential! Using the modelling language could also ease the transition from new algorithm development to actual use in the field by quite a lot. I appreciate these are tangential to your main goals though..

• Dan Simpson says:

Is that really trans-dimensional? My memory of this was it was a Cox process that replaces the more common (and I would argue interpretable) Log-Gaussian Cox process with a sigmoidal link function. An LGCP can be implemented as something like a Poisson-GP regression (modulo some small bias from approximating the likelihood).

A trans-dimensional sampler seems a large price to pay for replacing a pretty good method with something asymptotically exact…

• David Rohde says:

This model avoids the discretisation implicit in the method you suggest for LGCP, which is its important innovation. The sampler they propose includes an unbounded amount of latent variables – (because it is unbounded I see it as trans-dimensional – just to avoid semantic confusion). The benefit of having a sigmoidal link function is that it allows a rejection sampling method to sample from the model and this gives a latent variable representation of what would otherwise be a doubly intractable model.

Stan clearly can’t do this, maybe it can handle an approximation where a large fixed amount of latent variables is included.. (It can do a lot, but this is outside the scope).

• David Rohde says:

Maybe by asymptotically exact you mean arbitrarily fine spatial resolution?

My interest in this model is in part to compare it with discretised models including LGCP to see if there is any cost to be paid for the discretisation. I don’t believe the model/sampler is practical for analysis of non-toy data sets.

• Dan Simpsom says:

Asymptotically exact = computations will be dominated by MC error due to a slow mixing chain.

This error will usually be bigger than the discretisation bias (the MC errror for the discretised model negligible for the discretised model because you can get very low error approximate posteriors quickly using INLA or a good MCMC package, although not currently Stan as it doesn’t have a specialised version of a normal for a toeplitz covariance matrix, which is needed for efficiency).

It would probably be possible to improve the mixing by looking closely at what Gareth Roberts etc did with exact simulation from SDEs. In that case, breaking up the domain into blocks and building an importance sampler on each block and the chaining them together with SMC works better. This is because if your intensity function has a large range over your domain, you will need or reject a really large number of points)

There’s also the standard problem that Cox processes cluster, so the points at which the GP needs to be evaluated will often be close together, which will lead to bad conditioning of linear solves wither the covariance matrix and, in the case of the squared exponential xovariance function), exponential loss of precision when evaluation the log posterior in floating point arithmetic. This doesn’t happen the discretised systems (or it happens to a much smaller extent, because you don’t have explicit inverses and the GP is evaluated at well-separated points ).

So my feelings about this computational approach to this model is that it’s looking for love in all the wrong places. (And also that the paper should’ve compared the discrete and indiscreet versions of SGCPs rather than SGCP vs binned LGCP because the latter comparison doesn’t tell us anything about what is lost by binning)

As for binning vs not binning, for a smooth enough GP the error in the nxn grid approximation of the integral term in the log likelihood is O(n^{-2}) as this is just the midpoint rule. The error in evaluation the sum of the intensities by binning is O(n^{-1}), so you can make a better discretisation by paying better attention to that term. You can also do basis function-based approximations more generally than just pieceise constants on a latttice.

• David Rohde says:

Thanks Dan. It is interesting to talk to someone with such deep knowledge of spatial statistics. There are many things that you mention that I would like to know better (in particular INLA and the MRF approximation of Gaussian Processes)… I would imagine MCMC is usually not the tool of choice for spatial models.

The thing I find appealing about the model/sampler is its solution to the doubly intractable problem and the lack of the need for discretisation. It does provide in my view significant innovation in this space. Although, I can see that to someone who had built their career developing practical computational approaches to spatial statistics this may be of marginal interest. I would see it as an elegant model with a valid but possibly dubious MCMC sampler.

What I had in mind with using this model was an exploration of the effects of discretisation and similar problems such as combining point data sets with aggregate census data sets without having to pick a spatial resolution. These seem like valid questions, but are perhaps tangential to your main interests… although correct me if I am wrong!

I hadn’t thought of the clustering or bad conditioning issue. I don’t think it would apply here as the GP will be evaluated at the latent data points also.

Another possible approach to a non-discretised model would be to approximate the partition function of the point process. The difficulty seems to be that while you can integrate under some GPs (depending on the covariance function) you usually can’t once you transform it to be non-negative. If the link function was the exponential then the integral would become a sum of correlated log normal distributions. It seems there are some approximations such as the log skew normal approximation, which may be feasible… Its a bit speculative on my part though…

It may be useful to continue to discuss by email if you don’t mind.

Thanks!
David

• Daniel says:

Just FYI, you’ve always been able to write your own sampler around a Stan program. The Stan program (the text file that usually has a .stan extension) is a language that specifies the joint log probability distribution function of the data and parameters. There’s nothing in the Stan language that says how to do inference, e.g. number of leapfrog steps for NUTS is not in the language. The inference is done outside.

A few things to note:
– if you wanted to write your own sampler around a Stan program, you can. You would be working with the log joint probability distribution function with jacobians for transformations (optional) and gradients (optional).
– You’re actually not limited to one Stan program to code the final joint probability distribution function. If you were piecing together your own sampler, you’re free to piece together as many Stan programs as you see fit.
– What you currently don’t have is a graphical structure. So… something like determining conjugacy is out. It’s not out forever, it’s just not what can easily be backed out of the result of a Stan program.

Hopefully that clears things up. For many of the research projects kicking around, people are taking the log joint distribution function in C++ and writing their own sampler around it.

• For models coded in the Stan language, you can’t have discrete parameters. But we give you gradients and transforms to unconstrained space if you want to play around with continuous samplers or optimizers or variational approximations or anything else. And we’ll hopefully get autodiffed 2nd and 3rd order derivatives out (Hessians and gradients of Hessians).

• Yes, in general you can block samplers. In the limit, you get to (generalized) Gibbs.

PyMC3 is set up to let you mix NUTS for continuous parameters and Gibbs for discrete parameters.

Stan is not a directed graphical modeling language, so there’s no way to extract a Markov blanket for efficient conditional sampling.

Further, we have no idea how changing discrete parameters affects adaptation. I asked the PyMC3 developers at one point, and I don’t think they’ve done much evaluation of this, either.

But the real kicker for us is what Andrew said—it’s much more efficient computationally and statistically to marginalize out when possible.

2. Russell Almond says:

I actually worked on a problem using this a while ago. Problem is trickier than it looks in both JAGS and Stan because of the label switching problem. I fix this by swapping the labels after the MCMC run is finished.

I wrote a paper about it and it includes code in both JAGS and Stan (https://pluto.coe.fsu.edu/mcmc-hierMM/). The Stan code is a bit old and needs to be updated to run under Stan 2. It also includes the post-processing code for calculating WAIC in R.

Curiously, I remember JAGS and Stan taking about the same amount of time. Stan took longer per cycle, but the lower autocorrelation meant that fewer cycles were needed.

• Andrew says:

Russell:

I’ve dealt with the label-switching problem by using the “ordered” variable type in Stan, to restrict the mixture components’ location parameters (for example) to be ordered, then the aliasing goes away.

• ralmond says:

I’ve been curious about that. Fruthwirth-Schnatter recommends letting the chains run freely and then sorting them out afterwards. I’ve been worried about constraining the parameters to be ordered because that constrains the parameter space.

Looking over my paper, I didn’t test Stan with the constrained model. I’m not sure if that was a limitation of the earlier version of Stan I was using. At any rate, it should be fairly easy to do an A-B comparison of the two approaches using the code on my web site.

• They’re not always so easy to sort out afterwards unless the chains are very distinct. If the mixture is such that it mixes within a single chain, no amount of post-hoc sorting is going to disentangle it.

What you really need to do is set up inference in such a way that it doesn’t depend on the chain ID. That’s usually possible, but we don’t give you anyway to diagnose convergence with label switching in Stan.

Andrew and I and others have talked before about the affect of putting ordering on the parameters. It does change the model and will affect inference. For instance, if you have two parameters alpha and beta and you constrain alpha < beta, then we get alpha < beta in every posterior draw, even if they have marginal posterior intervals that overlap.

• Eric Loken says:

right, the attempt to order the parameters does ignore overlap in their posterior distributions. Another trick that works well in some cases is to fix the membership of one or two extreme cases. If only one case, then this is can be seen as nothing more than a definition (class A is the one that contains observation i). It has the effect of clobbering down the symmetric modes a good deal.

3. Ben Goodrich says:

Needs bounds in the declaration of theta

parameters {
real theta;
}

• Ben Goodrich says:

The website evidently eats the bounds. There should be a lower = 0, upper = 1 in angle-brackets between real and theta.

• David J. Harris says:

It’s possible that the website will allow the angle brackets if they are written with an ampersand followed by lt; (for less than) or gt; (for greater than).

&ltlower=0, upper=1>

4. Dear Andrew et al.:

Apologies if I inadvertently misled readers of DBDA2E. The intended goal of that example was to illustrate that (i) model comparison could be described as a posterior probability on an explicit model index and (ii) the model index can be explicitly coded in JAGS. From a pedagogical standpoint, the explicitness of the nominal model-index parameter is very convenient.

Apologies also if the word “contrivance” was offensive or misleading. I will turn the word back on myself: The entire example in question here was quite intentionally a contrivance to illustrate the idea of model comparison. As stated in the blog post, one person’s contrivance is another person’s convenience, and I should have been more careful with words.

I look forward to seeing all the JAGS examples in DBDA2E that use discrete parameters converted into Stan code. That could be quite useful and educational to a lot of people (including me!).

John

John

• And it’s usually easiest in terms of understanding the marginalized model to start from the model with the discrete parameter and then explicitly turn the algebra crank to marginalize it out.

I’m sure we can get your book examples translated to Stan sooner rather than later. I think we’ve answered questions about quite a few of them on our mailing lists already. If you’re interested, the Lee and Wagenmakers book examples have mostly been translated, and those are largely right up your alley in cognitive science. First, I’m going to take a pass over our existing translations to bring them up to date with our current thinking in terms of modeling and programming.

• Alex says:

> First, I’m going to take a pass over our existing translations to bring them up to date with our current thinking in terms of modeling and programming.

I was digging through normal mixture models recently, and found a couple examples, e.g. here’s one

https://github.com/stan-dev/example-models/blob/master/basic_estimators/normal_mixture.stan

They do need to be updated with things like normal_log —> normal_lpdf, and increment_log_prob —> target += and so on. I also generalized the example of known variance to have unknown variance on each of the mixtures. I can add this if you think it belongs in basic examples still.

5. Ian Fellows says:

Are spike and slab priors still not able to be implemented in Stan? My understanding of the current consensus is that they are not possible, with various unsatisfying arguments about how you shouldn’t actually want to use spike and slab. I understand that discrete parameters are difficult to integrate into the theory behind no u-turn sampling, but it seems a pretty big hole none the less.

• You can think of a cauchy distribution as a continuous approximation to a spike and slab. Stan works with differentiable probability densities, the transition from a spike to a slab needs to be smooth. You could if you like do a mixture model of the form suggested here, using a mixture of a gaussian with very small width around 0, and a gaussian with very large width around zero.

target += log_mix(0.5, normal_lpdf(theta | 0.001),
normal_lpdf(theta | 10000));

something like that, but I can’t guess how well it will sample, probably difficult to get working right.

• Dan Simpson says:

Spike-and-slab posteriors are really really weird (concentrating on a subspace of the 2^p possible vector spaces), which limits the applicability of most MCMC methods (including, but not limited to, those based on HMC). I’m not sure there are any off-the-shelf MCMC solutions for these models that actually explore the posterior well. Specific solutions exist and work reasonably well, see for instance, the collected works of Sylvia Frühwirth-Schnatter, but those are samplers specifically built for the problem, rather than general solutions.

There has been a lot of work done using Stan by Aki Vehtari and his co-conspiritors (especially Juho Piironen) on variable selection with the horseshoe and it’s variants, which can be seen as a continuous approximation to spike and slab.
(See: https://arxiv.org/abs/1610.05559 & https://arxiv.org/abs/1508.02502)

• Ian Fellows says:

So I guess my question is: Is it technically feasible to add discrete sampling, or is differentiability baked in so deeply that it would be impossible to add.

• Daniel says:

Yes, it is technically feasible.

To run HMC / NUTS, the log joint probability distribution function has to be differentiable.

There are heuristics that attempt to draw samples from discrete parameters using a different scheme and the continuous ones using HMC. Using the current language, someone could write a Stan program which essentially computes the joint log probability distribution function of data and parameters. Someone could then write an algorithm that uses the Stan program and does whatever inference that person wants. The optimization (BFGS, L-BFGS) implementation started that way. So did the ADVI implementation.

There are other algorithms we (the Stan development team and others that we’re in contact with) have implemented and haven’t gotten into Stan because empirically we can’t see any benefit. I’m sure a lot of these algorithms work for a subset of models or a particular data set. For Stan, we’re conservative and are only including things into Stan that work for a reasonable set of models. (ADVI is still experimental because it doesn’t work on everything.) But that’s what gets into the official Stan release.

I’ll always encourage more experimentation. If you think you have an algorithm that will work over discrete parameters, go ahead and code it up. It’s pretty easy to call the log joint probability distribution function from R using expose_stan_functions(). If you’ve got C++ chops, it’s really easy to call it directly in C++. I don’t believe that interleaving Gibbs over discrete parameters and then using NUTS for continuous parameters works in general. (I’m sure it works in special cases, but most of the models that Stan handles aren’t conjugate and nice.)

• Daniel Simpson says:

I don’t know what arguments have been put forward for not using spike-and-slab, but the obvious one is similar to the usual argument why you don’t do L0 optimisation for variable selection: it requires the exploration of an exponentially large discrete sample space, which is not possible. This inevitably leads to the situation where the posterior distribution from your model, and the thing outputted by your code can bear little or no resemblance to each other.

This doesn’t necessarily mean that the computer output isn’t useful (see, for instance, the recent work by Veronika Ročková and Ed George, who pivot off spike and slab onto something that’s better), but it means that it isn’t well covered by the usual structures of Bayesian analysis and needs to be studied and understood in it’s own way. It’s somethign that a lot of people are slowly working towards, but it’s nowhere near as well understood as more standard models. I don’t develop Stan, but if I did those things would stop me from trying to add specific support for spike-and-slab until a stable computation method was found.

• Ian Fellows says:

You are completely right of course. As you get into higher dimensions with regression the posteriors tend to put high probability on a small subset of parameter configurations that a gibbs sampler is unlikely to be able to jump between. But that doesn’t mean they are useless for lower dimensional problems (just like L0), or in other modeling cases where we can be reasonably assured that the posterior is unlikely to be highly correlated.

Stepping aside from the particulars of the spike and slab… One of the things I love about Stan is the ability to explore any model I can write down. Right now that exploration is bounded to continuous parameters. If I want to try something out with a discrete parameter I’m out of luck and have to leave the tool.

Not having an implementation of discrete parameters in Stan is taking a pretty hard line on the issue. Namely that discrete parameters are never appropriate or desirable for a Bayesian model. If that is the team’s position then great, but it just seems like a strong POV to impose.

• Dan Simpsom says:

Stan is open source. If you want to widen the POV, you know what to do.

• +1 to what Dan Simpson said! The combinatorics of the 2N dimensional selection problem is what makes it hard to fit. Gibbs will just get stuck. You have the same problem in otehr combinatorial models like latent Dirichlet allocation (LDA)—you just can’t do full Bayes over those because you can’t evaluate the integral over the posterior. People write Gibbs samplers for LDA nevertheless, but they immediately run into a local mode and sample around it rather than exploring the whole posterior. You have the same computational problem with discrete variable selection.

• David Rohde says:

Has anybody attempted to use annealing heuristics that connect isolated modes in an augmented space within Stan? I would assume this would be possible without any modification to Stan itself, by simply coding the hierarchy within the Stan language..

6. Jim Savage says:

I’ve got a couple of examples of fitting finite mixtures and time-varying mixtures, respectively:

https://modernstatisticalworkflow.blogspot.com/2016/10/finite-mixture-models-in-stan.html (note that the label switching fix obviously didn’t completely work when I compiled the post).

https://modernstatisticalworkflow.blogspot.com/2016/10/finite-mixture-model-with-time-varying.html

And have a draft of a post on Hamilton-style regime-switching models, which should be up in the next couple of days.

7. Phil says:

I’m kibbitzing on a project that I think is going to head towards Latent Class Analysis (LCA).

A bunch of different people were asked questions about some things for which there is an objective truth but we don’t know what it is. Some of the truths are continuous (“How long did the water system provide water?”) while others are discrete (“DId the system stop working because the (a) the well ran dry, (b) the pump broke, (c) the electrical system broke, or (d) something else”). Some of the people are more reliable than others, but we don’t know in advance who those will be. We’d like to know the correct answers, and the reliability of each person.

There are some obvious similarities with a common application of LCA, which is disease testing. You have a bunch of patients with unknown disease status. You have several different diagnostic tests or measures, none of them perfect. You want to know the disease status of each patient, and also the accuracy of each test. This is all very well explored in the literature.

For the continuous variables, it’s easy for me to formulate the model in Stan. For instance, as for how long the water system provided water, there’s some true value x. And maybe I can assume that the answer I get from person j is y(j) ~ normal( x, sigma(j)), and then I have a model that relates sigma(j) to how reliable that person is at answering true/false or multiple-choice questions.

I know how I would _like_ to do the discrete variables too. Suppose it’s a yes/no question. There’s some true answer T, and the answer I get from person j is a(j). Maybe I’m willing to assume that I get the right answer with probability r(j), where r(j) is the “reliability” for person j.

I assume that if I familiarize myself with the Stan approach I will be able to figure out how to do this. It may even be “straightforward” once I know how to do it. But it doesn’t seem straightforward to me now. Like other commenters, I look forward to having even more examples to learn from.

• Andrew says:

Phil:

I recommend that you post this question on the stan-users list.

• I coded the Dawid and Skene (1979) model for exactly the discrete diagnostistic conditions you bring up as an example in the latent discrete parameters chapter of the Stan manual. In the binary case, you get sensitivity and specificity parameters for the diagnostic tests, and a latent disease state for each patient, which is unknown (or maybe only known in some cases), and you put it all together in the usual Bayesian way to do inference on the tests’ sensitivity and specificity (alternatively, their accuracy and bias) and also on the prevalence of the disease in the population and on the disease status of each patient. I was mainly applying it to getting labeled data for training natural language processing systems from noisy sources like Amazon’s Mechanical Turk.

As a historical note, Jennifer and Andrew helping me code this model up from scratch (didn’t know I was reproducing Dawid and Skene’s model at the time) was my gateway model for Bayesian inference and I’ve done lots of work on alternatives that aren’t in the manual (like jointly training a logistic regression model with noisy annotation data, as suggested in the JMLR paper of Raykar et al.).

• I should’ve mentioned that the epidemiology literature is full of these models and the marginalizations are usually included because they were fitting maximum likelihood after marginalizing out the discrete parameters (Dawid and Skene used EM, for example, and the marginalizations are in the E step). Same thing with the CJS models and their like in the ecology literature.

Nowadays, you’re likely to find lots of people reinventing the approaches of the epidemiologists over in the machine learning literature under the heading “crowdsourcing”. I have a whole series on my old blog discussing reinventions of the Dawid and Skene model (I was sympathetic having done it myself).

8. Stephen Martin says:

I fit mixtures in Stan frequently. The only issue is within chain label switching, which one can usually avoid by adding identifiability constraints which helps push label switched modes further away from each other with a trough between them. Usually each chain will settle on one of the modes and not leave. Between chain switching is fine, just examine whether they converged on equal, but label switched solutions. Pick out converged chains for the summary.

Gibbs doesn’t often have within chain switching, but that’s because it is generally poor at exploring the full posterior.

9. Anonymous says:

I fit mixture models in stan frequently.
In fact, I’m hoping to soon publish a paper that makes HEAVY use of mixture modeling — It’s a measurement model (2PL IRT, for now) that allows for multiple item models, and within a certain item model is a probabilistic clustering model to assign individuals to latent groups.
The only issue with mixtures in MCMC is the label switching problem, and of those, the only real issue is within-chain switching.
This is, of course, because there are multiple posterior modes that are equal except the labels are swapped, so each individual chain may find and explore each mode, yielding posterior estimates that are pointless.
This can be fixed, to some degree, by imposing some identifiability constraints to create unique posterior modes. Generally, this works, because it either kills off the other modes, or it pushes them further away in space. So far, this has worked fine for me most of the time.
Between chain switching is irrelevant; just examine the posteriors to see which chains explored which modes, extract those that explored the same modes and examine those chains.

When I first started using Stan, I hated this. JAGS never had the within-chain switching problem, so I never had to impose these constraints. Well, as I soon found out, this is because Gibbs sampling sucks at exploring the full posterior space, unlike Stan.
Nevertheless, the switch was worth it; I could run 2500 post-warmup gibbs iterations on my model in an hour and get about 40 effective samples; I do the same in HMC for 2.5 hours, and I get 1800 effective samples.

The choice is obvious.