Beautiful paper on HMMs and derivatives

I’ve been talking to Michael Betancourt and Charles Margossian about implementing analytic derivatives for HMMs in Stan to reduce memory overhead and increase speed. For now, one has to implement the forward algorithm in the Stan program and let Stan autodiff through it. I worked out the adjoint method (aka reverse-mode autodiff) derivatives of the HMM likelihood (basically, reverse-mode autodiffing the forward algorithm), but it was stepwise and the connection to forward-backward wasn’t immediately obvious. So I thought maybe someone had already put a bow on this in the literature.

It was a challenging Google search, but I was rewarded with one of the best papers I’ve read in ages and by far the best thing I’ve ever read on hidden Markov models (HMMs) and their applications: the paper by Qin et al. on single-channel kinetics discussed below.

The paper provides elegant one-liners for the forward algorithm, the backward algorithm, the likelihood, and the derivative of the likelihood with respect to model parameters. For example, here’s the formula for the likelihood:

$latex L = \pi^{\top} \cdot \textrm{diag}(B_1) \cdot A \cdot \textrm{diag}(B_2) \cdot \cdots \cdot A \cdot \textrm{diag}(B_T) \cdot 1,$

where $latex \pi$ is the initial state distribution, $latex B_t$ is the vector of emission densities for the states at time $latex t$, $latex A$ is the stochastic transition matrix, and $latex 1$ is a vector of 1s. Qin et al.’s software uses an external package to differentiate the solution for $latex \pi$ as the stationary distribution of the transition matrix $latex A$, i.e., the solution to $latex \pi^{\top} \cdot A = \pi^{\top}.$
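To make the one-liner concrete, here’s a minimal sketch in Stan (the function names and the emission-matrix layout are my own, not Qin et al.’s code), including a linear solve for $latex \pi$ as the stationary distribution:

```stan
functions {
  // Stationary distribution of a stochastic matrix A, assuming an
  // ergodic chain: solve pi' A = pi', i.e., (A' - I) pi = 0, with the
  // redundant last equation replaced by the constraint sum(pi) = 1.
  vector stationary(matrix A) {
    int K = rows(A);
    matrix[K, K] M = A' - diag_matrix(rep_vector(1, K));
    vector[K] b = rep_vector(0, K);
    M[K] = rep_row_vector(1, K);  // last row imposes sum-to-one
    b[K] = 1;
    return mdivide_left(M, b);
  }

  // One-line likelihood as a matrix product:
  //   L = pi' diag(B_1) A diag(B_2) ... A diag(B_T) 1,
  // where rho is the initial distribution pi and B[k, t] is the
  // emission density of observation t in state k.
  real hmm_lik(vector rho, matrix A, matrix B) {
    int T = cols(B);
    row_vector[rows(A)] alpha = rho' .* col(B, 1)';  // pi' diag(B_1)
    for (t in 2:T)
      alpha = (alpha * A) .* col(B, t)';             // ... A diag(B_t)
    return sum(alpha);                               // ... times vector of 1s
  }
}
```

Replacing one row of the singular system with the sum-to-one constraint is the standard way to pin down the stationary distribution; a solve like the one above is the piece Qin et al.’s software differentiates with an external package.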

The forward and backward algorithms are stated just as neatly, as are the derivatives of the likelihood w.r.t. the parameters. The authors put the likelihood and derivatives together to construct a quasi-Newton optimizer that computes maximum likelihood estimates for HMMs. They even use second derivatives for estimating standard errors. For Stan, we just need the derivatives to plug into our existing quasi-Newton solvers and Hamiltonian Monte Carlo.

But that’s not all. The paper’s about an application of HMMs to single-channel kinetics in chemistry, a topic about which I know nothing. The paper starts with a very nice overview of HMMs and why they’re being chosen for this chemistry problem. The paper ends with a wonderfully in-depth discussion of the statistical and computational properties of the model. Among the highlights is the joint modeling of multiple experimental data sets with varying measurement error.

In conclusion, if you want to understand HMMs and are comfortable with matrix derivatives, read this paper. Somehow the applied math crowd gets these algorithms down correctly and cleanly where the stats and computer science literatures flail in comparison.

Of course, for stability in avoiding underflow of the densities, we’ll need to work on the log scale. Or, if we want to keep the matrix formulation, we can use the incremental rescaling trick to rescale the columns of the forward algorithm and accumulate our own exponent to avoid underflow. We’ll also have to autodiff through the solution for the stationary distribution, but Stan’s internals make that particular piece of plumbing easy to put in place and also a potential target for derivative-level optimization. We also want to generalize to the case where the transition matrix $latex A$ depends on predictors at each time step through a multi-logit regression. With that, we’d be able to fit anything that can be fit with the nicely designed and documented R package moveHMM, which already fits a range of maximum likelihood estimates for HMMs.
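Here’s a minimal sketch of that rescaling trick in Stan (again, the function name and emission-matrix layout are my assumptions): normalize the forward vector at each step, accumulate the logs of the normalizing constants, and return the log likelihood so nothing underflows.

```stan
functions {
  // Forward algorithm in matrix form with incremental rescaling.
  // rho: initial state distribution; A: stochastic transition matrix;
  // B[k, t]: emission density of observation t in state k.
  real hmm_log_lik_scaled(vector rho, matrix A, matrix B) {
    int T = cols(B);
    row_vector[rows(A)] alpha = rho' .* col(B, 1)';
    real log_lik = 0;
    for (t in 2:T) {
      real c = sum(alpha);   // scale factor keeping alpha order 1
      log_lik += log(c);     // accumulate the exponent ourselves
      alpha = ((alpha / c) * A) .* col(B, t)';
    }
    return log_lik + log(sum(alpha));
  }
}
```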

HMMs in Stan? Absolutely!

Yesterday I was having a conversation with Andrew that went like this:

Andrew:
Hey, someone’s giving a talk today on HMMs (that someone was Yang Chen, who was giving a talk based on her JASA paper Analyzing single-molecule protein transportation experiments via hierarchical hidden Markov models).

Maybe we should add some specialized discrete modules to Stan so we can fit HMMs. Word on the street is that Stan can’t fit models with discrete parameters.

Me:
Uh, we can already fit HMMs in Stan. There’s a section in the manual that explains how (section 9.6, Hidden Markov Models).

We’ve had some great traffic on the mailing list lately related to movement HMMs, which were all the rage at ISEC 2016 for modeling animal movement (e.g., whale diving behavior in the presence or absence of sonar).

Andrew:
But how do you marginalize out the discrete states?

Me:
The forward algorithm (don’t even need the backward part for calculating HMM likelihoods). It’s a dynamic programming algorithm that calculates HMM likelihoods in $latex \mathcal{O}(N \, K^2)$, where $latex N$ is the number of time points and $latex K$ the number of HMM states.

Andrew:
So you built that in C++ somehow?

Me:
Nope, just straight Stan code. It’s hard to generalize to C++ given that the state emission models vary by application and the transitions can involve predictors. As with many such classes of models, Stan’s flexible enough to implement, say, all of the models described in Roland Langrock et al.’s moveHMM package for R. The doc for that package is awesome; it’s the best intro to the models I’ve seen. I like the concreteness of resolving details in code. Langrock’s papers are nice, too, so I suspect his co-authored book on HMMs for time series is worth reading as well. That’s another one that’d be nice to translate to Stan.
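For concreteness, here’s a minimal sketch of what that straight Stan code can look like: the forward algorithm on the log scale, in the spirit of the manual’s example (the function name and the log-emission matrix layout are my choices, with the emission and transition models to be filled in per application).

```stan
functions {
  // Forward algorithm on the log scale: O(N K^2) for N time points and
  // K states, with log_sum_exp keeping the recursion stable.
  // log_rho: log initial state distribution;
  // log_A[j, k]: log probability of transitioning from state j to k;
  // log_B[k, t]: log emission density of observation t in state k.
  real hmm_log_lik(vector log_rho, matrix log_A, matrix log_B) {
    int K = rows(log_B);
    int T = cols(log_B);
    vector[K] log_alpha = log_rho + col(log_B, 1);
    for (t in 2:T) {
      vector[K] acc;
      for (k in 1:K)
        acc[k] = log_sum_exp(log_alpha + col(log_A, k)) + log_B[k, t];
      log_alpha = acc;
    }
    return log_sum_exp(log_alpha);  // marginalizes out the final state
  }
}
```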

More HMMs in Stan coming soon (I hope)

Case study on movement HMMs coming soon, I hope. And I’ll pull the HMM stuff out into a state-space chapter. Right now, it’s spread between the CJS ecology models (where the states are which animals are alive) in the latent discrete parameters chapter and HMMs in the time-series chapter. I’m also going to try to get some of the more advanced sufficient statistics, N-best, and chunking algorithms I wrote for LingPipe going in Stan, along with some semi-supervised learning methods I used to work on in NLP. I’m very curious to compare the results in Stan to those derived by plugging in (penalized) maximum likelihood parameter estimates.

Getting working movement HMM code

Ben Lambert and I managed to debug the model through a mailing-list back-and-forth, which also contains code for the movement HMMs (including a cool wrapped Cauchy distribution that Ben implemented as a function). The last word on that thread has the final working models with covariates.

P.S.

We should be able to fit Yang Chen’s hierarchical HMMs in Stan. She assumed a univariate normal emission density that was modeled hierarchically across experiments, with a shared transition matrix involving only a few states. We wouldn’t normally perform her model selection step; see Andrew’s previous post on overfitting cluster analysis, which arose out of a discussion I had with Ben Bolker at ISEC.