Derivative-based MCMC as a breakthrough technique for implementing Bayesian statistics

John Salvatier pointed me to this blog on derivative based MCMC algorithms (also sometimes called “hybrid” or “Hamiltonian” Monte Carlo) and automatic differentiation as the future of MCMC.

This all makes sense to me and is consistent both with my mathematical intuition from studying Metropolis algorithms and my experience with Matt using hybrid MCMC when fitting hierarchical spline models. In particular, I agree with Salvatier’s point about the potential for computation of analytic derivatives of the log-density function. As long as we’re mostly snapping together our models using analytically-simple pieces, the same part of the program that handles the computation of log-posterior densities should also be able to compute derivatives analytically.

I’ve been a big fan of automatic derivative-based MCMC methods since I started hearing about them a couple years ago (I’m thinking of the DREAM project and of Mark Girolami’s paper), and I too wonder why they haven’t been used more. I guess we can try implementing them in our current project in which we’re trying to fit models with deep interactions. I also suspect there are some underlying connections between derivative-based jumping rules and redundant parameterizations for hierarchical models.

It’s funny. Salvatier is saying what I’ve been saying (not very convincingly) for a couple years. But, somehow, seeing it in somebody else’s words makes it much more persuasive, and again I’m all excited about this stuff.

My only amendment to Salvatier’s blog is that I wouldn’t refer to these as “new” algorithms; they’ve been around for something like 25 years, I think.

1. NU says:

If someone put Girolami and Calderhead's algorithm into an CRAN R package with a fast C/C++ core, I bet a lot of people would use it, even without automatic differentiation.

For efficiency, it would be useful if an implementation accepted a function that computes the log-posterior and all the derivatives simultaneously, since sometimes the function and its derivatives contain common calculations.

2. jsalvatier says:

I think DREAM is difference based (between different points in the sampling history), not derivative based. I agree it has some nice properties though.

3. K? O'Rourke says:

The piece by jsalvati is very nicely written.

I strongly believe this comment:
"Bayesian statistics in particular is that it will make learning, understanding and using statistics much easier than it is presently."

But that to will take a lot of work to make clear to as maybe people as could benifit by really grasping why.

My current model is that MCMC is one of the barriers to understanding and that simple conjugate models (especially of low dimension) do more harm than good – for most.

I have progressed a bit on this http://www.stat.columbia.edu/~cook/movabletype/ar… – given a couple day long courses for non-statisticians using slightly less goofy material.

There are conceptual barriers – the biggest for non-statisticains or maybe even non-Bayesians is perhaps to accept a batch of numbers (a posterior sample) as _the_ answer!

K?

4. For those who don't already know, Hybrid Monte Carlo is increasingly popular in the machine learning community, particularly the "deep learning" corner of said community. Most recently, HMC has been key to the success of mean-covariance Restricted Boltzmann Machines ("mcRBMs") applied to vision (CVPR paper, AISTATS paper) and speech recognition (NIPS 2010 preprint). Well, mcRBMs plus GPU hardware.

The automatic differentiation angle is quite interesting as well, and I'm frankly baffled at how little it's employed in machine learning and related disciplines. The lab I work in has produced Theano, a Python library that allows users to symbolically specify a mathematical expression as a computation graph and then generate (on the fly) efficient CPU and GPU code for evaluating said expression and its gradients. We're certainly interested in user perspectives, suggestions and contributions from all walks of mathematical science, especially from communities as closely related to our own as the Bayesian statistics/hierarchical modelling community.

5. Tim says:

re: Theano:

I explored Theano and it seems great. As mentioned above, however, the way to get a bunch of scruffy statisticians to use your work is to hook it into R (with however thin a layer of glue is possible) and see what happens. This summer's GSoC products included 'radx', an R package for automatic differentiation, and I've personally been playing around with CppBugs and some of Girolami & Calderhead's examples to see how well I can make everything "go" using R to drive C++ and/or GPU code that does the heavy lifting.

So wrap your code before some idiot like me gets the credit for making it available ;-)

6. Bob Carpenter says:

Hybrid/Hamiltonian Monte Carlo is independent of how you compute the gradients.

My practical question is how important is automatic differentiation? In a BUGS-like graphical model, it seems easy enough to compute all the derivatives analytically. Is it much faster or more robust to compute with automatic differentiation for situations like covariance matrices for hierarchically modeled coefficients?