We need help picking out an automatic differentiation package for Hamiltonian Monte Carlo sampling from the posterior of a generalized linear model with deep interactions. Specifically, we need to compute gradients for log probability functions with thousands of parameters that involve matrix (determinants, eigenvalues, inverses), stats (distributions), and math (log gamma) functions. Any suggestions?
The Application: Hybrid Monte Carlo for Posteriors
We’re getting serious about implementing posterior sampling using Hamiltonian Monte Carlo. HMC speeds up mixing by including gradient information to help guide the Metropolis proposals toward areas high probability. In practice, the algorithm requires a handful or of gradient calculations per sample, but there are many dimensions and the functions are hairy enough we don’t want to compute derivaties by hand.
Auto Diff: Perhaps not What you Think
It may not have been clear to readers of this blog that automatic differentiation isn’t just an automatic procedure for symbolic differentiation or a finite differences method.
The way auto differentiation works is by applying the chain rule to all the basic expressions and assignments in your source code. Wikipedia’s entry for automatic differentiation article provides a nice high-level introduction.
Op Overload or Source Transform?
There are two common approaches, operator overloading and source transforming. Source transformers read a program and generate a new program to compute derivatives and the function at the same time. Operator overloading uses feature of C++ and Fortran 90 that lets you redefine behavior of functions for your types. Thus numerical types are replaced with types representing the function value and the derivative value, and operations with updates on both at the same time.
It seems to me that source transforming would be way more efficient. But mileage may vary by implementation — there’s lots of opportunity for compiler optimization techniques to come into play to eliminate intermediate computations.
Forward and Reverse Chain Rule
You can apply the chain rule to f(g(h(x))) working in from f to h or out from h to f (or in combinations of the two). Auto-diff packages almost all handle the forward direction, which is easier. By all accounts, the reverse mode is better for gradients, which is we need. But not all packages implement reverse mode at all, and not as deeply as they implement forward mode. A variable-by-variable application of forward-mode auto-differentiation would require a step through the whole log probability function for each dimension, which would be prohibitive in our context.
The problem is that reverse mode’s harder to implement. If understanding the doc correctly, some of the packages investigated, like Sacado, a part of the Trilinos scientific computing suite, have BLAS/LAPACK matrix library with their forward mode auto differentiation but not their reverse mode. Many of the open source packages are also tied up with larger suites of software of which they are a part, making it rather confusing to install them and tease apart what they do.
Yes, we care about CPU time, not just number of steps. If we had forever, we’d just run BUGS. By the same argument, we also care about memory usage. If we had all the memory in the world, we’d just run BUGS.
Some of the approaches to automatic differentiation use disk to write out intermediate results. Or compile “tapes” which are then interpreted, which also seems like it’d be slow (though I have no experience).
Just Point AD at Your Code?
For the operator overloading approaches, which seem to be more used because of the difficulty of source transforming C++, it’s not quite as simple as pointing an AD program at your code. At the least, everything I’ve looked at requires the numerical types to replaced with general templated type variables or variables defined within the AD package itself. That’s where things get messy with the matrix and stats libraries.
Licenses that Play Nicely with Others
We also want a package with a license that plays nicely with others, so that we can redistribute the tools we build. That means at least GPL, better yet LGPL, or best of all, Apache or BSD licenses. I (at least) am not interested in touching anything with an “academic use only” license.
The Bleg: Reprise
So, any suggestions to what we should be looking at? My head’s spinning from trying to sort things out from auto-generated doc and examples that are not geared toward involved reverse-mode gradient computations.