Here’s Michael Betancourt writing in 2015:

Leveraging the coherent exploration of Hamiltonian flow, Hamiltonian Monte Carlo produces computationally efficient Monte Carlo estimators, even with respect to complex and high-dimensional target distributions. When confronted with data-intensive applications, however, the algorithm may be too expensive to implement, leaving us to consider the utility of approximations such as data subsampling. In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo.

But then here’s Jost Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter in 2016:

Despite its successes, the prototypical Bayesian optimization approach – using Gaussian process models – does not scale well to either many hyperparameters or many function evaluations. Attacking this lack of scalability and flexibility is thus one of the key challenges of the field. . . . We obtain scalability through stochastic gradient Hamiltonian Monte Carlo, whose robustness we improve via a scale adaptation. Experiments including multi-task Bayesian optimization with 21 tasks, parallel optimization of deep neural networks and deep reinforcement learning show the power and flexibility of this approach.

So now I’m not sure what to think! I guess a method can be useful even if it doesn’t quite optimize the function it’s supposed to optimize? Another twist here is that these deep network models are multimodal so you can’t really do full Bayes for them even in problems of moderate size, even before worrying about scalability. Which suggests that we should think of algorithms such as that of Springenberg et al. as approximations, and we should be doing more work on evaluating these approximations. To put it another way, when they run stochastic gradient Hamiltonian Monte Carlo, we should perhaps think of this not as a way of tracing through the posterior distribution but as a way of exploring the distribution, or some parts of it.

To some extent I’m not surprised some version of subsampling works for GPs, because I find it hard to believe that having N data points justifies N (highly correlated) degrees of freedom. Obviously there’s been a huge push in the last decade or so to reduce this dimension to either a small number of effective dofs a priori (predictive processes [i forget their name in machine learning!], fixed-rank kriging, optimal projections, etc), or a large number of structured dofs a priori (SPDE Markov models, some examples of fixed-rank kriging, some wavelet methods, etc), or a small number of dofs a posteriori (variational GPs etc). These have met with varying degrees of success.

Martin Wainwright at Berkeley has really worked hard to show that appropriate random projections/subsamples (sketches, in his nomenclature) can turn structured high-dimensional penalised maximum likelihood problems (like GP regression) into much lower-dimensional problems. For example, he argues that for a squared exponential kernel, you really only need to solve a log(N) x log(N) linear system, which makes the cost O(log^3(N)) rather than O(N^3).
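To make the sketch-and-solve idea concrete, here is a toy numpy sketch of sketched kernel ridge regression (the kernel, lengthscale, sketch dimension, and data are all made up for illustration, and a Gaussian sketch stands in for the more refined constructions in the actual papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy kernel ridge regression: solve (K + lam*I) alpha = y for N points.
N, m = 2000, 100        # m is the sketch dimension, a few times the effective rank
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

# Squared-exponential kernel matrix with lengthscale 0.1 (arbitrary choice)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
lam = 1e-2

# Exact solve: O(N^3)
alpha_exact = np.linalg.solve(K + lam * np.eye(N), y)

# Sketch-and-solve: restrict alpha = S^T beta for a random m x N sketch S,
# and solve the resulting m x m system instead of the N x N one.
S = rng.normal(size=(m, N)) / np.sqrt(m)
SK = S @ K
A = SK @ K @ S.T + lam * SK @ S.T   # normal equations in the sketched subspace
b = SK @ y
beta = np.linalg.solve(A, b)
alpha_sketch = S.T @ beta

f_exact = K @ alpha_exact
f_sketch = K @ alpha_sketch
rel_err = np.linalg.norm(f_sketch - f_exact) / np.linalg.norm(f_exact)
print(f"relative error of sketched fit: {rel_err:.3f}")
```

Because the squared-exponential kernel’s eigenvalues decay very fast, a sketch dimension far below N recovers essentially the same fit.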

So the bones of this method will definitely work.

My problem with (or scepticism towards, i guess) all of these subsampled MCMC methods is that I think they’re generalising the frequentist intuition incorrectly. Frequentist methods perform a sketch and then perform inference on the sketched problem. This makes sense: if the problem was well behaved before (eg the log likelihood had 2 continuous derivatives), then the same will be true of the sketched problem. So I have no problem with the Bayesian version of this, where the problem is sketched and then “exact” Bayesian inference is performed on the sketched problem (as if it was the true problem). It would need to be shown that the posteriors were then close (and they will be).

When the subsampling/sketching is repeated at each step of the algorithm, then bad things often happen (and I think this is what Betancourt was talking about in his paper). Think about finding the MAP estimate with Newton’s method. If you compute a new subsample for every function/gradient/hessian evaluation, the likelihood will no longer be twice differentiable and the method will probably not converge. MCMC is trying to solve a harder problem than just maximisation, so this doesn’t bode well for it. In fact, we know how to fix this problem using a pseudo-marginal construction, but we also know that this typically lowers the ergodicity class (eg the subsampled version of a geometrically ergodic MCMC algorithm will usually not be geometrically ergodic).
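The Newton’s-method point can be shown in a few lines of numpy (a toy Gaussian-mean MAP problem; the sizes and seeds are arbitrary): with a fixed subsample, Newton converges immediately to the sketched MAP, while a fresh subsample at every step leaves the iterates jittering forever.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=100_000)  # MAP of theta is the data mean

def newton_step(theta, batch):
    # For this quadratic log-likelihood: grad = n*theta - sum(x), hess = n,
    # so a Newton step jumps straight to the batch mean.
    grad = batch.size * theta - batch.sum()
    hess = batch.size
    return theta - grad / hess

# Fixed subsample: Newton converges (here, in one step) to the sketched MAP.
fixed = rng.choice(data, size=100, replace=False)
theta_fixed = 0.0
for _ in range(20):
    theta_fixed = newton_step(theta_fixed, fixed)
print("fixed subsample:", theta_fixed)

# Fresh subsample at every step: each iterate is an independent batch mean,
# so the iterates never settle; the jitter stays at sd ~ 1/sqrt(100) = 0.1
# no matter how long we run.
theta, trace = 0.0, []
for _ in range(20):
    theta = newton_step(theta, rng.choice(data, size=100, replace=False))
    trace.append(theta)
print("fresh subsamples, last three iterates:", trace[-3:])
```

The fresh-subsample iterates are unbiased but never converge, which is exactly the failure mode that gets worse when the algorithm, like MCMC, needs more than a point estimate.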

TL;DR: Maybe this works, but, as Phil Collins says, it’s against all odds.

Related to this: If you compute k approximate posteriors from k different subsampled data sets, how should you combine them?

Daniel:

We have EP-like algorithms: http://www.stat.columbia.edu/~gelman/research/unpublished/ep.pdf

That paper is a couple years old; we’re preparing a new version.

Point of clarification: the paper uses a neural network rather than a GP as the function approximator.

Hey Dan, would it be possible for you to provide a link to Wainwright’s work on GP solution costs? The guy’s too prolific — I’m having trouble tracking it down in amongst all his other work.

+1 to Dan’s comments.

The pseudo-marginal constructions not only lower the ergodicity class, they’ll ruin the coefficients too, leading to extremely slow performance. In high dimensions there’s no way to recover from throwing away information!

Ultimately, I hate these narratives of making all kinds of bad approximations because you have to use ALL OF THE DATAS. In practice you gain so much more from doing a robust, principled analysis on a smaller set of data, and then increasing the data size only if you need it.

Michale:

XL Meng formalized this question nicely: “just how much of a defect in data quality can be cancelled out by a very large amount of it?”

https://www.youtube.com/watch?v=8YLdIDOMEZs (starts around 16 minutes)

Perhaps something similar could be done for “just how much of a defect in the data-analysis approximation can be cancelled out by a very large amount of data being analysed with that approximation?”

Sorry about the typo Michael.

Has anyone looked into the gradient approximation used by SPSA? http://www.jhuapl.edu/spsa/

Obviously if you’ve got Stan doing auto-diff then this is less relevant, but the auto-diff is itself expensive. So, it could be good to have as an alternative a numerical differentiation using something like SPSA. Obviously the SPSA method gives you a noisy estimate of the gradient, but it isn’t the kind of noise you get from sub-sampling the data.

In particular, I’m thinking about a situation where you have a large number of parameters, say 10,000. Numerical computation of the gradient using a “standard” finite difference method would take something like 10,000 evaluations of the function at each gradient calculation. Using the SPSA method you get an unbiased but noisy estimate of the gradient with a single function evaluation. Averaging N << 10,000 SPSA evaluations might give a sufficiently accurate gradient for less cost than auto-diff for certain classes of problems, especially those with large parameter vectors and/or situations where you're using ODEs or the like.

Daniel,

Auto-diff in Stan is much faster than numerical differentiation. So I don’t think there’s any reason to use this sort of numerical method as you suggest.

Take a look at SPSA; I doubt it’s true that autodiff is faster as the number of dimensions increases. In essence, autodiff simultaneously evaluates F(a,b,c,d,e…) and F_a(a,b,c,d,e…), F_b(a,b,c,d,e…), etc. That is, you get simultaneous evaluation of 10,000 functions when there are 10,000 dimensions. You do it in a smart way, so it’s faster than doing (F(a+da,b,c,d,…) – F(a,b,c,d,…)) / da …. 10,000 times, but that’s not what SPSA does.

with the SPSA approximation, you do

(F(a+da,b+db,c+dc, d+dd,….) – F(a,b,c,d,…) ) for a particular kind of random (da,db,dc,….) vector

and then divide this quantity by each of da, db, dc in turn to get the components of an unbiased approximation to the gradient. That is, you do ONE extra function evaluation to get the derivative.

Of course, this gives you only a noisy unbiased estimate of the gradient. In the SPSA literature it’s been shown that for optimization, that’s all you need. Perhaps for Hamiltonian Monte Carlo you might need more, but there’s no reason you couldn’t do the same type of evaluation several times at each step and average: more than once, but a lot less than 10,000 times!

Basically, what I’m saying is that comparing auto-diff to standard numerical differentiation requiring N extra function evaluations for an N-dimensional space is an obvious choice in favor of auto-diff, but it’s not so obvious when compared to some variant of SPSA-based numerical diff requiring only one extra function evaluation, or perhaps the average of a small number of extra function evals.
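For concreteness, here is a toy numpy sketch of the (two-sided, Spall-style) SPSA estimator; the quadratic test function, dimensions, and averaging count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(f, theta, c=1e-3, n_avg=1):
    """Average n_avg two-sided SPSA estimates of grad f at theta.

    Each estimate costs exactly two function evaluations,
    independent of the dimension of theta.
    """
    g = np.zeros_like(theta)
    for _ in range(n_avg):
        delta = rng.choice([-1.0, 1.0], size=theta.size)  # Rademacher perturbation
        df = f(theta + c * delta) - f(theta - c * delta)
        g += df / (2.0 * c) * delta   # 1/delta_i == delta_i for +-1 entries
    return g / n_avg

# Toy target: a quadratic whose gradient we know analytically.
d = 50
a = np.linspace(0.5, 2.0, d)
f = lambda th: np.sum(a * th ** 2)
theta = rng.normal(size=d)
g_true = 2.0 * a * theta

# A single estimate is unbiased but very noisy; averaging drives the
# error down at the usual 1/sqrt(n) rate.
g_hat = spsa_gradient(f, theta, n_avg=2000)
rel_err = np.linalg.norm(g_hat - g_true) / np.linalg.norm(g_true)
print(f"relative error after averaging: {rel_err:.3f}")
```

Note the trade-off this makes visible: the per-estimate cost is dimension-free, but the noise scales with the problem, so a lot of averaging may be needed before the estimate is accurate enough.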

Also, I wonder if the noise in the derivative evaluation through SPSA could be incorporated into the theory of Hamiltonian Monte Carlo. Specifically, taking a cue from the physics motivation: suppose you have a small system inside a rigid container, so your potential energy is a function only of the positions of the molecules of your system. Now suppose instead the container is only rigid on average (over time the average positions of the molecules of the walls are fixed, but the individual positions vary in time); the potential energy is then a function not just of the positions of your system’s molecules but also of random perturbations from the boundary molecules. Perhaps we can analogize the error in the SPSA approximation to this scenario, especially because the SPSA error is zero-mean. That is, we would treat the dynamics under the SPSA approximation as equivalent to dynamics where the gradient is computed exactly but the potential varies rapidly in time, and then calculate the average long-time behavior.

There are lots of multi-time-scale molecular dynamics studies in which the long-time behavior of a system that has fast dynamics is evaluated using some kind of averaging scheme; one could use something like http://repository.cmu.edu/cee/78/ or https://www.researchgate.net/profile/Mark_Tuckerman/publication/225092520_Reversible_Multiple_Time_Scale_Molecular-Dynamics/links/0c96052d41180be851000000.pdf or any number of related multi-time-scale papers.

Daniel,

It depends on the model, but in a lot of cases autodiff just costs a fixed multiple of the cost of the function evaluation, even as the number of dimensions increases.

This would also be true of SPSA, and it would be model independent. Of course auto-diff is going to be lower-noise, but as I say perhaps it’s possible to incorporate the noise in a constructive way using an alternative fast-slow integration scheme.

No doubt auto-diff is cool, but so is SPSA, and it might be useful to have alternatives, especially for those models where auto-diff isn’t a fixed multiple of the cost of a function eval. For example, in ODE solvers, I think the way you’re handling it is to calculate the derivatives with respect to the parameters as additional dimensions in the ODE solution (I could be mistaken though). With SPSA you’d simply integrate the system twice: once at the original parameter point and once at the simultaneously perturbed parameter point. You could imagine that if you have a 1-dimensional ODE with, say, 50 parameters and hyper-parameters involved, it’d become a 51-dimensional ODE under auto-diff but just two runs of a 1-D ODE under SPSA. This is obviously the extreme example.
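Here is a hedged numpy sketch of that comparison (forward Euler, a made-up 1-D ODE with three parameters instead of 50, arbitrary step counts; the forward-sensitivity formulation below is my reading of the “extra dimensions” approach, not Stan’s actual implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D ODE: dy/dt = -(k0 + k1*t + k2*t^2) * y, y(0) = 1.
# Analytically y(T) = exp(-(k0*T + k1*T^2/2 + k2*T^3/3)), so the exact
# gradient of y(T) w.r.t. k is known and we can check both approaches.
k = np.array([0.8, 0.5, 0.3])
T, n_steps = 1.0, 1000
dt = T / n_steps

def solve(k):
    """Forward-Euler integration of the 1-D ODE alone (1 state)."""
    y = 1.0
    for i in range(n_steps):
        t = i * dt
        y += dt * (-(k[0] + k[1] * t + k[2] * t**2) * y)
    return y

def solve_with_sensitivities(k):
    """Augmented system: y plus s_j = dy/dk_j (1 + 3 states)."""
    y, s = 1.0, np.zeros(3)
    for i in range(n_steps):
        t = i * dt
        rate = k[0] + k[1] * t + k[2] * t**2
        phi = np.array([1.0, t, t**2])   # d(rate)/dk
        s += dt * (-phi * y - rate * s)  # sensitivity equations
        y += dt * (-rate * y)
    return y, s

# Exact gradient for comparison
yT = np.exp(-(k[0] * T + k[1] * T**2 / 2 + k[2] * T**3 / 3))
g_exact = yT * np.array([-T, -T**2 / 2, -T**3 / 3])

# Forward sensitivities: one integration of a (1 + p)-dimensional system
_, g_sens = solve_with_sensitivities(k)

# SPSA: two integrations of the 1-dimensional system per estimate
def spsa(n_avg=300, c=1e-4):
    g = np.zeros(3)
    for _ in range(n_avg):
        delta = rng.choice([-1.0, 1.0], size=3)
        g += (solve(k + c * delta) - solve(k - c * delta)) / (2 * c) * delta
    return g / n_avg

g_spsa = spsa()
print("sensitivities error:", np.linalg.norm(g_sens - g_exact))
print("SPSA error:", np.linalg.norm(g_spsa - g_exact))
```

The augmented system gives an essentially exact gradient in one pass, while SPSA trades the extra dimensions for noise that has to be averaged away over repeated cheap runs.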

The ‘fixed multiple of the cost of the function evaluation’ bound is, I think, also model/function independent for (reverse-mode) automatic differentiation (AD), assuming the cost is in terms of operation count / time complexity. In particular, for reverse-mode AD of a scalar-valued function of a vector input, it has been shown (see e.g. [1] and [2]) that it is always possible to evaluate the gradient of the function with respect to the vector input at an operation-count cost of at most six times the forward evaluation of the function, and more typically two to three times the cost.

The space complexity, however, scales with the number of intermediate results that need to be stored from the forward pass for the gradient propagation in the reverse pass. Therefore, in reality, for very large computational graphs where memory becomes a bottleneck, it might not be possible in practice to compute the gradient at a constant multiple of the forward function cost. Integrating ODEs over a large number of timesteps would, I guess, be a good example of a potentially large computational graph, so it could be that SPSA does offer an advantage here by side-stepping the memory overhead. However, it would need to be a sufficiently large system that it is not possible to store all intermediate states in memory for there to be a large gain. If the ODEs / integrator are time-reversible, it is also possible to do reverse-mode AD in a memory-efficient way even for large systems / many timesteps by recomputing the intermediate states on the fly in the reverse pass [3].
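A toy tape-based reverse-mode AD makes the constant-factor claim concrete (the `Var` class and function names are invented for this sketch, not any real library’s API): one forward pass records the computation, and a single reverse sweep touches each recorded operation once, yielding all d partials regardless of d.

```python
# Minimal tape-based reverse-mode AD for a scalar-valued function of
# many inputs. One forward pass builds the graph; one backward pass
# gives every partial derivative.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (Var, local_partial) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    # Topological order via DFS, then one reverse sweep over the tape.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    out.grad = 1.0
    for v in reversed(order):
        for p, local in v.parents:
            p.grad += v.grad * local

# f(x) = sum_i x_i * x_{i+1}: d inputs, ONE reverse pass gives all d partials.
d = 200
xs = [Var(float(i % 7 + 1)) for i in range(d)]
out = Var(0.0)
for i in range(d - 1):
    out = out + xs[i] * xs[i + 1]
backward(out)

# Analytically, d out / d x_i = x_{i-1} + x_{i+1} (boundary terms dropped).
print(xs[5].grad, xs[4].value + xs[6].value)
```

The reverse sweep is a constant multiple of the forward work, but note that `order` holds every intermediate `Var`: exactly the memory overhead discussed above.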

SPSA also has the implementation advantage of not requiring the model to be implemented in a language / framework which supports reverse-mode AD, so in the sense that it is more black-box it might also be useful in cases where it is infeasible to rewrite the model code to allow reverse-mode AD.

[1] Baydin, Pearlmutter, Radul and Siskind, “Automatic differentiation in machine learning: a survey”. https://arxiv.org/abs/1502.05767

[2] Griewank and Walther, “Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation”, Second Edition, Chapter 4: Memory Issues and Complexity Bounds. http://dx.doi.org/10.1137/1.9780898717761.ch4

[3] Maclaurin, Duvenaud and Adams, “Gradient-based Hyperparameter Optimization through Reversible Learning”. https://arxiv.org/abs/1502.03492