Here’s Michael Betancourt writing in 2015:
Leveraging the coherent exploration of Hamiltonian flow, Hamiltonian Monte Carlo produces computationally efficient Monte Carlo estimators, even with respect to complex and high-dimensional target distributions. When confronted with data-intensive applications, however, the algorithm may be too expensive to implement, leaving us to consider the utility of approximations such as data subsampling. In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo.
But then here’s Jost Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter in 2016:
Despite its successes, the prototypical Bayesian optimization approach – using Gaussian process models – does not scale well to either many hyperparameters or many function evaluations. Attacking this lack of scalability and flexibility is thus one of the key challenges of the field. . . . We obtain scalability through stochastic gradient Hamiltonian Monte Carlo, whose robustness we improve via a scale adaptation. Experiments including multi-task Bayesian optimization with 21 tasks, parallel optimization of deep neural networks and deep reinforcement learning show the power and flexibility of this approach.
So now I’m not sure what to think! I guess a method can be useful even if it doesn’t quite optimize the function it’s supposed to optimize? Another twist here is that these deep network models are multimodal, so you can’t really do full Bayes for them even in problems of moderate size, even before worrying about scalability. This suggests that we should think of algorithms such as that of Springenberg et al. as approximations, and that we should be doing more work on evaluating those approximations. To put it another way, when they run stochastic gradient Hamiltonian Monte Carlo, we should perhaps think of it not as a way of tracing through the posterior distribution but as a way of exploring that distribution, or some parts of it.
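For concreteness, here is a minimal sketch of what such a sampler looks like: a naive Euler discretization of the friction-corrected stochastic gradient HMC dynamics (in the style of Chen, Fox, and Guestrin, 2014), run on a toy one-dimensional Gaussian target where minibatch gradient noise is simulated by adding noise to the exact gradient. The step size, friction, and noise scale below are illustrative choices of mine, not values from either paper quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad_U(theta, noise_scale=0.5):
    """Gradient of U(theta) = theta**2 / 2 (standard Gaussian target),
    plus noise standing in for a subsampled-data (minibatch) estimate."""
    return theta + noise_scale * rng.normal(size=theta.shape)

def sghmc(theta0, n_iter=20000, eps=0.05, friction=1.0):
    """Euler steps of the friction-corrected dynamics:
    theta' = v,  v' = -grad U(theta) - friction * v + sqrt(2 * friction) * noise.
    The friction term is what keeps the stochastic gradient noise from
    blowing up the trajectory, at the cost of exactness."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    samples = np.empty((n_iter,) + theta.shape)
    for t in range(n_iter):
        theta = theta + eps * v
        v = (v - eps * noisy_grad_U(theta) - eps * friction * v
             + np.sqrt(2.0 * friction * eps) * rng.normal(size=v.shape))
        samples[t] = theta
    return samples

# "Exploring" a known target: after burn-in, draws should roughly
# resemble N(0, 1), but with no exactness guarantee.
draws = sghmc(np.array([3.0]))[5000:]
print(draws.mean(), draws.var())
```

The point of the toy is exactly the one at issue: nothing in this update corrects for the gradient noise or the discretization, so the draws only approximate the target, and how good that approximation is in a multimodal, high-dimensional posterior is the open evaluation question.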