Over the past few months, we’ve talked about modeling with particle physicists (Allen Caldwell), astrophysicists (David Hogg, who regularly comments here), and climate and energy usage modelers (Phil Price, who regularly posts here).

**Big Science Black Boxes**

We’ve gotten pretty much the same story from all of them: their models involve “big science” components that are hugely complex and provided by outside implementations from labs like CERN or LBL. Some concrete examples for energy modeling are the TOUGH2 thermal simulator, the EnergyPlus building energy usage simulator, and global climate model (GCM) implementations.

These models have the character of not only being black boxes, but taking several seconds or more to generate the equivalent of a likelihood function evaluation. So we can’t use something like Stan, because nobody has the person years required to implement something like TOUGH2 in Stan (and Stan doesn’t have the debugging or modularity tools to support large-scale development anyway). Even just templating out the code so that we could use automatic differentiation on it would be a huge challenge.

**Sampling and Optimization**

Not surprisingly, these researchers tend to focus on methods that can be implemented using only black box log probability evaluators.

Phil told us that he and his colleagues at LBL use optimizers based on ensemble techniques like the Nelder-Mead method. They then try to estimate uncertainty by using finite difference-based estimates of the Hessian of their answer.
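For concreteness, here's a minimal Python sketch of that workflow, assuming scipy is available; the Gaussian target below is just a cheap stand-in for an expensive black-box simulator like TOUGH2 or EnergyPlus:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for an expensive black-box simulator: the negative log
# posterior of a Gaussian with unknown mean and (log) scale.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=500)

def neg_log_post(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum((data - mu) ** 2) / sigma**2 + data.size * log_sigma

# 1. Derivative-free optimization with Nelder-Mead.
fit = minimize(neg_log_post, x0=np.zeros(2), method="Nelder-Mead")

# 2. Central finite-difference Hessian at the optimum.
def fd_hessian(f, x, h=1e-3):
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

# 3. Invert to get an approximate covariance (a Laplace approximation).
cov = np.linalg.inv(fd_hessian(neg_log_post, fit.x))
std_errs = np.sqrt(np.diag(cov))
```

The square roots of the diagonal of the inverted Hessian are the reported standard errors; of course, this is only as trustworthy as the quadratic approximation at the mode.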

*Bayesian Analysis Toolkit (BAT)*

Allen Caldwell and his group (Daniel Kollar and Kevin Kröninger are listed as the core developers) are behind the Bayesian Analysis Toolkit (BAT), which is based on Metropolis. BAT requires users to implement the model as a C++ class, but that class can call any external libraries it wants.

*emcee, the MCMC Hammer*

David Hogg and his group (Daniel Foreman-Mackey seems to be doing the heavy code lifting) are behind emcee, aka “the MCMC Hammer,” which is based on Goodman and Weare’s ensemble sampler, which was motivated by the Nelder-Mead method for optimization. emcee requires users to implement a log probability function in Python, which can then call C or C++ or Fortran on the back end.
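For intuition, here's a minimal numpy sketch of Goodman and Weare's stretch move, the core update in emcee. This is an illustration, not emcee's actual implementation, which among other things updates the ensemble in parallelizable half-batches:

```python
import numpy as np

def stretch_move_sampler(log_prob, p0, n_steps, a=2.0, seed=0):
    """Sketch of the Goodman & Weare (2010) stretch move.

    p0: (n_walkers, ndim) initial ensemble.
    Returns a chain of shape (n_steps, n_walkers, ndim)."""
    rng = np.random.default_rng(seed)
    walkers = np.array(p0, dtype=float)
    n_walkers, ndim = walkers.shape
    logp = np.array([log_prob(w) for w in walkers])
    chain = np.empty((n_steps, n_walkers, ndim))
    for t in range(n_steps):
        for k in range(n_walkers):
            # Pick a complementary walker j != k.
            j = rng.integers(n_walkers - 1)
            if j >= k:
                j += 1
            # Stretch factor z ~ g(z) proportional to 1/sqrt(z) on [1/a, a].
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            proposal = walkers[j] + z * (walkers[k] - walkers[j])
            logp_prop = log_prob(proposal)
            # Acceptance includes the z^(ndim - 1) Jacobian factor.
            log_accept = (ndim - 1) * np.log(z) + logp_prop - logp[k]
            if np.log(rng.random()) < log_accept:
                walkers[k] = proposal
                logp[k] = logp_prop
        chain[t] = walkers
    return chain
```

Note that the only thing the user supplies is `log_prob`, which is exactly why it plays so well with black-box models: that function can shell out to C, C++, or Fortran, or even an external simulator.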

We plan to add the Goodman and Weare ensemble method to Stan. We’re still working on how to integrate it into our sampling framework. Ensembles are tricky.

**Search Engine Optimization. Not!**

Perhaps we should’ve heeded Hadley Wickham’s advice, which he followed in naming “ggplot2”: pick a name with zero existing hits on Google. Instead, we have three hard-to-search-for names, “stan,” “emcee,” and “bat.” Of course, “bugs” takes the cake in this competition.

One of the most frustrating aspects of my time in particle physics was this underlying reliance on “black boxes”, both literally (optimizers and other analysis software) and figuratively (rarely questioned assumptions, both experimental and theoretical, and techniques that everyone took for granted). Unfortunately it is not uncommon for students to approach physics mechanically — don’t ask questions and don’t innovate too much. It’s far easier, more productive politically, and offers a facade of the objectivity they suppose they should be striving towards. This “black box” approach to science isn’t good for the students and it’s not good for the science.

When I started working with Andrew, one of his philosophies that really resonated with me was the folk theorem. Your model is an approximation (trying to track down theory references for my thesis, only to discover that no one had actually proved certain results that everyone assumed to be true, was pretty disheartening), so why not make it an approximation that’s not only consistent with the data but also much more computationally feasible? This requires really understanding your model (the data, the theory, everything); it’s incredibly difficult but by far the best way to maximize the utility of an analysis. It’s also the most fulfilling.

In science no model is sacrosanct, and neither is any “black box” algorithm. Stan comes close (HMC has a beautiful theoretical foundation demonstrating why it’s one of the only methods that will scale well with increasing model complexity) but in the end you have to keep asking questions and strive to find its limitations. Otherwise, for example, we might have given up on hierarchical models before rederiving non-centered parameterizations (i.e. the Matt trick)!
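As an aside for readers who haven't seen the trick: the non-centered parameterization replaces a parameter drawn as theta ~ normal(mu, tau) with theta = mu + tau * theta_raw, where theta_raw ~ normal(0, 1), so the sampled variable's geometry no longer depends on the hierarchical scale. A minimal numerical illustration (the particular values of mu and tau are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, tau = 1.5, 0.3
n = 100_000

# Centered: draw theta directly from its hierarchical distribution.
theta_centered = rng.normal(mu, tau, size=n)

# Non-centered ("Matt trick"): draw a standard normal and shift/scale.
# A sampler exploring theta_raw sees a fixed standard-normal geometry
# regardless of mu and tau, which is what rescues funnel-shaped posteriors.
theta_raw = rng.normal(0.0, 1.0, size=n)
theta_noncentered = mu + tau * theta_raw
```

The two parameterizations define the same distribution for theta; they differ only in which variables the sampler actually moves around in.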

P.S. I should point out that as bad as I’ve seen physics analyses get, the exceptions are phenomenal. For example, I love the discussions in Dave Hogg’s research blog (http://hoggresearch.blogspot.co.uk), even if I sometimes find myself yelling at the screen wishing I had been in the room to provide a counterpoint or two. And, honestly, most young students are inclined towards proper analyses; they just don’t have the training or the freedom to follow that inclination to awesome science.

My understanding of the motivation for samplers with black-box components was that it was done mainly so that users fitting models didn’t have to first spend several person-years reimplementing a domain model. Yes, the domain models may not be the best, but they’re there. And I wouldn’t discount the trust built up over years of users verifying that they do something reasonable.

One of the things Daniel and I discussed with Allen was that some of the underlying models were not identified in and of themselves. They involved things like simplex parameters. So we got off on a tangent of some of the things you could do, like reparameterize on the outside and apply a Jacobian adjustment, so that the parameters you sample are identified. I believe you once wrote something about that (who knew there was a video; nice lineup when you get to start a talk with “Radford just talked about HMC…”).
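As an illustration of the kind of reparameterization we discussed, here's a generic stick-breaking map from unconstrained space to a simplex, together with its log-Jacobian adjustment. This is a sketch of the idea, not Stan's exact transform (Stan's version differs in details like centering the logits):

```python
import numpy as np

def simplex_transform(y):
    """Stick-breaking map from unconstrained y (length K-1) to a
    point on the K-simplex, returning (x, log_jacobian).

    Sampling y and adding log_jacobian to the log density is
    equivalent to sampling x directly on the simplex."""
    K = len(y) + 1
    x = np.empty(K)
    log_jac = 0.0
    remaining = 1.0  # stick length left to break off
    for k, yk in enumerate(y):
        z = 1.0 / (1.0 + np.exp(-yk))  # logistic: break-off fraction
        x[k] = z * remaining
        # d x_k / d y_k = remaining * z * (1 - z); the Jacobian is
        # triangular, so its log-determinant is the sum of these.
        log_jac += np.log(remaining) + np.log(z) + np.log1p(-z)
        remaining -= x[k]
    x[-1] = remaining
    return x, log_jac
```

With the Jacobian adjustment added to the log density, an off-the-shelf sampler can work in the unconstrained y-space where the parameters are identified.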

Hahaha — can we have Bob blocked for linking to such vulgar material? :-)

My point is that blindly accepting domain models is the ultimate problem. Of course you shouldn’t have to rederive everything from scratch, discarding years of experience, but you have to at least understand those models so that you can question their assumptions and understand how they’ll interact with different algorithms. I feel like I spent most of my time in grad school fixing bad assumptions (or outright bugs) in software and analyses that other people had established, and I never would have found those if I hadn’t been trying to understand them. It was really, really hard, but in the end I could present my work confident that I could identify all of the relevant assumptions.

What I love about Stan is two-fold: first, we abstract the details of the algorithm away, letting users with the domain-specific knowledge build appropriately complex models; second, we leverage the rich diagnostics of HMC to identify poor interactions and potential sources of improvement. It’s very much an iterative process, and it seems like every thread on stan-dev ends with some variant of “Oh, I understand my model so much better now.”

Wow! That reparameterization of the Dirichlet is beautiful! To me it’s reminiscent of MathWorld’s derivation of the Beta function — a neat trick I’ve loved ever since I first read it — but I would never have made the connection to MCMC and the Dirichlet.

Michael: Any time you feel like yelling, feel free to comment on the blog; comments are always open!

P.S. I spent the summer with my head down getting work done and I missed all the fun on this blog!

>”They then try to estimate uncertainty by using finite difference-based estimates of the Hessian of their answer.”

That seems consistent with a Cramer-Rao lower bound, yes?

Provided that the mode is sufficiently well-behaved that the asymptotic behavior satisfies the conditions of the proof. Of course, complex distributions in high dimensions don’t often behave well.

> Of course, complex distributions in high dimensions don’t often behave well.

Understood. A better phrasing of my comment/question would have been “Estimating the Cramer-Rao lower bound is problematic when you’re dealing with a high-d complex distribution. Given a high-d complex distribution, estimating uncertainties using finite-difference-based estimates of the Hessian is analogous to computing the CRLB and would produce (essentially) the same result for a low-d well-behaved distribution, yes?”

Essentially; given enough regularity conditions the Hessian will converge to the Fisher information and will yield an equivalent CRLB.
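A tiny numerical illustration of that equivalence, using a Poisson model where everything is known in closed form (the plug-in Fisher information at the MLE is n / lambda-hat, and the corresponding CRLB is lambda-hat / n):

```python
import numpy as np

rng = np.random.default_rng(3)
lam_true = 4.0
x = rng.poisson(lam_true, size=2000)
n = x.size

# Poisson negative log-likelihood, dropping the constant log(x!) term.
def nll(lam):
    return n * lam - x.sum() * np.log(lam)

lam_hat = x.mean()  # the MLE

# Finite-difference second derivative (observed information) at the mode.
h = 1e-4
obs_info = (nll(lam_hat + h) - 2 * nll(lam_hat) + nll(lam_hat - h)) / h**2

fisher_info = n / lam_hat  # Fisher information evaluated at the MLE
crlb = 1.0 / fisher_info   # Cramer-Rao lower bound on var(lam_hat)
```

In this well-behaved one-dimensional case the finite-difference Hessian and the plug-in Fisher information agree, which is exactly the regime where the finite-difference recipe is defensible.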

> “…given enough regularity conditions the Hessian will converge to the Fisher information…”

Thanks.

Related topic: Pros and cons of Nelder-Mead relative to simulated annealing? You could estimate the Hessian from the SA history. (I was PI on a project where we needed to solve a variant of the art gallery problem. In addition to recommending observer locations we had to report the tolerances associated with our recommendations. We used SA to determine locations and map the cost function, then numerical estimates of the Hessian for the tolerance calculation.)

Simulated annealing is a bit less sensitive to tuning (NM has lots of potential variants) but they can both run into problems when the curvature changes sharply. I use nested sampling for these kinds of searches, but then again I don’t optimize when I can marginalize.

Recommended refs for NM? FWIW, I hadn’t heard of it before this post and my only experience with SA is the one I summarized above. Gauss-Newton and Levenberg-Marquardt have been well-suited for almost all of the optimization problems I’ve had to deal with.

If you can compute derivatives then you really don’t need NM or SA. These algorithms are for when the gradients are infeasible and you need to search “blindly”.

> If you can compute derivatives then you really don’t need NM or SA. These algorithms are for when the gradients are infeasible and you need to search “blindly”.

Understood. My interest in SA and NM arises from the fact that I sometimes encounter problems where GN and LM don’t work. I’d like to develop a better sense of what tools I could apply to problems where GN and LM won’t do the job. (SA worked well in the case where I applied it but I can’t claim to have much insight into whether I ignored better alternatives.) Looking at it from a slightly different perspective, having a more complete toolkit allows one to attack a broader (and potentially more interesting) range of problems.

Just so I understand the process, does this mean…

1. Stan proposes a parameter set i

2. Big model takes parameter set i and generates log-likelihood i

3. Stan receives log-posterior i and generates parameter set i+1, chosen to best sample the parameter space

4. Iterate until some metrics suggest the posterior is well sampled.
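Those four steps describe the generic black-box loop that samplers like BAT run. A minimal random-walk Metropolis version of it, with a cheap toy target standing in for the expensive model:

```python
import numpy as np

def black_box_metropolis(log_post, theta0, n_iter, step=0.8, seed=0):
    """Random-walk Metropolis driven purely by black-box log_post calls."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)  # in practice this is the expensive simulator call
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        proposal = theta + step * rng.normal(size=theta.shape)  # step 1
        lp_prop = log_post(proposal)                            # step 2
        if np.log(rng.random()) < lp_prop - lp:                 # step 3
            theta, lp = proposal, lp_prop
        chain[i] = theta                                        # step 4
    return chain
```

The toy standard-normal target in the test below is just a placeholder for "big model generates a log density"; the key point is that the sampler needs nothing but those function values.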

I think this could be really useful. Our models are probably cute toys compared to the models described in the post, but still take a while to run.

One additional question is ease of parallelization — I currently use Raftery’s IMIS approach because I can spread the work out over a cluster easily. Even a less efficient algorithm could look really good if you can spread out the task of evaluating the likelihood over a lot of CPUs.

The procedure (1-4) is how the algorithms Bob mentioned, BAT and emcee, work, but not how Stan works. In particular, HMC requires not only the value of the log posterior but also its gradient (and even higher-order derivatives for more advanced methods). Now, we don’t require that the user specify these in Stan; instead we compute them automatically using automatic differentiation. This is why users are limited to specifying models in our modeling language.
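To make the difference concrete, here's a bare-bones sketch of a single HMC transition. Note the repeated calls to grad_log_post inside the leapfrog integrator, which is exactly what a value-only black box can't supply. This is a teaching sketch, not Stan's implementation, which adds step-size and mass-matrix adaptation, NUTS, and so on:

```python
import numpy as np

def hmc_step(log_post, grad_log_post, theta, eps=0.1, n_leapfrog=20, rng=None):
    """One HMC transition for a target with log density log_post."""
    if rng is None:
        rng = np.random.default_rng()
    p = rng.normal(size=theta.shape)  # sample momentum
    theta_new, p_new = theta.copy(), p.copy()
    # Leapfrog integration of the Hamiltonian dynamics: each position
    # update needs a fresh gradient evaluation.
    p_new = p_new + 0.5 * eps * grad_log_post(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new = theta_new + eps * p_new
        p_new = p_new + eps * grad_log_post(theta_new)
    theta_new = theta_new + eps * p_new
    p_new = p_new + 0.5 * eps * grad_log_post(theta_new)
    # Metropolis correction for the numerical integration error.
    h_old = -log_post(theta) + 0.5 * (p @ p)
    h_new = -log_post(theta_new) + 0.5 * (p_new @ p_new)
    if np.log(rng.random()) < h_old - h_new:
        return theta_new
    return theta
```

One value-only evaluation per iteration versus n_leapfrog gradient evaluations per iteration: that's the price of the dramatically better scaling with dimension.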

Parallelization is great, but you have to remember that the linear scaling of more cores will never beat the exponential (if not combinatorial!) scaling of the computational difficulty of algorithms like random-walk Metropolis. HMC isn’t susceptible to that poor dimensionality scaling, but it’s also far less trivially parallelizable, although we’re continually making such considerations.

How about Gaussian process emulators etc.? You could use the gradients from the Gaussian process (http://www.kyb.mpg.de/publications/pdfs/pdf2080.pdf) or use it for design-of-experiments-style optimisation of the next parameter values at which to run the model (there are several papers, but in my haste I could not find them).
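A sketch of the first idea, assuming a squared-exponential kernel: the gradient of the GP posterior mean is available in closed form from the kernel derivative, so the emulator can supply the gradients that the underlying simulator can't:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel matrix between point sets A (n,d), B (m,d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

# Design points where the "expensive simulator" (here just a sine) was run.
X = np.linspace(-2.0, 2.0, 9)[:, None]
y = np.sin(1.5 * X[:, 0])

ell, jitter = 0.7, 1e-6
alpha = np.linalg.solve(rbf(X, X, ell) + jitter * np.eye(len(X)), y)

def gp_mean(x):
    """Posterior mean of the emulator at a single point x of shape (d,)."""
    return float(rbf(x[None, :], X, ell)[0] @ alpha)

def gp_mean_grad(x):
    """Analytic gradient of the posterior mean: sum_i alpha_i dk(x, x_i)/dx.

    For the RBF kernel, dk(x, x_i)/dx = k(x, x_i) * (x_i - x) / ell^2."""
    k = rbf(x[None, :], X, ell)[0]  # (n,)
    return ((X - x[None, :]) / ell**2 * k[:, None]).T @ alpha
```

These emulator gradients could then feed an HMC-style sampler, subject to the caveat Michael raises below about how fast the emulator degrades with dimension.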

Aki,

The problem in practice is that the deviation between the true distribution and the emulator grows with dimension (unless you add exponentially more points to the GP) and the accuracy of the trajectory rapidly deteriorates. High dimensions are no fun.

Bob’s story didn’t mention dimensionality, so how high-dimensional are the problems you’re interested in?

Good question! Hundreds of dimensions are not uncommon, and I think some of the “big” problems can reach thousands of parameters. My main concern is building algorithms that scale reasonably well with dimension so that there’s not a “phase transition” where an algorithm all of a sudden stops working. Unfortunately, lots of approximations and ad hoc methods that work really well in low dimensions fall apart quickly as the dimensions increase. Another reason to love HMC!

I have what I consider to be a pretty simple toy problem that is nevertheless fairly typical of this kind of thing. It’s part of my dissertation which I’m just finishing up. If you guys would like to discuss using it as a testbed for development of methods I would be happy to discuss it, especially if there were some consulting fees we could arrange under your grant or something. The problem goes something like this:

I ran about 120 molecular dynamics simulations of elongated rods of molecules (these took about a month last year, and I collected lots of statistics about the runs). I imposed random pulse wave initial conditions on the rods, and then I observed the rods vibrating by capturing statistics of the molecular count, velocities, and kinetic and potential energies in about 20 slices of each rod for 8 or so periods of wave oscillation. I have an ODE which is designed to emulate these 20 statistical averages through time. The ODE depends on inferring quantities like the exact length of the rod, an exact timescale that depends on the density of the rod, and a value for a certain diffusion coefficient in the ODE, which depends on temperature, observational size, and the specifics of the crystal structure of the rod (taken to be random).

In the FULL inference you’d want to fit the 20 velocities for each rod (120 rods) across all the time points (I have maybe 1000 time points, I think). So let’s say each evaluation of the likelihood takes solving 120 sets of 40-dimensional ODEs, giving evaluations at 1000 time points, and involves potentially around 1200 unknown parameters. In the version I actually carried out for my dissertation it was much reduced, along the lines of ABC methods, in which the wave energy was used as an approximately sufficient statistic.

All this to determine something about how thermal noise induces wave dissipation and how that dissipation scales with the size of the object.

This is more or less what I’d call a “small” problem. For a “big” problem, like say tomography of the earth’s crust from seismic observations, you might have a few thousand earthquakes each arriving at a few hundred observation stations, along paths parameterized by a hundred coefficients in a series parameterization which travel through a rock structure parameterized by a thousand coefficients. So let’s say potentially a couple million or 10 million parameters. These would all be more or less nuisance parameters except for the thousand rock structure parameters.

I don’t think the dimensionality is the problem in the big science problems. Though in the kind of exoplanet finding application that David Hogg is up to, there are 20K stars under observation for years and a fairly complicated model for periodicity to see if you see planets obscuring the star’s brightness. So that’s pretty large.

For the kinds of things we’re working on, latent parameterizations in the thousands or tens of thousands are not uncommon. For instance, running IRT on a large number of students nested in schools for a decent number of questions can quickly yield thousands of parameters. But nobody’s trying to fit covariance on them all! So it’s just O(K), not O(K^2).

Bob: “I don’t think the dimensionality is the problem in the big science problems.”

As Einstein put it: “If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”

Search engine optimisation: if you think “BUGS” is hard to search for, you must never have needed help with “R”!