The fundamental abstractions underlying BUGS and Stan as probabilistic programming languages

Probabilistic programming languages

I think of BUGS and Stan as probabilistic programming languages because their variables may be used to denote random variables, with function application doing the right thing in terms of propagating randomness (usually encoding uncertainty in a Bayesian setting). They are not probabilistic programming languages that provide an object language for inference; in both languages, programs define probabilistic models and inference is carried out externally over these models conditioned on data.

BUGS programs

The fundamental abstraction of a BUGS program is that when instantiated with data, it produces a directed graphical model over which inference may be performed with generalized (not necessarily conjugate) Gibbs sampling.

A BUGS program is a sequence of statements involving a set of free variables. The statements may be assignments to (indexed) variables, sampling statements, or for loops with fixed bounds.

Some of the variables are known (either constants or observed, modeled data) and provided to the program. The remaining variables are either loop index variables or unknowns (parameters and missing data). Modeled-data variables denote random variables with known values (i.e., they are used for conditioning), whereas unknown parameters and missing data denote random variables with unknown values.
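
For concreteness, here is a minimal sketch of a BUGS program in this style (the model and variable names are illustrative, not taken from any particular example):

    model {
      for (i in 1:N) {
        y[i] ~ dnorm(mu, tau)         # sampling statement inside a loop
      }
      mu ~ dnorm(0.0, 1.0E-3)         # prior on the unknown mean
      tau ~ dgamma(1.0E-3, 1.0E-3)    # prior on the unknown precision
      sigma <- 1 / sqrt(tau)          # assignment defining a derived scale
    }

Here N and y are supplied as data, i is a loop index, and mu, tau, and sigma are unknowns; because y is modeled data, its nodes are instantiated to their observed values and used for conditioning.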

Given data to instantiate a subset of the variables, a BUGS program defines a directed graphical model in which each node is either known (instantiated to a fixed value) or unknown. The sampling statements, assignment statements, and the operators and functions used are sufficient to infer the shape of each variable in a BUGS program. Conditions such as each variable being assigned to or sampled at most once are also enforced (with the exception that modeled data may be assigned to and then sampled).

The Markov blanket of each unknown variable in the graphical model can be used to construct an efficient probability function for that variable conditioned on the other variables and the constants. This in turn may be used to implement a (generalized) Gibbs sampler that iterates through the unknowns, drawing a new sample for each conditioned on the current values of the others.
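
In symbols (the standard identity for directed graphical models, nothing BUGS-specific): writing pa(v) for the parents of node v, the full conditional of an unknown \theta_j reduces to its own prior term times the terms of its children,

    p(\theta_j \mid \text{all other nodes})
      \propto p(\theta_j \mid \mathrm{pa}(\theta_j))
              \prod_{v : \theta_j \in \mathrm{pa}(v)} p(v \mid \mathrm{pa}(v)).

Every factor not shown is constant in \theta_j and drops out, which is what makes the per-variable Gibbs updates cheap.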

Because of its declarative nature, the order of statements in a BUGS program doesn’t matter. The contributions from all of the statements are simply used to build a graph, which is then interpreted during sampling.
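
For example, permuting the statements of the sketch above leaves the model unchanged, even though sigma now refers to tau before the statement that gives tau its distribution (again, purely illustrative):

    model {
      sigma <- 1 / sqrt(tau)          # refers to tau before its statement
      tau ~ dgamma(1.0E-3, 1.0E-3)
      mu ~ dnorm(0.0, 1.0E-3)
      for (i in 1:N) {
        y[i] ~ dnorm(mu, tau)
      }
    }

Both orderings define exactly the same graph.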

Stan programs

The fundamental abstraction of a Stan program is that when instantiated with data, it produces a differentiable log density function over which inference may be performed with adaptive Hamiltonian Monte Carlo. Specifically, reverse-mode automatic differentiation is used to execute the log density function and build up a complete expression graph, which may be differentiated by propagating the chain rule from the result backwards through intermediate quantities to the unknown variables.
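
In outline (this is standard reverse-mode accumulation, not Stan-specific notation): each node u of the expression graph gets an adjoint \bar{u} = \partial \ell / \partial u; seeding the result node with \bar{\ell} = 1 and sweeping backwards gives

    \bar{u} = \sum_{v \in \mathrm{children}(u)} \frac{\partial v}{\partial u} \, \bar{v},

so the adjoints that accumulate on the unknown variables are exactly the gradient of the log density with respect to the parameters, which is what Hamiltonian Monte Carlo needs.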

A Stan program is a sequence of blocks, with each block consisting of variable declarations and, in some cases, statements. Unlike BUGS, Stan is strongly typed, with every variable declared with a type (integer, real, vector, row vector, matrix, or array of a simpler type) and with the size of each of its dimensions. Also unlike BUGS, the use of each variable is specified by the block in which it is declared: data, transformed data, parameters, transformed parameters, model, or generated quantities.
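
Roughly the same illustrative model as the BUGS sketch above, written with Stan’s blocks and typed declarations (the priors are chosen loosely rather than as an exact translation):

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;       // constrained to be positive
    }
    model {
      mu ~ normal(0, 100);       // Stan parameterizes the normal by its
      sigma ~ cauchy(0, 5);      // standard deviation, not a precision;
      y ~ normal(mu, sigma);     // the bound on sigma makes this half-Cauchy
    }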

Because of its imperative nature, variables in Stan must be defined before they are used. Data and parameter variables are defined externally (the data variables being provided by the user before inference begins, and the parameter values being provided by the sampling or optimization algorithm). This has caused a great deal of confusion among those with experience in BUGS who try to translate models to Stan.

Historical note

The original design of the Stan language was heavily influenced by BUGS, though its treatment of matrix types was more heavily influenced by MATLAB. We knew we needed to use Hamiltonian Monte Carlo for scalability, so we needed a language that would let us define a differentiable log density. I just thought about how to translate a sequence of statements in the BUGS language. Loops translated to loops, assignments to assignments, and sampling statements incremented the target log density (usually of a Bayesian posterior). It only later became apparent that we weren’t restricted to graphical models—we could reassign to variables, introduce conditionals, general while loops, user-defined functions, etc.
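
Concretely, in today’s Stan the two forms below are interchangeable for inference; the sampling-statement form just drops additive constants that don’t depend on parameters:

    // These two statements contribute the same parameter-dependent term
    // to the target log density; write one or the other, not both.
    y ~ normal(mu, sigma);
    target += normal_lpdf(y | mu, sigma);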

The design of the Stan math library was heavily influenced by the Trilinos package Sacado for forward- and reverse-mode automatic differentiation. If the log density function is written with a templated scalar type, instantiating it with an autodiff type allows gradients to be computed with only constant overhead relative to evaluating the underlying log density function.

My main fear for the early Stan language was that users would pillory me for making them declare in the program itself whether variables were known or unknown. BUGS programs, like joint distributions, are agnostic about which of their variables are observed. Indeed, this has turned out to be a bottleneck in writing Stan programs to deal with missing data, which is exactly where BUGS programs lean on their flexibility.

I was optimistic that type declarations would be useful once users grew accustomed to them; I forgot my users weren’t low-level product programmers and weren’t used to thinking about types. So we continue to get a lot of complaints about Stan discriminating between vectors and arrays and between integers and real values.

16 thoughts on “The fundamental abstractions underlying BUGS and Stan as probabilistic programming languages”

  1. I really like Stan’s strong types! But it occurs to me that even though I’m a statistician, my current position and the immediately previous one (both “data scientist”) do involve a lot of low-level product programming, so maybe I’m not typical of your user-base.

    Incidentally, a Stan model will be the basis of (one component of) an anomaly detection tool developed for use by the Bank of England. The guy who gritted his teeth and undertook to put it in a Docker container and get our continuous integration tools to work with it was not happy with me, even after I told him he’d done out-Stan-ding work; somehow the pun didn’t improve his outlook…

    http://www.bankofengland.co.uk/Documents/fintech/mindbridgeaipoc.pdf

  2. I’m with Corey. Strong types are unqualifiedly better. And the stronger the better. Of course my preferred programming language is Haskell, which may locate me somewhere out of the mainstream in terms of preferences. I also like a Matlab feel.

    • The one thing we haven’t been able to do is deal with typing for constraints. The run-time type system only deals with integers, reals, vectors, row vectors, matrices, and arrays of other types. It’d be nice to say that a multivariate-normal takes a positive-definite matrix argument, but we don’t have a good way to enforce that. The main problem is that it’s not clear how we’d enforce typing for local variables; for block variables, the constraints are either used for transforms (parameters) or to validate (data, transformed data, transformed parameters, generated quantities).
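
      For illustration (these are standard Stan declarations, not new syntax): a constraint on a parameter drives a change of variables, while the same constraint on data is just a run-time check:

          data {
            int<lower=1> K;
            cov_matrix[K] S;       // validated once, when the data is read
          }
          parameters {
            cov_matrix[K] Sigma;   // transformed from an unconstrained space,
          }                        // so the sampler only proposes valid values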

      I don’t think we’ll move to an ML-style type system (afraid I’ve never used Haskell), but we will be adding functional programming through simple well-founded typing where (A -> B) is a type if A and B are types. One of the things we’ll introduce before a proper type system is a parallel map function with signature ((A -> B) x A[]) -> B[].

      • Simply moving to an ML (technically Hindley-Milner) type system isn’t a good idea (and I say this as someone who is happy to go on the record that all programming languages suck to the extent to which they aren’t SML/Haskell – or maybe Lisp, at least on Tuesdays). The one area where H-M typing does not work very well is where problems have a numerical linear algebra feel. Much better to work on the Matlab (or better APL – APL did it right, Matlab was a step backwards) model.

        Theoretically, there could be a big win in a Haskell monad interface to an API for Stan – certainly the Haskell linear programming monad (or at least the one I have used) is unbelievably flexible and expressive, but let’s face it, that win is likely to stay theoretical, and anyway Stan models are much smaller than linear programming models can be, so the win is likely smaller too.

        • Missed adding – the sort of functions you talk about look (to me at least) to be precisely the sorts of things that would work with an APL model.

        • I haven’t thought much about more Lisp/ML-like type systems since I was sitting in on Milner’s (and others’) lectures on parallel type systems back in grad school in the mid-80s at Edinburgh :-)

          I’d love it if someone could explain, in applied, non-category-theoretic terms, why monads are so attractive theoretically and how they are useful practically. If category theory is the only way they can be explained, they’ll certainly “stay theoretical.” Our users aren’t going to know lambda calculus either, but we are going to add closures for lambdas, so maybe we’ll nudge them a bit in that direction.
