The fundamental abstractions underlying BUGS and Stan as probabilistic programming languages

Probabilistic programming languages

I think of BUGS and Stan as probabilistic programming languages because their variables may be used to denote random variables, with function application doing the right thing in terms of propagating randomness (usually encoding uncertainty in a Bayesian setting). They are not probabilistic programming languages that provide an object language for inference; in both languages, programs define probabilistic models and inference is carried out externally over these models conditioned on data.

BUGS programs

The fundamental abstraction of a BUGS program is that when instantiated with data, it produces a directed graphical model over which inference may be performed with generalized (not necessarily conjugate) Gibbs sampling.

A BUGS program is a sequence of statements involving a set of free variables. The statements may be assignments to (indexed) variables, sampling statements, or interval-bound loops.

Some of the variables are known (either constants or observed, modeled data) and provided to the program. The remaining variables are either loop index variables or unknown variables (parameters and missing data). Modeled data variables denote random variables with known values (i.e., they are used for conditioning), whereas unknown paramters or missing data denote random variables with unknown values.

Given the data to instantiate a subset of the variables, a BUGS program defines a directed graphical model where the nodes are either known and instantiated to fixed values or unknown. The sampling statements, assignment statements, and operators and functions used are sufficient to infer the shapes of each of the variables in a BUGS program. Conditions such as variables only being assigned to or sampled from once are also enforced (with the exception that modeled data may be assigned to, then sampled).

The Markov blanket of each unkown variable in the graphical model can be used to construct efficient probability functions for each unknown variable when conditioned on the other unknown variables and constants. This in turn may be used to implement a (generalized) Gibbs sampler that iterates through the unknowns, drawing a new sample for each conditioned on the current values of the other variables.

Because of its declarative nature, the order of operations in BUGS doesn’t matter. All that happens is that the contributions from all of the statements are used to build a graph, which is then interpreted during sampling.

Stan programs

The fundamental abstraction of a Stan program is that when instantiated with data, it produces a differentiable log density function over which inference may be performed with adaptive Hamiltonian Monte Carlo. Specifically, reverse-mode automatic differentiation is used to execute the log density function and build up a complete expression graph, which may be differentiated by propagating the chain rule from the result backwards through intermediate quantities to the unknown variables.

Stan programs form a sequence of blocks, with each block consisting of variable declarations and in some cases also statements. Unlike BUGS, Stan is strongly typed, with every variable being declared for type (integer, real, vector, row vector, matrix, or array of simpler type) and the size of each of its dimensions. Unlike BUGS, the use of each variable is also specified by the block in which it is defined, data, transformed data, parameter, transformed parameter, model, or posterior generated quantity.

Because of its imperative nature, variables must be defined in Stan before they are used. Data and parameter variables are defined externally (the data variables being provided by the user before inference begins, and the parameter values being provided by the sampling or optimization algorithm). This has caused a great deal of confusion among those with experience in BUGS who try to translate models to Stan.

Historical note

The original design of the Stan language was heavily influenced by BUGS, though its treatment of matrix types was more heavily influenced by MATLAB. We knew we needed to use Hamiltonian Monte Carlo for scalability, so we needed a language that would let us define a differentiable log density. I just thought about how to translate a sequence of statements in the BUGS language. Loops translated to loops, assignments to assignments, and sampling statements incremented the target log density (usually of a Bayesian posterior). It only later became apparent that we weren’t restricted to graphical models—we could reassign to variables, introduce conditionals, general while loops, user-defined functions, etc.

The design of the Stan math library was heavily influenced by the Trilinos package Sacado for forward- and reverse-mode automatic differentiation. If the log density function involved a templated scalar type, instantiating it with an autodiff type allows gradient calculation to be carried out with only constant overhead relative to the underlying log density function.

My main fear for the early Stan language as that users would pillory me for making them declare whether variables were known or unknown in the program itself. BUGS programs, like joint distributions, are agnostic about which of its variables are observed. Indeed, this has turned out to be a bottleneck in writing Stan programs to deal with missing data, which is where BUGS programs tend to lean on their flexibility.

I was optimistic that type declarations would be useful once users grew accustomed to them; I forgot my users weren’t low-level product programmers and weren’t used to thinking about types. So we continue to get a lot of complaints about Stan discriminating between vectors and arrays and between integers and real values.