Hello, world! Stan, PyMC3, and Edward

Being a computer scientist, I like to see “Hello, world!” examples of programming languages. Here, I’m going to run down how Stan, PyMC3 and Edward tackle a simple linear regression problem with a couple of predictors.

No, I’m not going to take sides—I’m on a fact-finding mission. We (the Stan development team) have been trying to figure out whether we want to develop a more “pythonic” interface to graphical modeling in Stan. By the way, if anyone knows any guides to what people mean by “pythonic”, please let me know—I’m looking for something like Bloch’s Effective Java or Myers’s Effective C++. Rather than reinventing the wheel, I’ve been trying to wrap my head around how PyMC3 and Edward work. This post just highlights some of my observations. Please use the comments to let me know if I’ve misunderstood or misrepresented something in PyMC3 or Edward; I really want to understand what they’re doing more clearly. If I’m way off base, I’ll update the main post.

Bayesian linear regression

I’ll use a simple linear regression in all three frameworks to give you an idea of what it looks like. That is, we’ll use the sampling distribution

$y_n \sim \mbox{Normal}(\alpha + \beta_1 x_{n,1} + \beta_2 x_{n,2}, \sigma)$

and the priors

$\alpha, \beta_1, \beta_2 \sim \mbox{Normal}(0, 10)$

$\sigma \sim \mbox{Normal}(0, 1)$, restricted to $\sigma > 0$

In Edward, I went with their example prior, which is a lognormal on variance,

$\log \, \sigma^2 \sim \mbox{Normal}(0, 1)$, again restricted to $\sigma > 0$.

Hello, world! Stan

data {
int N;
vector[N] y;
matrix[N, 2] x;
}
parameters {
real alpha;
vector[2] beta;
real<lower=0> sigma;
}
model {
alpha ~ normal(0, 10);
beta ~ normal(0, 10);
sigma ~ normal(0, 1);
y ~ normal(alpha + x * beta, sigma);
}


Hello, world! Edward

import edward as ed
import numpy as np
import tensorflow as tf

from edward.models import Normal

X = tf.placeholder(tf.float32, [N, 2])
sigma = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
beta = Normal(loc=tf.zeros(2), scale=tf.ones(2))
alpha = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, beta) + alpha, scale=sigma * tf.ones(N))


I’m not 100% sure this would work—I borrowed pieces of examples from their Supervised Learning (Regression) tutorial and their Linear Mixed Effects Models tutorial.

Hello, world! PyMC3

This is copied directly from the official Getting Started with PyMC3 tutorial:

from pymc3 import Model, Normal, HalfNormal

basic_model = Model()

with basic_model:

# Priors for unknown model parameters
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)

# Expected value of outcome
mu = alpha + beta[0]*X1 + beta[1]*X2

# Likelihood (sampling distribution) of observations
Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)


It’s not quite standalone as is—there are free variables Y, X1, X2. In the tutorial, they make sure the necessary data variables (Y, X1, X2) are defined in the environment before attempting to define the model. So they start their tutorial by running this simulation first:

import numpy as np
import matplotlib.pyplot as plt

# Initialize random number generator
np.random.seed(123)

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma


So you really need to consider the definitions of Y, X1, X2 as part of the model specification. And keep in mind that alpha, sigma, beta here are not the same variables as defined in the model scope previously; they are masked in the basic_model environment by the definitions of those variables there.

Model size

Both Edward and PyMC3 model definitions are substantially shorter than Stan’s. That’s largely because of Stan’s standalone static type definitions—the actual model density is the line-for-line similar in all three interfaces.

What’s happening in both PyMC3 and Edward is that the distribution functions are defining the shapes and sizes of the variables. And Python itself is dynamically typed, so variables just get their types from what is assigned to them.

Joint density model abstraction and data binding

In both Stan and Edward, the program defining a model defines a joint log density $\log p(y, \theta)$ that acts as a function from data sets to concrete posterior densities. In both Stan and Edward, the language distinguishes data variables from parameter values and provides an object-level representation of data variables.

In PyMC3, the data is included as simple Python types in the model objects as the graph is built. So to get a model abstract, you’d have to write a function that takes the variables as arguments then returns the model instantiated with data. The definition of the deterministic node mu here is in terms of the actual data vectors X1 and X2—these aren’t placeholders, their values are used from the containing environment. The definition of the stochastic node Y_obs explicitly includes actual data through observed=Y. This maneuver is necessary because they couldn’t assign a stochastic node to Y (well, they could, seeing as it’s Python, but it wouldn’t do what a naive user might expect). The parameter values from the simulation (alpha, beta, sigma) are not used in the model definition—rather, they’re masked (redefined) in the scope of the basic_model using the with statement.

I wonder how much of this behavior of PyMC3 is due to building on top of Theano (the symbolic differentiation package they use for gradients). I couldn’t figure out how to do code generation autodiff or symbolic autodiff any other way back when I was thinking about it and it’s one of the main reasons we went with templated autodiff instead.

Like Stan, Edward defines a function rather than binding in the data directly to the model. The tf.placeholder() business is defining lazy arguments to the model (a lambda abstraction in more formal terms). As with Stan, Edward then lets you plug in data for the placeholders to produce an instantiated model implementing the posterior density determined by conditioning on that data. (Here’s the TensorFlow doc on placeholder.)

In BUGS and JAGS, the model definition didn’t specify what variables are data and which is parameters, but it is fixed when the model is run with data by inspecting the data.

The compilation bottleneck for Stan is when the program is compiled into the joint density (a C++ executable in CmdStan, an Rcpp object in RStan, or a Cython object in Python). Adding the data is fast as it only binds the data $y$ in the joint density to create a concrete posterior density; in implementation terms, we construct an instance of an immutable C++ class from the data. I’m pretty sure Edward would work the same way. In PyMC3, the compilation down to Theano must only happen after the data is provided; I don’t know how long that takes (seems like forever sometimes in Stan—we really need to work on speeding up compilation).

Variable sizes and constraints inferred from distributions

In PyMC3, shape=2 is what determines that beta is a 2-vector.

The PyMC3 program also explicitly uses the half-normal distribution because they implicitly use the sampling distribution to define constraints on the parameters, so that they can use the same kind of underlying unconstraining transforms as Stan under the hood in order to run HMC on an unconstrained space.

Edward doesn’t seem to support constraints (at least not in their intro tutorials—they transform variables manually, which is easy for lognormal, but substantially harder elsewhere). At least that’s what I’m guessing is the reason behind their formulation of the prior on scale, tf.sqrt(tf.exp(tf.Variable(tf.random_normal([])))).

Edward is also different than PyMC3 and Stan in that it broadcasts up the parameters so that they are all the same size. That’s the purpose of the size in scale=tf.ones(2) in the expression defining beta and also the purpose of the multiplication in scale=sigma * tf.ones(N) in the expression defining y.

Stan and PyMC3 use more standard vectorized functions that broadcast internally. Also, Stan, like BUGS and JAGS, allows truncations for univariate distributions. It means Stan can get away without a separate half-normal distribution. I don’t know if truncation is possible in either Edward or PyMC3.

Non-trivial use of embedding?

One nice aspect of embedded languages is that you can use all the language environment tools and can use the language itself. One commonly mentioned benefit is autocompletion in the REPL environment (interactive Python interpreter). We can do a bit of that in Stan in emacs and in Rstudio for R, but it’s hardly the smooth embedding of PyMC3 or Edward.

I couldn’t find examples in either Edward or PyMC3 that make non-trivial use of the embedding in Python. All the examples are just scripts (sequences of Python statements). I’d like to see an example in which they take advantage of being embedded in Python to build something like a hierarchical model component that could be used with other models. We can’t do that with a function in Stan because functions can’t introduce new parameters. I don’t quite know enough Python to pull off an example myself, but it should be possible. I asked a few of the PyMC3 developers and never got a concrete reference, so please comment if you have a working example somewhere I can cite.

Type checking and data types

Stan enforces static typing on all of its variables. It then provides a matrix arithmetic style that is like that of R and MATLAB. So you just multiply a matrix times a vector and you get a vector. Argument types like requiring a matrix argument as the covariance parameter in a multivariate normal are enforced statically, too. Size constraints and content constraints like positive definiteness are enforced at run time.

With PyMC3 and Edward, we’re in a slightly different situation. Although Python itself is dynamically typed, the random variables in these languages resolve as class instances in their respective libraries. I don’t know how they do error checking—hopefully at object construction time rather than when the model is executed.

Stan has Cholesky factor parameters for correlation and covariance matrices, simplex parameters, ordered and unit-vector parameters. I’m not sure if or how these would be handled in PyMC3 or Edward. Having the Cholesky factors is key for efficient multivariate covariance estimation that’s numerically stable.

One place Stan is lacking is in sparse and ragged data strutures or tuple data structures for multiple returns. We have a single sparse operation (sparse matrix times dense vector, which is the one you need for efficient sparse GLM predictor matrices), but no direct sparse data types. We require users to code their own raggedness in long (melted) form. Is there a way to do any of that more easily in PyMC3 or Edward?

Stan has rather clunky handling of missing data compared to BUGS or JAGS (two languages I know pretty well, unlike PyMC3 and Edward!). PyMC3 gives you a halfway house in that you pass in a mask if the data that’s missing is missing-at-random from an observation vector or matrix (Python doesn’t have the undefined (NA) data structure from R). I’ve been trying to figure out how to make this easier in Stan from the get go.

Whose functions?

Stan uses its own math libraries and all functions resolve to the math lib equivalents (these often delegate to Eigen or Boost) which bottom out in Stan’s automatic differentiation. PyMC3 and Edward functions need to bottom out in Theano and TensorFlow functions to allow analytic derivatives and automatic differentiation respectively. I know that Theano uses NumPy, but I’m not sure if that’s also the case with TensorFlow (there seem to be multiple options for data representations in Edward).

Both PyMC3 and Edward used a class named Normal to represent a variable with a normal sampling distrbution in the model. These classes are not the same in the two interfaces, nor are they the normal distributions built into TensorFlow (tf.Normal, not ed.Normal) or into SciPy (scipy.stats.norm).

Thus none of PyStan, PyMC3, or Edward can just call arbitrary Python functions as parts of their models. I don’t know where the languages stand in terms of functions available. Stan has a library of linear algebra, probability, differential equation, and general math functions listed in the back of our manual, but I’m not sure where to find a list of functions or distributions supported in PyMC3 or Edward (partly because I think some of this delegates to Theano and TensorFlow).

Extending with user-defined functions

Stan provides a language for writing new functions in the Stan language. I couldn’t see how you’d define new desnities in either PyMC3 or Edward directly in their APIs. Of course, you could use functions to put models together (that’s what I expected to see exemplified in their tutorials).

PyMC3 has instructions for adding new functions, but they don’t work for autodiff (presumably you’d need to go into Theano to do that the way you have to go into C++ to add a new underlying distribution as a Stan built-in). So you’re presumably either out of luck or you try to use something like the zero-trick of BUGS and JAGS.

I have idea what you’d do in Edward to write a new density function.

In all three languages, you can go deep and add new differentiable functions to the underlying math library. I don’t know if that requires C++ in either Theano or Tensorflow, or if you could do it in Python (perhaps with some reduced efficiency).

Minor nitpicks with PyMC3

I’m not a big fan of duplicating all the variable names in quotes. My guess is that there’s not enough reflection inside the PyMC3 model class to avoid it.

The PyMC3 argument naming mu, sd bothers me because I’m a neat freak like every other low-level API designer. For consistency, the naming should be one of

• mu, sigma
• location, scale
• mean, sd

The first one seems opaque. The second one is what I’d have used and what Edward uses. The last choice is what R uses, but I don’t like it because it confuses the parameter with a property of a random variable.

Efficiency and sampling

This post isn’t about efficiency of the model’s ability to calculate values and derivatives. In Stan (and presumably the other interfaces if they’re at all efficient), that time is all taken up doing gradient calculations.

I’m more concerned with user efficiency and clarity in writing models, especially when you have a bunch of models you want to build in a sequence as you’re doing an applied project.

Until Edward implements something like NUTS, it’s not going to be very effective at solving general purpose practical problems with full Bayes because HMC is so hard to tune in general (see the Hoffman and Gelman NUTS paper in JMLR for a sequence of striking examples that even downplay how hard the tuning is by collapsing one dimension of the grid that was searched for tuning parameters) and other approaches are so slow to mix. There’s no fundamental obstacle other than that NUTS is a fussy recursive algorithm; I just don’t think the Edward devs care much about doing practical MCMC.

One thing that’s very nice about PyMC3 (don’t know much about inference in Edward) is that you can piece together block samplers. The rest of the PyMC3 interface is very nice for doing lots of things that wind up being pretty clunky in Stan. Some of these design issues are very relevant for deciding if we want to try to embed a graphical modeling language for Stan in Python.

What might an embedded Stan graphical modeling language look like?

We could define a Python API the following way, which would make the way a model is built up in the embedded language very similar to how it’s built up in Stan, but with quite a few limitations arising from the restriction to graphical modeling.

joint = StanModel();
with joint:
N = data(integer())
y = data(vector(N))
x = data(matrix(N, 2))

alpha = parameter(real())
beta = parameter(vector(2))
sigma = parameter(real(lower=0))

alpha.normal(0, 10)
beta.normal(0, 10)
sigma.normal(0, 1, lower=0)
y.normal(alpha + x * beta, sigma)

data = {N:10, y:np.rand.randn(N), x:np.rand.randn(shape=(N, 2))}
posterior = model.condition(data)
sample = posterior.sample(...)


That’d require a lot of using statements, or you could prefix all those functions/constructors with “st.” or maybe something else (I’m have much less experience with Python than R).

And all the arguments would have names, but do we really want to write normal(loc=mu, scale=sigma) rather than normal(mu, sigma)?

Other thoughts?

I’ll leave the floor open at this point. As you can see, I didn’t do a very deep dive into PyMC3 or Edward and am happy to clarify vague descriptions or correct misconceptions in how these languages would be used in practice. Like I said up front, we’re thinking about adding a graphical modeling language to Stan, so I want to understand all of this better. No better way to hold my feet to the fire than blog about it!

NIMBLE is an API that’s very much like PyMC in R. The big drawback there is that they don’t have autodiff (so it’s like PyMC, not PyMC3). Without autodiff or symbolic diff, it’s pretty much impossible to implement HMC or L-BFGS or gradient descent.

1. Andrew says:

Bob:

As you know, I’m interested in allowing the following sort of capability for Stan, which I think should make models easier to build and to read:

data {
int N;
vector[N] y;
matrix[N, 2] x;
}
parameters {
real alpha ~ normal(0, 10);
vector[2] beta ~ normal(0, 10);
real<lower=0> sigma ~ normal(0, 1);
}
model {
y ~ normal(alpha + x * beta, sigma);
}


Also, you forgot the “lower=0” in your declaration of sigma. Pedantic mode would’ve caught that.

2. Great post, Bob. I was actually writing up something similar but in a much less fair and more biased way than you did, figures. A couple of responses:

> So you really need to consider the definitions of Y, X1, X2 as part of the model specification.

Not sure I follow, these are just how data is generated. You could load this in from whereerver. You could also create a placeholder with theano.shared. The alpha, beta definitions are also only required to generate the data.

> I don’t know if truncation is possible in either Edward or PyMC3.

That’s possible with pymc3.Bound() which creates a new, bounded distribution which gets automatically transformed to be on the real line. In fact, that’s how HalfStudentT is defined: HalfStudentT = Bound(StudentT, lower=0)

> Stan has a library of linear algebra, probability, differential equation, and general math functions listed in the back of our manual, but I’m not sure where to find a list of functions or distributions supported in PyMC3 or Edward (partly because I think some of this delegates to Theano and TensorFlow).

There is pymc3.math (http://pymc-devs.github.io/pymc3/api/math.html) which has most of what’s need for element-wise operations (sine etc) and linear algebra. It actually just maps to the theano functions because random variables are theano expressions, they can seamlessly interact other theano functions.

> PyMC3 has instructions for adding new functions, but they don’t work for autodiff (presumably you’d need to go into Theano to do that the way you have to go into C++ to add a new underlying distribution as a Stan built-in).

If a distribution or function can be defined in theano it works seamlessly. If it is Python it can be used but you won’t get gradients. It is possible to specify a gradient by hand in Python by creating a new theano.op. It is also possible to specify them as C-code (or cuda) as a string in the theano op.

> Stan has Cholesky factor parameters for correlation and covariance matrices, simplex parameters, ordered and unit-vector parameters. I’m not sure if or how these would be handled in PyMC3 or Edward.

PyMC3 will have Cholesky factors in the upcoming 3.1 release and we do find it speeds things up dramatically.

> One place Stan is lacking is in sparse and ragged data strutures or tuple data structures for multiple returns. Is there a way to do any of that more easily in PyMC3 or Edward?

Sparse linear algebra is well supported, although it’s not due to PyMC3 as Theano has pretty good support for sparse operations out of the box. PyMC3 also supports a sparse mass matrix for high dimensional models.

> The PyMC3 argument naming mu, sd bothers me because I’m a neat freak like every other low-level API designer.

That’s a great point, we should really change that for the reasons you mention.

• Thanks for clarifying. I figured if I wanted to get responses, I better play nicely. But seriously, I’m just a programming language nerd, so I like looking at all this language design stuff. Having worked on Stan for six years now, I’m well aware of a lot of the tradeoffs.

What I meant about data variables is that the model at the point it’s defined requires Y, X1, and X2 have already been defined (at least that’s what it looked like). Presumably it’s an error if they haven’t. The key thing here is that the data’s rolled into the model. At least that’s what it looks like. Matt Hoffman originally wanted to go with something like symbolic differentation for Stan (of the kind generated by Theano). It should, in principle, be faster, but I could never figure out how to compile a model with code-generated autodiff without having seen the data or in such a way as to allow conditionals based on parameters (though we think the latter is very dangerous because it can break differentiability; the problem is that a lot of the iterative algorithms like the ODE solvers and some of the math ops for cumulative distribution functions and derivatives used a lot of branching internally. Part of the reason we have fewer options with Stan is that we have an imperative programming language that can monkey with the expression graph; if we restricted ourselves to graphical models, there’s a lot more we can do statically (the same way Fortran can be much more heavily optimized statically than C++ because there aren’t pointers or other run-time shenanigans).

As far as extending PyMC3 with functions, I assumed you could do so by writing Theano functions; in Stan, you can write functions in the Stan language, or you can write functions on the outside in C++ to build in analytic derivatives. I’d like to eventually have a way to define a function with analytic derivatives in the Stan language. We just haven’t gotten around to it yet—no fundamental problem there. The bigger point is that while PyMC3 is embedded in Python, the reverse isn’t true—it isn’t like emcee, that lets you write a general density function in Python. Here, I can just quote the doc in the PyMC3 math package, which says, “Doing any kind of math with PyMC3 random variables, or defining custom likelihoods or priors requires you to use these theano expressions rather than NumPy or Python code.” I assume there’s a lot more to the math library underlying PyMC3 than is pointed to in that Math API reference. I’d suggest providing a complete list of functions supported or at least linking to one. I still can’t tell which functions PyMC3 supports (for Stan, the list is in the index of the manual, which we’re in the process of converting from PDF to web format via bookdown).

It’s the same with Stan’s C++ interface—you can use general C++ constructs, but if you want to operate on random variables the functions better be defined using our math lib functions (or generally templated functions in C++—we do implement the entire C++ standard library for derivatives as well as the C++11 extensions).

Nice to know you can do truncation—for Stan, you can truncate any distribution for which the CDFs are implemented. We don’t know of good CDF derivative libraries anywhere—it’s trivial for the random variable, but it’s a lot harder with respect to the distribution’s parameters. We’re gradually making those better, but any pointers would be most welcome on how to efficiently and accurate calculate log CDFs.

For Cholesky factors, we really like the LKJ prior on correlation matrix Cholesky factors—we use it for everything now. We never got around to implementing Wishart and inverse Wishart for Cholesky factors, but it’s on our very long to-do list.

Do you really not have any examples of submodel structure or other uses of the embedding? To me, that would be the main advantage of embedding!

3. Bob, maybe unrelated, but I’ve been thinking recently about the idea of a parallel/cluster Stan process server written in Erlang. The idea would be that the Erlang based server would sit there waiting for requests, would take a request and fork off a CmdStan process to process the model and archive the results. The whole thing could work across largish clusters of machines etc. The erlang server could also monitor the process and do convergence diagnostics as the model runs and abort early with a useful summary of the problems with convergence (tiny step size, divergent transitions, etc) so that you could proceed with debugging before a couple of days of computing, the results of the computation would be archived and independent of whether you crashed your R session or whatever, and other good features for people who run models that take hours or days or weeks etc.

Just some thoughts I had. Do you think this is a useful idea? Does it double-up on ideas you guys are already working on?

• We’re discussing creating a Stan server at http://discourse.mc-stan.org/t/httpstan-the-http-interface-to-stan/586

I’ve also seen some talk about building in some kind of live monitoring on the old users list (I think that’s where I saw it). Getting some sort of feedback as the model runs would be good. The draws do stream out, so it’s certainly possible.

You’re talking about building a server and adding things we haven’t yet implemented for Stan (like auto-warmup convergence monitoring and restart, which would ideally be done across multiple chains to make sure they all adapt to the same metric).

4. Victor says:

I’m not familiar with either stan (yet!) or the other mentioned libraries, but I think that a more “pythonic” feel could be given to model building if it was something like: (sorry, don’t know how to create codeblocks in the reply, I’m guessing 4 spaces

from pystan import Normal
import numpy as np

# data = {N:10, y:np.rand.randn(N), x:np.rand.randn(shape=(N, 2))}

N = 10

x_obs = np.random.normal(size=(N,2))
y_obs = np.random.normal(size=N)

alpha = Normal(loc=0, scale=10)
beta = Normal(loc=0, scale=10, data=x_obs)
sigma = Normal(loc=0, scale=1, floor=0)

y = Normal(loc=(alpha + beta), scale=sigma, data=y_obs)

y.fit()

I don’t know how hard it would be to make it work like this, though.

• I’m reallying getting the feeling that “not pythonic” is just Python speak for no true Scotsman. It would help me tremendously if rather than just name-calling, some explanation was provided for why the thing being branced as “not pythonic” violates some standard idiom.

What you have won’t work, because the data x_obs isn’t associated with beta, it’s that the normal should get a location parameter of alpha + beta * x_obs. Otherwise, it looks a lot like PyMC3 (not that there’s anyting wrong with that!).

P.S. Four spaces is common in markdown, but WordPress went it’s own special way. The most annoying thing is that you can’t preview comments. You need to use

<pre>

</pre>

5. Ben Bolker says:

I don’t have an example handy, but if you want to expand your scope a bit more, Cole Monnahan has also implemented HMC/NUTS on top of Template Model Builder (https://github.com/kaskr/adcomp/blob/master/TMB/DESCRIPTION: in turn built on CppAD and Eigen): https://github.com/colemonnahan/adnuts

• I talked extensively with Cole at ISEC and the ADMB/TMB workshop before that—they invited me to the ADMB/TMB conference to help them understand HMC and NUTS well enough to implement it. Cole (along with James Thorson and Trevor Branch) wrote a really nice evaluation and tutorial on HMC and NUTS—they got lots of great feedback on that from Michael Betancourt.

TMB itself sort of spun out of ADMB—I say “sort of” because it’s a new language and it uses Brad Bell’s CppAD autodiff system rather than the original ADMB system. Brad was also at that workshop. ADMB was decades ahead of its time in using autodiff to fit stats models. But it wound up mainly being used for fisheries as far as I can tell. I largely blame academia’s insularity for this kind of thing.

6. Tom Passin says:

“I’m looking for something like Bloch’s Effective Java or Myers’s Effective C++.” Here’s a book you could look at:

Effective Python: 59 Specific Ways to Write Better Python

by Brett Slatkin

• You’re the second person to recommend that. I’ll take a look. It’s in Meyer’s (turns out there are two ‘e’s in his name) series.

7. John Jumper says:

A couple of random thoughts:

I can think of an easy way to use the more dynamic nature of Python. You could choose at runtime whether a symbol y is a variable or a placeholder with a simple if statement. I haven’t used Edward directly, but it is quite simple to do in Tensorflow. This would allow you to support either inference or forward sampling using the same code.

Also, you should take a look at PyTorch. It does autodiff directly in Python rather than with a separate compile step (as in Tensorflow). The autograd tape is built dynamically in python on every forward pass. No one has built a nice Bayesian package like Edward on top of it to my knowledge, but it is suitable base to do so. Adding primitive functions to PyTorch is quite easy (http://pytorch.org/docs/notes/extending.html). It has GPU acceleration and is quite fast. Also, PyTorch tends to use objects which combine the operation (like 2D convolution) with the associated parameters. In practice, this is a nice way to think about complex models.

Adding primitive functions to Tensorflow is quite easy as well, and can be written in any of Python, C++, or CUDA. You just have to inherit from the appropriate Tensorflow class. I don’t know if there are extra things to inherit from in Edward, though.

This whole post really shows how much value the Stan team could add by splitting out their sampling and adaptation algorithms into a generic library ;) There are many Stan competitors for autodiff but very few I know for really good Bayesian sampling. The other big value for Stan is the selection of priors / densities but those seem harder to share in a library.

• Our modeling language does compile down to something that could be run with someone else’s autodiff. But we don’t see a demand. Our algorithms (HMC, L-BFGS and ADVI) can all be applied to any C++ object that satisfies the Stan model concept (which is not very complex). But we don’t see a demand.

Basically, everyone capable of integrating our library wants to build their own. Back when Matt and I started this project in 2010, there weren’t any decent C++ autodiff libraries out there. We really didn’t want to build our own, but the ones we found (Sacado, CppAD, AdolC, were all very difficult to extend and didn’t seem to provide very clean APIs [Sacado is closest to Stan]). We originally thought we’d build in Python, but we found Python way too constraining and too slow. So we decided we needed to build everything that did any heavy lifting in C++ anyway, so we then figured we might as well make something that works in Python and in R and on the command line, rather than just in Python. The real motivator there is that most stats people use R, not Python, whereas it’s exactly the opposite in machine learning.

What many people are telling us is that we’ll have a hard time in Python because PyStan is not pythonic. I’m currently trying to figure out what that means! We got the same pushback from Stata reviewers, but alas, they didnt’ say Stan wasn’t statanic (though I wish they would have).

• John Jumper says:

First, “statanic” definitely needs to become an accepted term, perhaps as an antonym to “stantastic”.

I haven’t encountered big slowness problems working with Python-based autodiff, but I am probably more careful than average at making sure I vectorize everything.

I definitely noticed some rough edges on PyStan that I would consider “unpythonic”. In terms of scientific Python, Pythonic really means “feels like using numpy or maybe pandas”. Tensorflow figured this out the hard way, and they ate a bunch of breaking changes before 1.0 to make their interfaces match numpy (see https://www.tensorflow.org/install/migration).

In terms of Stan, I remember thinking that it was weird/unpythonic that I had to separately call “extract” after running Stan sampling. Why not just let me do stan_results.parameters[‘beta’]? I am sure there are other oddities I am not thinking of right now.

8. Dustin Tran says:

Thanks for the insightful and very fair comparison. Other things that I think would be useful to compare are: whether you can sample from the model in the language; how easily you can control inference; how the language supports simulator-based models; code and inference algorithms for a recursive probabilistic program; and how non-probabilistic/Bayesian methods can fit into the language.

> Edward is also different than PyMC3 and Stan in that it broadcasts up the parameters so that they are all the same size. That’s the purpose of the size in scale=tf.ones(2) in the expression defining beta and also the purpose of the multiplication in scale=sigma * tf.ones(N) in the expression defining y.

Edward can also broadcast internally. For example, Normal(loc=tf.zeros(5), scale=1.0). We don’t do so in tutorials in order to make the parameterizations explicit.

> I couldn’t find examples in either Edward or PyMC3 that make non-trivial use of the embedding in Python.

We use the non-trivial embedding for many non-trivial inference problems. For example, the GAN tutorial (http://edwardlib.org/tutorials/gan) uses Python to manually control the inference, generate images from the model, and then save them to a directory (here, Python is needed to direct inference, rather than just send out one command where the PPL does the rest). In a mixture of Gaussians script (https://github.com/blei-lab/edward/blob/master/examples/mixture_gaussian_gibbs.py), we show how you can schedule your own Gibbs sampling scheme. I think we also have an example somewhere we trained a proposal distribution using variational inference, and then applied that proposal distribution in for Metropolis-Hastings.

> Stan has Cholesky factor parameters for correlation and covariance matrices, simplex parameters, ordered and unit-vector parameters. I’m not sure if or how these would be handled in PyMC3 or Edward.

Edward has a bunch of multivariate normal distributions each with different parameterizations; one includes a multivariate normal whose covariance is parameterized by a positive semidefininte matrix. There are various transformations in TensorFlow that can help you go from one data structure to another.

> One place Stan is lacking is in sparse and ragged data strutures or tuple data structures for multiple returns. We have a single sparse operation (sparse matrix times dense vector, which is the one you need for efficient sparse GLM predictor matrices), but no direct sparse data types. We require users to code their own raggedness in long (melted) form. Is there a way to do any of that more easily in PyMC3 or Edward?

Sparse tensors and operations on sparse tensors are supported in Edward. Sparse random variables are not.

> Stan provides a language for writing new functions in the Stan language. I couldn’t see how you’d define new desnities in either PyMC3 or Edward directly in their APIs.

One way of thinking about Edward’s modeling language is that it’s exactly TensorFlow + random variables. Any new functions you write in TensorFlow are applicable on Edward random variables because it’s all part of the computational graph. For implementing new Edward random variables (that is, statements that include sampling or log-densities), we have documentation at http://edwardlib.org/api/model-development.

• You probably have a better perspective on those other things. As you know, we can’t “sample from the model” in Stan—the modeling language is too general for that. That’s why we’re thinking we should add a more restricted graphical modeling language where we can do inference on model structure. Then it’d be easy.

I don’t know what a “simulator based model” is.

I don’t know what you mean by a “recursive probabilistic program”, either. Stan’s functions support recursion, but I don’t think that’s what you mean. Is there an example? As I recall, all the PPAML examples were nothing we’d ever want to fit. Is there a practical example somewhere of what you want to do that you don’t think can be coded in Stan, or is this a programming language theory exercise?

I’m also unclear on what you mean by “non-probabilistic/Bayesian methods”. You could write quicksort in Stan if you really wanted to, but I’m assuming that’s not what you mean. You could use Stan to compute something like ODE solution sensitivities (popular in a lot of aeronautical and structural engineering problems). You can certainly use it for optimizing generic functions. And you can do penalized MLE. Was there an example you had in mind?

• Dustin Tran says:

> As you know, we can’t “sample from the model” in Stan—the modeling language is too general for that.

The model class can be “too general” in either direction. It’d be nice that the language accommodates both model classes, i.e., those that can only be tractably sampled from, and those that can only calculate tractable densities. If you do this at the random variable level, you can also get at hybrids such as deep belief networks—or motivating from statistics, some type of an exponential family prior on parameters of a graphical model but where the exponential family’s partition function is intractable.

> I don’t know what a “simulator based model” is.

A simulator based model, also known as an implicit model, is a model that can be tractably sampled from but doesn’t necessarily admit a tractable density. For example, models whose generative process is defined by differential equations. Stan handles DE-type models by using DE solvers to calculate the density. Alternative methods known as likelihood-free inference consider how you can either purely do inference assuming only samples, or if you build approximations to calculate the density, how you might incorporate that approximation error when doing Monte Carlo. A key application, is say, particle physics where you build a model of two particles colliding together. They are defined by complex dynamical equations, and you’d like to infer some parameters you put into those dynamics.

> I don’t know what you mean by a “recursive probabilistic program”, either. Stan’s functions support recursion, but I don’t think that’s what you mean. Is there an example? As I recall, all the PPAML examples were nothing we’d ever want to fit. Is there a practical example somewhere of what you want to do that you don’t think can be coded in Stan, or is this a programming language theory exercise?

Mostly a programming language theory exercise.

> I’m also unclear on what you mean by “non-probabilistic/Bayesian methods”. You could write quicksort in Stan if you really wanted to, but I’m assuming that’s not what you mean. You could use Stan to compute something like ODE solution sensitivities (popular in a lot of aeronautical and structural engineering problems). You can certainly use it for optimizing generic functions. And you can do penalized MLE. Was there an example you had in mind?

There are a lot of methods, particularly in machine learning, that are not motivated from the model -> p(theta | x) paradigm. They might have a probabilistic/Bayesian interpretation but in any case, it’s interesting to see how these cases might fold into the framework. In machine learning, examples include decision trees, boosting, nearest neighbors, SVMs. In deep learning, examples include greedy layer-wise pre-training, dropout and batch normalization (which differ in implementation during a “training” vs “test” phase), autoencoders, neural nets regularized by early stopping, etc.. (In order to attract an ML / deep learning audience, how Edward accommodated non Bayesian methods was a key factor.)

• Corey says:

This is where I plug one of my favorite investigations of recent years which showed that dropout can be derived as a stochastic variational approximation to full Bayes in deep nets.

• That makes sense. We aren’t at all targeting those machine learning models: deep belief nets or non-probablistic models (SVMs, K-NN, etc.). Even within probabilistic models, we’re concentrating on tractable likelihoods. Primarily because we can’t provide any algorithms with reasonable guarantees for any of the approximate likelihood kinds of things that they do under the heading of ABC (approximate Bayesian computation). Have you found those to be practical solutions to applied problems somewhere?

We’re aiming at a more applied, stats crowd rather than a computer sciencey, open the hood and mess around kind of audience. I should’ve mentioned that up front in the post as we’ve talked about that before. As Michael Betancourt put it, we’re trying to fit the largest general class of models that can be robustly fit with Bayesian models. That rules out all the LDA type models, too, for which Bayesian inference is completely intractable. We can code them up in Stan and get local optima and even sample around that local optimum, but it’s not proper Bayesian inference, so we’re not that interested in it and just tell people to go use Matt Hoffman’s Vowpal Wabbit stochastic variational implementation.

I’m also not clear we can’t code the things that the PPAML people want to code recursively. Last time somebody made that claim (let’s not name names here), I sent them person code showing them how to do in Stan what people were claiming was impossible (can’t even recall what it was).

And while we’re on the topic of approximate algorithms and PyMC3 and Edward, was there ever any progress on tuning ADVI adaptation that we could fold back into Stan? I talked to Dave Blei about it and he just shrugged and said all those evaluations where we couldn’t get ADVI to fit were “all the same” and “not the kinds of problems we care about”, so I took that as a negative. But then I was surprised that PyMC3 can get away with using ADVI for initialization. Did they perhaps figure out a better adaptation strategy than you and Alp coded into Stan?

• Dustin Tran says:

I think with Tianhao, we identified the issues and made next steps to solving how to both do better initialization and better monitor convergence. No solutions as of yet though.

• Great news! I know Andrew was saying that scaling was an issue and when things were transformed to unit scale everything adapted better.

9. Dzhaughn says:

As for being “pythonic,” my cursory reading is that you are trying to create a so-called “internal domain specific language” (specifically, internal to Python). If that’s not true, you may at least want to understand “Why not?” for your own sanity.

https://martinfowler.com/books/dsl.html

https://www.manning.com/books/dsls-in-action

(Ruby was the classic internal DSL language a few years ago; examples in Ruby usually translate to Python easily.)

• We’re well aware of the distinction—that’s largely the point of the post. Stan is a domain specific language (DSL), but it’s not embedded. PyMC3 and Edward are embedded domain specific languages.

10. Thanks for the really nice comparison!

Could you say more about what you mean of “making use of the embedding in Python”? I’ve been playing around with an alternative API for PyMC3 to make the models more reusable and composable (which might be what you mean), and pushed an alpha version to pypi tonight (https://github.com/ColCarroll/sampled). The equivalent “Hello, world!” there would be

from sampled import sampled

@sampled
def hello_world(X):
alpha = Normal(‘alpha’, mu=0, sd=10)
beta = Normal(‘beta’, mu=0, sd=10, shape=2)
sigma = HalfNormal(‘sigma’, sd=1)
Normal(‘y’, mu=alpha + X.dot(beta), sd=sigma)

A variable “X” is required, but all other observed data may also be specified at sample time, so these would both be valid:

with hello_world(X=X):
generated_data = pm.sample()

with hello_world(X=X, y=y):
posterior = pm.sample()

• *those would be valid up to an indentation

• Yes, what I mean is that a hierarchical linear regression looks like this:

$\mu \sim \mathrm{Normal}(0, 5)$

$\sigma \sim \mathrm{LogNormal}(0, 1)$

$\alpha \sim \mathrm{Normal}(\mu, \sigma)$

This defines a joint density $p(\mu, \sigma, \alpha)$. What I’d like to be able to do is write something that encapsulates that model as Python object so that I could take something like regression coefficients $\alpha$ and say, give $\alpha$ a hierarchical prior. We can’t do that in Stan becuas eit introduces two new parameters into the model, $\mu, \sigma$, and our functions don’t let you introduce new variables. But I’d think that you could write such a function in Python.

11. If I’ve understood the ask correctly, you might be interested in our prototypes as examples of “non-trivial embeddings” of Stan in larger applications.

We used Stan to build two non-trivial interactive web applications. One is publicly accessible: http://www.fastforwardlabs.com/pre/. I’d be glad to show the code to anyone interested.

To say we “embedded” it, however, is to stretch the definition of the word. For an interactive application, you need a way to run posterior prediction on brand new data that was not present at training time. Put another way, we’d love for pystan’s model object to provide something like the scikit-learn API, i.e. fit() and, in particular, predict(). That would have made our task simple.

But the only way we could figure out how to do posterior prediction was to persist a large number of samples from the joint posterior, and then reimplement the model in a general purpose programming language suitable for developing applications (Python in our case). To sample from the predictive posterior for a new observation, we ran that observation through the model with a random sample from the persisted joint posterior. We repeated this many times to build up a full picture of the predictive posterior inside our general purpose language, and then served that over an API to the front end.

Needless to say, this was inefficient and error prone. We had to implement the model twice (once in Stan, once in Python). We had to keep them in sync and ensure there were no inconsistencies. And had only a finite set of samples from the joint posterior which in many models (including ours) leads eventually to deterministic predictions. An application with >= 2 languages is not inherently a problem. It may be a worthwhile trade-off. but it’s a trade-off. And not just intellectually (you’ve got to learn another language), but in terms of the overhead of connecting them together, which as far as we could tell is particularly convoluted in Stan for the reasons above.

If there’s an easier way to do the scikit-learn fit() and predict() dance with stan (and Daniel Lee tells me there is!), I couldn’t find it in the manual! Our application would have been much simpler had we done it in pymc3 (although we depend on Cholesky factors, so it would also have been slower until pymc3 3.1). This for us is the number one thing that makes us hesitate to recommend stan for anything but one off batch analysis, where the output is a joint posterior distribution yielding a publishable unit. To be sure, that’s the vast majority of analyses in the natural sciences, but it excludes the use case of most of our clients: interactive applications and user products.

• Andrew says:

Mike:

Soon it should be possible to run a generated quantities block on existing simulations without having to re-fit the model; this should solve some of your problems listed above.

• Pat says:

Your map looks awesome, I would love to see the code! I do GIS in health service research and find myself looking for interesting models/visualizations. I mostly gravitate toward INLA and Leaflet (in R) but I have been working on GPs in Stan.
Neat stuff.

• Thanks for sharing the link.

As Andrew ays, we’ll be extending Stan soon by letting you run generated quantities after a model has been fit. That’ll have the advantage over doing it in R of letting you measure convergence—nonlinear functions of parameters don’t always mix at the same rate as the parameters themselves, so you’ll get more non-convergence diagnostics that way. As you point out, just getting the data isn’t so great for this to do in R, because it’s a pain to unpack the samples and you need to fiddle with all the indexing, which is treachorous in R with implicting broadcasting and reduction. Jouni Kerman, working with Andrew, wrote the rv package for dealing with random variables as a set of samples; it was one of the motivating factors in how we designed the random variables in the Stan language.

We were waiting to stage a bunch of this stuff until the big command refactor landed, and that’s in Stan 2.15 (thanks to Daniel!), so we’re good to go on adding new features like this.

Please let us know on the users list if there are features like this you want. It’s the usual “squeaky wheel gets the grease” issue—unless other people tell us what they need, we just build whatever we need for our next applied project (that’s why Andrew wants generated quantities that run on their own).

• Thanks! And great news on the plan for live generated quantities. My suggestion would be to make the pystan API for that as similar as possible to scikit-learn’s fit/predict, which is becoming a standard (see, e.g. keras).

I discussed with Daniel the possibility of my writing a “Developing applications (as distinct from doing analyses) with Stan” article. Four weeks later and I haven’t written it, letting perfect be the enemy of good. Would the user mailing list be the right place to post my thoughts/experiences, or is there somewhere else for that kind of discussion?

By the way, we used Python not R. R is of course an enormously popular and expressive language for data science, but if the focus is on software engineers developing deployable applications using Stan, then you’ll find much more demand for Python support than R. E.g. when we point our clients to pymc3 over Stan for probabilistic application development, none of them have said “but I need R support”.

• Yes, the new user list on Discourse would be the place to post:

http://discourse.mc-stan.org

That’s also the place to informally request features (we don’t really want to add issues until we have a technical spec that’s actionable).

We completely understand the dominance of Python in computer science. I wouldn’t shed a tear if Python eats the R market (graphic imagery intended). I just wish the community had settled on a better language before standardizing; the math looks far too much like Java and far too little like MATLAB for my taste (R is an odd in between with data structures that drive me mad and that annoying * defaulting to elementwise operations and %*% for guess which operation I want matrix multiplication). And the ongoing split between 2.x and 3.x is just super annoying from a developer perspective.

12. Anonymous says:

I’d recommend watching Beyond PEP 8 — Best practices for beautiful intelligible code:

http://pyvideo.org/pycon-us-2015/beyond-pep-8-best-practices-for-beautiful-inte.html

• Not a big fan of videos. Is there a written version somewhere I can glance through in less than 53 minutes?

• Anonymous says:

Unfortunately not that I can fined. And thank god for speed controls on youtube. The core example starts at ~10 min in.

I think the TLDR is don’t make me do things the language has abstractions for:
You need me to call cleanup? Then let me use with.
You want me to get things by index? Then let me use [i] and provide an iterator.
You give me a size? Then let me use len().
In short use idioms where appropriate. It is not the big things, it’s the thousand little paper cuts.

• That video was great. Not just great in terms of content, but also a really great way to present the material. And it answered a lot of my questions. Thanks!

13. cavities says:

~i want to use edward instead of pymc3,but i can’t find the observed mode when i use Poisson distribution.in bayesian hacker book chapter_1 pymc3 use mcmc get this params but edward seems to not do it;
how many people use it,I’m worried about edward failure;

14. Yasmin Lucero says:

Perhaps late to this, but Jeff Knupp’s book is a good general guide to what “pythonic” means: https://jeffknupp.com/writing-idiomatic-python-ebook/.

Idiomatic python doesn’t have that much to say about data flows and manipulating data frames. Pandas and numpy are obviously the standards that people are use to, but personally I would way rather see a flow like the R/tidyverse ‘dplyr’ data flow, or similiarly, the Spark APIs.

15. C Tan says:

A succinct description of the ‘Python way’ was given by Tim Peters:

https://www.python.org/dev/peps/pep-0020/