This post is by Bob Carpenter.

*Stan is Turing complete!*

There seems to a persistent misconception that Stan isn’t Turing complete.^{1, 2} My guess is that it stems from Stan’s (not coincidental) superficial similarity to BUGS and JAGS, which provide directed graphical model specification languages. Stan’s Turing completeness follows from its support of array data structures, while loops, and conditionals.

*What is a probabilistic programming language?*

A free, early-access chapter of:

Avi Pfeffer. 2014 partial draft.

Practical Probabilistic Programming. Manning Publications Co.

neatly sums up (on p. 9) the emerging definition of “probabilistic programming language”:

Representation language:a language for encoding your knowledge about a domain in a model

Probabilistic Programming Language:A probabilistic representation language that uses a Turing-complete programming language to representation [sic] knowledge.

The next paragraph continues with

Although probabilistic representation languages with programming language features have been developed that are not Turing-complete, like BUGS and Stan, this book focuses on Turing-complete languages.

And yes, I’ve let the author know Stan’s Turing complete on the Manning feedback forum, and he’s already responded that he’ll fix the next draft.

I’m also curious why Frank’s tutorial on probabilistic programming lists Stan on page 7 under the heading “Application Driven” rather than the heading “Probabilistic Programming Language”.

Stan’s really an imperative language for specifying a log posterior (or penalized likelihood function if you want to roll that way) up to an additive constant. Bayes’s rule shows that it’s sufficient to specify the log joint up to a constant. We then translate to a C++ class that reads in data in the constructor and provides the log posterior as a function of the parameters as a method in such a way that it can be differentiated at first (or higher) order. We then use this class for inference with MCMC or for point estimation with L-BFGS or for letting users explore their own inference algorithms.

If a taxonomist forced me to split, I’d say Stan is an imperative probabilistic programming language, Pfeffer’s language, Figaro, is an object-oriented probabilistic programming language, and Church is a functional probabilistic programming language.

*Expressive in practice or in theory?*

I’d also like to discuss the following claim, on page 9 of Pfeffer’s pre-release book version, and echoed in every other intro to probabilistic programming I’ve seen:

Naturally, representation languages with greater expressive power will be able to capture more knowledge, making them more useful.

The true measure of usefulness for a user is how easy it is to express the class of models the user cares about and how efficiently, scalably, and robustly the software can fit the model. Assembly language is Turing-complete and can encode a pseudo-random number generator, but that doesn’t make it “able to capture more knowledge” than BUGS in any practical sense.

Mark Chu-Carroll has an insightful discussion on his *Good Math/Bad Math* blog of Turing equivalence and Turing completeness and what it’s good for.

*Whither Modularity?*

What we’re really after is modularity. Stan’s going to add user-defined functions in version 2.3.0, which will be handy. For one thing, that should make it clear that Stan’s Turing equivalent. But functions are not not going to solve our modularity problem.

By modularity, I mean the ability to express a Bayesian “submodel” directly in the language. For instance, I might drop in a hierarchical logistic regression module with a configurable set of hyperpriors and then re-use it in many models. But I can’t do that in Stan as it will be in version 2.3.0, because that would require somehow exporting variables for parameters and data from the function to the global environment so that our inference algorithms, like HMC and L-BFGS, can see them jointly with all other parameters. Maybe in Stan 3.

Suggestions welcome, as always!

^{1} Informally speaking, a computable system of defining functions is Turing complete if it can compute any computable function. All of the standard programming languages (C, Java, Python, R, Lisp, etc.) are Turing complete, as are Turing machines and general recursive functions and untyped lambda calculus and first-order logic (the connections to Goedel’s incompleteness theorem run through this connection).

^{2} Technically speaking, no modern computer is Turing complete because there is a finite physical bound on available storage, whereas the full definition requires unbounded storage. So we squint a little and pretend we have infinite amounts of storage.

I’ve been assuming that Stan is Turing complete, so nice to hear confirmation that it is.

I find the Stan imperative approach is very easy to think with. It has made simple some models that would require lots of data reformatting in BUGS/JAGS. Or maybe I’m just not fluent enough with the BUGS/JAGS declarative style.

For example, I’ve been working on a big hidden Markov model in Stan. The are hundreds of individual agents, each with multiple behavior sequences, and each sequence needs its own forward algorithm pass. With Stan, I could just feed in the data in its natural (grouped by sequence) form, and then use conditionals to detect when to restart the forward algorithm and update the log posterior.

It’s even more complicated, because the “emissions” in the model are latencies, and they are censored at start and end of each sequence. So the log-probability of each emission is a conditional likelihood depending upon position in the sequence. Again, it was easy to code this in Stan. It worked on the first try, actually.

I’m not sure how I’d do all of this in BUGS/JAGS, to be honest. Maybe it’d be simple. But my imperative programmer mind (I learned C on a SPARCstation) got it to go right away with Stan. And the resulting model is very efficient.

So it may be that users who haven’t done a lot of imperative programming will prefer a JAGS-style declarative approach. But I find the Stan approach very cromulent.

Is there a syntax diagram for Stan somewhere? I was curious to look.

Yes, the manual has a full BNF (though I think it’s missing some of the operators).

Thanks! And RTFM to myself. :) ( I must have missed it)

How will the user-defined functions work? For example, will we be able to use more or less any old R function in a Stan model?

They’ll let you define functions in a C++ or C-like style that can be used as statements or expressions in the language.

Stan is not based on R. There’s no way to access R functions, nor do we plan to. For any C++ function we use in Stan, it needs to be fully templated so that we can automatically differentiate it and every function used in the definition needs to be one that we define. So there’s no way we can roll in any R functions, much less all of them.

Is there something specific you’re looking for?

BOB: “Stan’s really an imperative language for specifying a log posterior (or penalized likelihood function if you want to roll that way) up to an additive constant”

I hope that is not what STAN aims to be. IMHO STAN should aim to be something more high level and abstract. Personally, I would aim for: “STAN is a language for encoding your knowledge about a domain using generative models”.

The difference is subtle but critical. The latter focuses on usability not statistics. (Science statistics.)

PS DAGs also meet the criteria of being “a language for encoding your knowledge about a domain using generative models” except they do not generate anything by themselves. Even so, you can still answer a lot of causal questions with them. In this sense we have:

probabilistic programing language = representation + generation

I think you quote lays more emphasis on the generation than the representation, while mine focuses on the representation. So to be exact maybe we should say: “STAN is a language for encoding your knowledge about a domain _and replicating its output_ using generative models”. After all, much of BDA is about y_rep. But again, the focus is on the use not the stats under the hood.

Why aren’t DAGs becoming more mainstream, more widely used? This is not snarky. I’m just curious.

@Rahul

My sense is they are becoming much more widely used, and their use will grow exponentially. But science advances one dead statistician at a time :-)

Graphical models more generally are very widely used in ML. Just open a recent textbook like Murphy.

Fernando:

Now that my life is (in expectation) quite a bit more than half over, and I’ve had heart problems, I’m really starting to dislike that expression. No joke.

@Andrew

I take it back. Sorry.

Thanks.

Hierarchical/multilevel models are basically DAGS. Even Andrew/Bob often refer to Stan as “A (Bayesian) Directed Graphical Model Compiler”

Why aren’t DAGs/hierarchical models used more often relative to say, trying to shoehorn every applied data analysis problem into combinations of t-tests, linear regressions, bootstraps, fisher exact tests, and multiple comparison corrections? I would say this is an unfortunate consequence of 4-5 generations of failed statistical education.

Anon: “Hierarchical/multilevel models are basically DAGS.”

No they are not. DAGs are non parametric, and causal in nature. Hierarchical models need not be.

“No they are not. DAGs are non parametric, and causal in nature. Hierarchical models need not be.”

Hmm, what do you mean by non parametric? Isn’t multivariate Gaussian used practically always for continous data? (Note: Most stuff what I have read about grahical models doesn’t actually involve causality, only conditional dependencies. So the way I see DAGs is probably quite a bit different than what you actually mean in this case)

PS to be clear I am not arguing for DAGs and against hierarchical models. On the contrary, I think their combination is awesome. But they are not the same thing. I think the former are better for identification, and the latter for estimation.

I’d like to make that past tense! I did a presentation at the ML Meetup in NY before Stan was fully implemented and before we fully realized how general the language was ourselves. We try not to refer to directed graphs at all any more, because they’re not actually intrinsic to Stan’s modeling languages. Now all we need to do is expunge it from our C++ namespaces, which still have the model parser and code generator in

`stan::gm`

.The directed acyclic graph structure is only part of the model. You also need the density or mass function for each node conditioned on it parent nodes. And then you usually have “plates,” which in theory can be blown out to simple DAGs. But this is very inefficient, as you can see, by say, trying to scale up an IRT model in both students and questions. BUGS or JAGS scale up in memory with the number of edges, whereas Stan scales with the number of nodes.

Bob: “The directed acyclic graph structure is only part of the model. You also need the density or mass function for each node conditioned on it parent nodes.”

Yes, that is the difference between a (deterministic) causal model, and a probabilistic causal model. But you only need the probabilistic part for some purposes and not others. And you can get very far without it (e.g. you can understand identification, selection bias, missing data, instrumental variables, exclusion restrictions, etc using graphs alone, and often much more easily an intuitively than with probabilistic models).

From a didactic perspective I think this is useful bc we need not scare people away with complex math, etc.. from the get go. They can get very far with much less machinery. (But I understand that you want to cater to people already advanced who want to do even more advanced models easily.)

BTW this is in no way to criticize what you are doing. I expect to only use Stan or something like it in the future. I just don’t think it does Stan justice to describe it as a language to specify a log posterior up to an additive constant even if indeed that is exactly what it does. I mean, is that the mission statement for the Stan team: To create a language for specifying a log posterior up to an additive constant?

One of our main goals for the Stan

languageis to provide an easy and expressive and efficient way of specifying the log posterior. As well as for generating derived quantities probabilistically. That’s not an end itself. We also want to provide estimation and inference and model checking over that.We’re going to be diving into tutorials, but we don’t have any plans for deterministic graphs or really for graphical models at all other than as a subset of more general models. After all, (smoking -> cancer) is hardly deterministic. We’re trying to make a tool for applied statisticians, not logic students.

Bob: “After all, (smoking -> cancer) is hardly deterministic. We’re trying to make a tool for applied statisticians, not logic students.”

Many problems that are considered statistical problems — from Simpson’s paradox, to missing data, etc — do not require any knowledge of probability. You can understand and address them using DAGs, and d-separation, period. Indeed you can even use DAGs + d-separation + data & visualization to make non-parametric inferences. At which point the main benefit of statistics is regularization for sparse strata (and, having gotten here I wonder whether regularization is statistics or a convenient kludge).

To wit, it is easy to show using this framework where and why MRP will go wrong, and how to test for it in specific applications before believing your results. Presumably this is of interest to statisticians. Unfortunately probability is useless in this regard bc it cannot differentiate causation from correlation. It is an ambiguous language for causal inference, an ambiguity Stan will inherit if you ignore DAGs.

PS: OK you need basic knowledge of probability distributions and conditional independence but that is about it.

smoking cancer becomes probabilistic once you specify a distribution for the background causes U as in smoking cancer $\leftarrow$ U.

At a basic level the U is what probabilistic programing languages like Church automate, with routines like flip. But in causal inference this is often not the contentious part, unless you want runs simulations or make inferences about unobserved variables. The acrimony is typically about theory and identification, none of which require statistics per se.

Moreover, DAGs are intrinsically heterogeneous so they include all possible interactions. The issue is how you want to estimate them.

So I don’t know about applied statisticians, but applied scientists can get very far with this framework, and do things in a simple fashion where many statisticians still struggle.

What kind of use do you have in mind? Because my impression matches Fernando’s.

Some models can be easily expressed in a simple directed acyclic graph specification language. Any such model without discrete parameters can be directly translated line-by-line from BUGS/JAGS to Stan.

And while my original conception of Stan’s language was following BUGS, we all soon realized it was much more general. It’s really just a way to specify a log posterior. So you don’t have to specify the full joint probability in Stan, just something equal to the log posterior up to an additive constant. This opens up many more programming possibilities.

Another example is that we can do lots of imperative programming in the model, such as including conditionals that are quite a pain to express in BUGS even if it is technically possible in many cases.

There are also many researchers (mostly in AI-related fields as far as I can tell) using DAGS for causal inference and/or graph structure induction (I’m not sure how these are related, to be honest, despite being a professor in the same department as Clark Glymour and Peter Spirtes for almost a decade — I did logic and natural language semantics, not stats, back then).

PPS One should also be clearer about the meaning of “domain knowledge” and “generative models”. From the graphical roots of probabilistic programing languages I surmise these refer to causal knowledge and structural models. But they need not. They could refer to knowledge about correlations and predictive models.

It’s not so much what Stan aims to be as what Stan actually is.

You’re right about the subtlety — I didn’t understand the point about science vs. statistics in the first post or either of the second posts. Maybe I’m missing what you mean by “generative.”

I think of Stan as working in two stages — the language defines a log probabilty function, then you can use it for sampling or optimization. And we plan to add marginal maximum likelihood, EP, and VB over the next year or two, so there will be more options in terms of inference to plug in for the models defined in the language. But the language itself is agnostic as to how the log probability, data, and parameters defined by the model will be used.

Probabilistic programming languages like Figaro (object oriented) or Church (functional) don’t seem to derive from graphical model representation languages like BUGS, at least as far as I can tell.

My understanding is that Church came up as a way of extending graphical models, which become ugly and unwieldy when you have too many nodes.

Smoking Cancer is a generative model in the sense that it captures what we think is the data generating process. But to get it to actually generate data you need to add a few random variables, like the flip routine in Church.

You don’t need formal Bayes for any of this, though it can be very useful. But all I’m trying to say is the goal ought to be to keep things simple. If even Andrew has trouble understanding VB I think there is something wrong. I find simplicity a great virtue. What I’ve learned using DAGs is that things can be a lot simpler that using stats. You can search for Simpson’s paradox in this blog to see my point.

Fernando:

You write, “If even Andrew has trouble understanding VB I think there is something wrong.” I disagree. You might as well argue that, given that almost none of us understand how a jet engine works (let alone the principles of heavier-than-air flight or the metallurgy needed to build an aluminum fuselage), that we should be going across the Atlantic in sailboats.

We use complicated methods such as HMC and EP and VB because we have problems to solve. We want to get to Europe in 7 hours rather than 7 weeks, as it were, and we’re willing to use the latest technology to do so. It is the nature of the latest technology that few people understand it.

Andrew:

I disagree with you. The latest technology is not always the best solution. That is why we use Russian rockets to send satellites into space, or why the Mig fighter jet, as I understand it, was such good value for money.

You like regularization in statistics. I like regularization in research practice. Here is what I mean:

Despite a century of research we still don’t know why people get fat. In this time we have used increasingly complex statistical techniques, faster computers, etc.. yet arguably 19th century doctors had a better understanding of the underlying causes. There are many reasons for this, including many you discuss in this blog (incentives, publication bias, etc.)

But here is my hypothesis, and counterfactual scenario. If the only stats we had ever taught researchers over the past century was to compute a difference in means and CI. And if instead we made them focus on theory, concepts, measures, measurement instruments, and research design. Then I bet we would have much better knowledge today about the causes of obesity. Admittedly this is just a hypothesis. But in my defense I would adduce that much of the “credibility revolution” in econometrics is a back to basics approach. Natural experiments try to recreate simple experiments, where often all you need is a simple difference in means.

Most people (myself included) don’t know how to handle complex methods, and are taught to believe that science is running a regression — the more complex the better. I believe this has it all upside down. You are building powerful stuff, but that does not immediately imply you will get powerful results (not you personally, who know how to use it, but broadly). In an ideal situation yes, in practice, I am not so sure. This is why, when Bob in the post asks for suggestions, I suggest he focus on usability and simplicity. But feel free to ignore me.

Fernando:

I guess it depends on what you want to do. For the sorts of problems I work on, it is computation that limits me from making my models as complicated as I’d like. For example, I do a lot of work on sample surveys. To generalize from sample to population requires a lot of struggles, as the saying goes. If we could easily fit models with deep interactions, that would help. And I think the same principles arise in causal inference. As for natural experiments, they’re fine as far as they go, but as we’ve seen on this blog, people often get a bit too excited by them. If sample size is small and variation is high, you’ll run into trouble no matter what.

I agree that for some problems you really need a super computer, and the latest fancy software. I am just reacting to the idea that the more complex it sounds the better it is. That may be true for some applications but not all (perhaps not even many).

But we are trying to focus on usability and simplicity!

Our motivation in building Stan was that we had models we wanted to fit that other software couldn’t fit, or at least not as efficiently as we wanted. We of course still have a long way to go.

Some problems don’t have simple solutions. Or at least don’t currently have simple solutions. What’s considered “simple” changes over time as more people learn it. For instance, I now believe calculus, linear algebra, and computer programming are all considered simple in that we teach them to every engineering undergraduate, and teach the first and third to high school students.

As an example, we have been working on drug-disease models with Novartis for clinical trials. These models intrinsically involve differential equations to model drug diffusion and then relate them to observable effects, like visual acuity tests. There’s not a simple way to express such a model as a DAG. Yes, you can do it in PKBUGS, but not with the BUGS language for graphical modeling — you have to extend the notion to include nodes that are function definitions for the system of differential equations and then call an integrator to solve the differential equation. And then you need noise models for your noisy measurements to fit things like rate equations in the first place. Then you need to take into account that patients vary by sex and by age and by a host of other genetic and environmental factors that we don’t even have measurements for. And I almost forgot that we also want to take into account a meta-analysis of previous experiments, and have several drugs, dosages, and placebo controls.

It’s not that complicated, but it’s hardly smoking -> cancer.

Bob:

We agree. The thing is you come at Stan from the perspective of a tool to solve problems that previously were very hard. I come form the perspective that I would like to teach high school kids how to do science using something like DAGs + Stan. I think it could be very simple, kind of a Logo version if you will. So all I am saying is not to loose track of other great potential uses of the language, which could be as important.

BTW I have a side project looking at HIV transmission among high school kids in Western Kenya and was thinking of using an agent-based model. I am new to them but I think they can be very useful, and could usefully be coded in Stan I imagine. For example, Andrew’s 4-part model of toxicokinetics could be recast as an agent-based model where each organ is an agent.

Why do this? In the case of human behavior experts might have much better priors about the parameters governing human choices than a differential equation of disease transmission that results from their interaction. It can also simplify model specification, as simple behaviors can engender pretty complex phenomena. But again, I am new to this. Grateful for any pointers or suggestions.

@Fernando: It may be possible to wrap Stan with something simpler that could be used for teaching. But given all of our other goals, this isn’t even a low-priority item for us. Something simpler involving a few models and Javascript on the browser may be a better model than Stan for teaching.

@Bob:

Daggity already does exactly that, right?

http://www.dagitty.net/

@Rahu

@bob

I’m not making this ruckus to have some toy example javascript on a web browser. But I’m not going to spell it out. Just think daggity with a simulation and inference engine, one that knows about d-separation.

Fernando:

Why would you need such a complicated and indepth tool for High School students? I don’t see the necessity, I actually think that might even make things more complicated for them than you would want.

Daniel:

I disagree. The tool, as I see it, would actually be quite simple, and powerful. Besides one should not underestimate what kids can accomplish.

But don’t get me wrong, I am not a High School teacher. My point is that science, and causal inference in particular, is made unnecessarily complicated. You don’t need Stats 101-103 to do it. Using graphical tools like DAGs and a simple UI (though not a simplistic solution) even high school kids can address important problems and make important contributions. That is my belief. I draw inspiration form things like:

http://worrydream.com/#!/MediaForThinkingTheUnthinkable

http://worrydream.com/#!/LearnableProgramming

https://probmods.org/generative-models.html

One thing I liked about Harvard is that PhD courses were often open to all students, including some high school ones.