The network of models and Bayesian workflow

This is important. It’s something I’ve been thinking about for decades, it just came up in an email I wrote, and it’s refreshingly unrelated to recent topics of blog discussion, so I decided to post it right now, out of sequence (the next slot in the queue is in May 2018).

Right now, standard operating procedure is to write a model, fit it, then alter the model and give it a new name, fit it, alter the model again, creating a new file, etc. We’re left with a mess of models in files with names like logit_1.stan, logit_2.stan, logit_2_interactions.stan, logit_3_time_trend.stan, logit_hier.stan, etc., that are connected to each other in some tangled way.

This could be thought of as a network of models, i.e. I’m picturing a graph in which each node is a statistical model (instantiated as a Stan program), and the edges connect models that are one step away from each other (for example, adding one more predictor, or allowing a coefficient to vary by group, or adding a nonlinear term, or changing a link function, or whatever).
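
As a concrete sketch (purely illustrative R, using the file names above as nodes and the one-step modifications as labeled edges):

    # Nodes are Stan programs; edges are one-step modifications.
    edges <- data.frame(
      from = c("logit_1", "logit_2", "logit_2", "logit_1"),
      to   = c("logit_2", "logit_2_interactions", "logit_3_time_trend",
               "logit_hier"),
      move = c("add predictor", "add interactions", "add time trend",
               "let intercept vary by group")
    )

    # The igraph package can hold and draw the network;
    # extra columns become edge attributes.
    library(igraph)
    g <- graph_from_data_frame(edges, directed = TRUE)
    plot(g, edge.label = E(g)$move)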

So I’m thinking what we need—we’ve been needing this for a while, actually, and I’ve been talking about it for a while too, but talk is cheap, right?, anyway . . .—is a formalization of the network of models.

Some possible payoffs:
– Automatic, or at least machine-aided, model building.
– A potentially algorithmic tie-in of model expansion and predictive model checking.
– Some better way of keeping track of multiple models. Using file names is just horrible.
– Visualization of workflow, which should help us better understand what we’re doing and be able to convey it better to others.

I’m thinking it would make sense, at least at first, to set this up as an R or Python wrapper.

22 thoughts on “The network of models and Bayesian workflow”

  1. What is mostly needed is for people to write a single Stan program that can handle any combination of those modeling features. Start the Stan program like

    data {
      int N;          // number of observations
      int K;          // number of common predictors
      matrix[N, K] X; // design matrix

    and now you can add predictors to X without changing the Stan program. One of those could be a time trend, or the product of two other columns, whatever.
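
    In R, for instance, growing X is just a matter of binding on columns before passing the same data list to Stan (a minimal sketch; the data frame df and its columns are hypothetical):

    # Base design matrix (the intercept is handled separately in the Stan program).
    X <- cbind(x1 = df$x1, x2 = df$x2)

    # Add a time trend and an interaction without touching the Stan program.
    X <- cbind(X, trend = df$year - min(df$year), x1_x2 = df$x1 * df$x2)

    standata <- list(N = nrow(X), K = ncol(X), X = X, y = df$y)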

    There are a couple of different ways to add variation by group. If you only allow the intercept to vary by group, then it is easiest to pass in an array of group IDs and use Stan’s indexing (which is like R’s indexing) to do stuff like

    vector[N] eta = alpha[group_ID] + X * beta;

    If you are allowing multiple things to vary by group in a correlated fashion, it is more difficult to do the indexing, although the **brms** package manages to do so by generating the Stan program at runtime. The **rstanarm** package comes with precompiled Stan programs and thus has to rely on the **lme4** package to create a Z matrix that encodes all the group deviations, so you get something like

    vector[N] eta = alpha + X * beta + Z * b;
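
    As an aside, here is a minimal R sketch of how such a Z can be obtained from lme4’s formula machinery, which is roughly the route rstanarm takes (the data frame df and the variables y, x, and group are hypothetical):

    library(lme4)
    # Parse the formula without fitting; reTrms holds the group-level design.
    parts <- glFormula(y ~ x + (1 + x | group), data = df, family = binomial)
    Z <- t(parts$reTrms$Zt)  # Zt is stored as the (sparse) transpose of Z
    X <- parts$X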

    In the **rstanarm** package, we go a bit overboard in order to keep the number of Stan programs small and to re-use common pieces across Stan programs. So, rather than just a logit model, we have one Stan program for the entire Bernoulli family (and conditional logit) that also stratifies the design matrices depending on whether the outcome is 0 or 1:
    https://github.com/stan-dev/rstanarm/blob/master/exec/bernoulli.stan
    But that is the idea.

    • Ben:

      I’m with ya on that, but no matter how clean and comprehensive your program is, I’ll still be wanting to fit multiple models, if for no other reason than that there can be a model I really want to fit but have to build up to.

      For example, in my recent consulting project, we had ideas of a model that incorporated several sources of information over several years. For reasons of computation but, even more, of lack of understanding and experience with these models, I felt it was essential to build up the model step by step. So I ended up with a sequence of models.

      At this point you might say that when building up the model I should be creating a single model, just with flags to turn on and off various features. And if you said that, you might well be right, that this is the way to go, to do it within a single Stan program.

      Still, I think there’s a Cantor-like way in which we’ll always be in the position of creating new models. And, for that matter, if a single Stan model has 5 different flags and thus 2^5 = 32 different possibilities, we’d still want some way to structure all these fits (see the sketch below).

      But what I do take from your message is that the idea of a network of models is important, but in Stan it doesn’t necessarily have to be set up as a network of Stan programs; it could be a single Stan program with a network of internal possibilities.
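
      Here is a minimal sketch in R of the bookkeeping that implies (the flag names are hypothetical, and the fitting line is left commented out because it is only a sketch):

      # All 2^5 = 32 on/off settings for five hypothetical model features.
      configs <- expand.grid(interactions = 0:1, time_trend = 0:1,
                             varying_intercept = 0:1, varying_slope = 0:1,
                             t_errors = 0:1)
      nrow(configs)  # 32

      # One fit per row, passing the flags into a single Stan program as data:
      # fits <- apply(configs, 1, function(flags)
      #   rstan::stan(file = "logit.stan", data = c(standata, as.list(flags))))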

  2. Hi,

    I think this is important, Andrew, and that Ben’s approach is only useful in some circumstances.

    To me, it seems like the right analogy is to software versioning systems. What you want to have is a commented history of the models you’ve run and the samples you drew from them (and possibly some of the posterior or predictive distributions).

    In this history, you should have the option to create branches, or, potentially, to merge them.

    You may have several active versions you are working on simultaneously, so there will need to be a check-in and check-out procedure.

    In the end, you will want to converge on the model you feel best represents the data. It should be simple and easy to understand, and it should not be overburdened with flags representing the entire history of your thinking or an exhaustive exploration of the possibilities.

    On the other hand, important alternative models that were rejected should still be accessible. They may need to be updated to produce the posteriors and predictives that you finally decided were the best way to represent your perspective on the data.

    I think current open source versioning software could be modified to achieve these goals. This would hopefully be general enough that it was not Stan or R or Python specific, but could be used for anyone’s efforts to develop a Bayesian model of their data.

    Thanks for the blog! Always fascinating!
    Opher

  3. Hmmm… aren’t you at risk of (re-)inventing “stepwise modeling”? With the obvious danger of opening to human laziness the comfortable path to intellectual hell that stepwise regression once opened?

    • Emmanuel:

      Nope, I’m not reinventing anything here. It’s unavoidable that we’ll be fitting multiple models to the same dataset: that describes just about every application I’ve ever worked on in my entire professional life. We’re doing it anyway, so we might as well think about how to do it better. I agree that we don’t want to be doing stepwise regression. If for some reason you want to be fitting a subset regression to data, I recommend the regularized horseshoe.

      • For better or worse, we’ll never be able to build what Andrew wants in general because the combinatorics are intractable and Stan’s too slow. That’s true even if we’re at a “big tech” company and have 100K processors in our pocket and restrict our attention to the kinds of regression models in Gelman and Hill’s multilevel regression book.

        What Andrew is suggesting is a semi-automatic (as opposed to automatic) generalization of automatic stepwise regression. If we could build it, it would be abused by people who didn’t quite understand what was going on and the risks involved. That’s also true of using MCMC and Stan, but it didn’t stop us from building Stan. Whatever we do with Stan to try to protect people from shooting themselves in the foot (e.g., not allowing real to integer conversions to avoid discontinuities; avoiding intractable discrete parameters), our users just yell at us for hamstringing them (open-source software users are oddly demanding given that we’re giving things away for free).

        I have some experience with this problem, as it’s what Andrew originally hired me and Matt Hoffman to try to solve (I believe Andrew gave it a poetic name like “unfolding flower” or some such). After a couple months, we gave up because of the combinatorics of even doing this for simple one-level multilevel regression models. Let’s say you have hundreds or thousands of predictors (election data) or even millions or billions (large e-commerce, natural language). What does this “network of models” look like and how would we navigate it? A 1000-way branching graph that’s 1000 elements deep. Then Andrew wanted us to look at one-way, two-way and multi-way interactions plus hierarchical components. I don’t think we ever did manage to convince Andrew it’s intractable, judging from the fact that he is still asking for it, but maybe he’ll comment with a more tractable suggestion that someone could actually build for a limited case.

        As to building a single Stan model as Ben Goodrich suggests, that’s also difficult. It can work in some cases, as Ben has amply demonstrated with RStanArm. But Matt and I could never figure out how to generalize to the interactions case in a form we could even write down in LaTeX and follow. That’s not saying Ben couldn’t do it, though—he’s much better at this than Matt or me.

        In a simple case, like the repeated binary trial case study I wrote, there are a few of these moves I made, but many, many more I didn’t make. I start with a very simple complete-pooling binomial model with beta priors. Then I consider the no-pooling version (just three symbols different in Stan, so “easy” as far as this goes). Then I consider a hierarchical model, but there are a lot of choices of prior here. Andrew tells me he doesn’t like the Pareto prior he used for the hierarchical beta parameters in the rat clinical trial example in chapter 5 of BDA3, but rather than suggesting a replacement, he just suggested avoiding the beta-binomial formulation and looking at it as a logistic regression instead, which I do in the case study after going over his dispreferred prior. And repeated binary trial data with no predictors available at all is about as simple as modeling gets, and already we have this huge bush of possibilities, not a beautiful flower.

        Now let’s suppose we’re in the pharmacology world. We have things like the stiff and non-stiff solvers for ODEs. Then we have approximations that work for linear ODEs (matrix exponential) but need completely different code (RK45 or BDF numerical integrators) when things get non-linear. So if you add a non-linear term to your ODE, the code changes dramatically. Then there’s the issue of how many compartments to model (one compartment; blood vs. other; blood/bone/fat; etc.). Then there’s all the hierarchical and regression predictor stuff on top of that (how old is the patient, what’s their sex, where do they live, what’s their diet, etc., etc.). I just don’t see how we can generalize this to be useful for Stan in general.

        Now look at discrete parameters. We might want to turn a simple distribution into a mixture. That’s a valid “step” in this network of models, yet all of a sudden the code looks very different at the Stan level. You can blame Stan’s lack of encapsulation of model components, its lack of discrete parameters, or just its overall clunky way of expressing models, but I can’t even imagine how this would work as a no-holds-barred, blue-sky, write-a-design-on-a-piece-of-paper kind of thing; nor has Andrew ever presented anything like a wireframe GUI or text interface for what he’d actually want this to look like. Asking for a rough example of how functionality would work is my first line of pushback whenever Andrew or anyone else suggests something—show me what it’d look like in practice; the second line of defense is asking for multiple examples to see how they relate; the third line is to ask for an actual functional spec.

        What it comes down to is that it’s easy to present a compelling picture of something at a very vague level, like Andrew did in the post above. Everyone would love to have some of their grunt work automated. I completely agree that what Andrew’s talking about is a “pain point” for using Stan. But it turns out it’s not so easy to automate. Nobody even has a sketch of a design of what this would look like in practice (other than what Ben’s proposing above). As soon as we got to trying to write down some language for exploring models, Andrew wasn’t happy, because it was just too messy.

        I’m pessimistic anyone will be able to solve this problem given the combinatorics. So I wouldn’t recommend it as a Ph.D. thesis topic.

        • Dear Bob,

          Thanks for this long and thoughtful answer, which gives me more “food for thought”. I’ll have to get back to this after (trying to) formalize a bit what I had in mind, which is close to, but a tad different from, what you described.

          However, very briefly and inadequately: there may be solutions to the combinatorial explosion, by astute use of the constraints of problems (see how algebra and number theory interfaced to solve seemingly intractable problems…). But the *real* problem is that the space of “possible model elements” (i.e. the “pieces” we’re trying to combine) is not, as far as I know, describable (unless you’re willing to reinvent scholastic universals or Leibniz’s calculus ratiocinator).

          I’d rather think in terms of “shapes”, or “structures”, i.e. forms for which an explicit description *may* exist and *may* even be finite, but for which the description of the scientific problem *should* give strong constraints, thus greatly restricting the combinatorial explosion.

          Again, I have to think about this a bit more before writing anything else…

        • Bob:

          I don’t think it would be possible to solve all the big problems here all at once! Maybe I didn’t make it clear that where I’d like to start is with some tools for linking, displaying, and comparing a small number of models (between 2 and 10, say) fit to the same data. When I speak of a network of models: Yes, there is a combinatorial explosion of the network of possible models, but I was just thinking of looking at the network of models of interest, which for a particular application might be the models along one or two pathways leading from a simple starting model to the larger model of interest.
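
          As a minimal sketch of that kind of small-scale comparison in R (assuming rstanarm fits to a hypothetical data frame df, with hypothetical formulas):

          library(rstanarm)
          library(loo)

          # A few fits along one pathway from simple to complex.
          fits <- list(
            base    = stan_glm(y ~ x, family = binomial(), data = df),
            trend   = stan_glm(y ~ x + year, family = binomial(), data = df),
            varying = stan_glmer(y ~ x + year + (1 | group),
                                 family = binomial(), data = df)
          )

          # Compare estimated out-of-sample predictive performance.
          loo_compare(lapply(fits, loo))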

        • Hi,

          Sorry for posting on an old thread, but I have something that might be useful.

          MCMCBayes could very easily be extended to run multiple variations of a model to compare performance. It’s object-oriented MATLAB and can automate sequences of MCMC sessions. I used it in my dissertation research to compare performance across ten models and to perform Bayesian model averaging. It has two key constructs that I call simulations and sequences. The former is for running MCMC on a single model and dataset; the latter is for doing computations with sets of simulations, and both are extensible classes. It works with Stan, JAGS, and OpenBUGS and runs on Windows, Linux, and MacOS. The executions and post-processing analyses are controlled using *.xml files containing all the data and settings, plus simple one-line commands from MATLAB.

          https://gitlab.com/reubenajohnston/mcmcBayes

          Regards,
          Reub

  4. I may be misinterpreting this, but Andrew’s description of his idea reminds me of the visual programming language LabVIEW; if you don’t know it, it is a bit like programming via flow diagrams. I guess the nearest data-related package is Orange (which has the advantages of being both Python-based and free). The models etc. in Stan could be wrapped into/by Orange widgets and joined together in the appropriate way to make the analysis (a “workflow” in Orange-speak).

  5. We certainly didn’t write the programs in **rstanarm** in one sitting. And it was important to use Git along the way as we were (and are) adding options to the Stan programs that seemed useful to people.

    But I don’t think it would be a particularly good workflow to (ab)use Git so that you had multiple tags, one for each variant of the model you wanted to run. So,

    git checkout time_trend
    # run model with time trend
    git checkout multilevel
    # run model with group-level variation

    is not very appealing, in part because you would have to recompile the Stan program after every git checkout, but mostly because that isn’t the use case version control was designed for.

  6. Hi:
    Not quite what you are describing here, but I am developing a package, bnets (Bayesian network models via Stan). It estimates regularized partial correlation networks with lasso, ridge, and horseshoe priors. It also has global and node-wise posterior predictive checks, LOO, and WAIC, and will have stacking and model weights!

  7. It seems to me that this ties into the design of better tools for exploratory programming (it looks like Simon Martin, above, had a similar interpretation).
    I know Mary Beth Kery (http://marybethkery.com/), at CMU, is working on this problem from a Human-Computer Interaction perspective. This paper from CHI last year, for example, studies how data scientists and academic researchers tend to create and organize “versions” and “variants” of models in practice (in a rather messy exploratory process):
    http://marybethkery.com/resources/Papers/variolite-supporting-exploratory-programming.pdf

  8. At the risk of being self-serving, here’s a possibly relevant passage from my recent book chapter on computational social psychology: “…it is not clear why we even need to be parsimonious. Computation is now extremely cheap and relatively easy. Rather than using a handful of empirical tests to ‘eliminate’ models, we can share our modeling code and routinely examine a whole set of plausible models, discovering how they perform in different situations—and in the process, familiarizing our junior collaborators with a range of ways of conceptualizing the domain. Following this approach, models may remain viable contenders for much longer, but when we finally abandon them, we will have a firm sense of why we are doing so and why they were wrong or incomplete.”

  9. Regarding payoff #3 (better way to keep track of multiple models), the purrr package in R makes running multiple models easy by storing everything in a data frame. It’s made possible because R lets you store anything in the columns of a data frame (as a list), not just actual data points. So you can have one data frame with a column that contains the data, a column containing the Stan code, and a column containing the model-fit object, with each row corresponding to a different model. Results, plots, etc. can also be stored in columns of the data frame. This makes it much easier than keeping things in separate files, or even as separate objects in the R workspace, because you don’t have to remember which fitted-model object corresponds to which Stan code file, for example.

    The “network of models” could be coded in a way where the information regarding the network is stored as separate columns in this data frame, with each model/node as a row of the data frame.
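
    A minimal sketch of that layout, with the network encoded as a parent column (the model and file names are hypothetical, and standata is assumed to hold the data list):

    library(dplyr)
    library(purrr)

    models <- tibble::tibble(
      name      = c("logit_1", "logit_2", "logit_hier"),
      stan_file = c("logit_1.stan", "logit_2.stan", "logit_hier.stan"),
      parent    = c(NA, "logit_1", "logit_2")  # edges of the network
    )

    # One row per model: code, fit, and lineage all live together.
    models <- models %>%
      mutate(fit = map(stan_file, ~ rstan::stan(file = .x, data = standata)))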

  10. Hi. So this reminded me of a project I worked on as a usability tester in the early years of the World Wide Web called PadPrints, a zooming web browser. I like this idea of building a network as we code and make decisions. We often cut and paste code and make slight changes to models but often do not really track that ourselves, although we should.

    This idea of a network reminds me of how I program in SAS Enterprise Guide. One can see a network of programs visually linked in a series of paths and organize them into paths within a project. However, I would need to do the linking myself; it’s not hard, but it is not automatic. Also, there is no tracking of cut-paste-and-modify steps.

    Your way seems more automated, like tracking which original model was copied to get a starting place for the next model. So as we copy and make slight changes, we could track that. I like this idea of the software tracking it for us rather than us documenting it, because often I try something, make a slight change, run, etc. All the runs could be stored, but the visual display and the ability to navigate are key. Perhaps this should be separate from the automated part of model selection? Basically, think about the tracking of our decisions separately from the automation part?

    I think with the automation part your idea is very similar to SAS Enterprise Miner, where you have model nodes in your network.
