When to add a feature to Stan? The recurring issue of the compound declare-distribute statement

At today’s Stan meeting (this is Bob, so I really do mean today), we revisited the topic of whether to add a feature to Stan that would let you put distributions on parameters with their declarations.

Compound declare-define statements

Mitzi added declare-define statements a while back, so you can now write:

transformed parameters {
  real sigma = tau^-0.5;
}

It’s generally considered good style to put the declaration of a variable as close to where it is used as possible. It saves a lot of scanning back and forth to line things up. Stan’s forcing users to declare all the variables at the top violates this principle and compound declare-define clears a lot of that up (though not all—there’s a related proposal to let statements be mingled into the declarations at the top of a block like the transformed data and transformed parameters blocks).
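
For concreteness, that related proposal would allow something like the following (not something the current parser accepts; just a sketch of the idea, assuming a data vector x like the one in the regression example below):

transformed data {
  real mu_x = mean(x);
  print("mean of x: ", mu_x);  // a statement...
  real sd_x = sd(x);           // ...followed by another declaration, which is the part that's new
}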

Compound parameter declare-distribute statements?

Allowing compound declare-distribute statements in the parameters block would support writing linear regression as follows.

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha ~ normal(0, 10);
  real beta ~ normal(0, 5);
  real<lower = 0> sigma ~ normal(0, 2);
}
model {
  y ~ normal(alpha + beta * x, sigma);
}

The current way of writing this would have the same data block, but the parameters block would have only declarations; the uses get moved to the model block.

parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 5);
  sigma ~ normal(0, 2);
  y ~ normal(alpha + beta * x, sigma);
}

Compared to the current approach, the proposed compound declare-distribute statements pull the first three uses of the parameters, namely the assignment of their priors, up into the parameters block where they are declared.

Who’s on which side on this?

I’m trying to remain neutral about who holds which opinion on whether compound declare-distribute statements would be a good thing. (The implementation’s no big deal, by the way, so that’s not even an issue in the discussion.)

Even more ambitious proposal

An even more ambitious proposal (still easy to implement) would be to open up the parameters block to general statements, so that anything at all in the model block could be moved up to the parameters block, though the intent would be for only priors to go there. The reason this matters is that not every prior in Stan can be easily coded in a single line with a distribution.
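
For example, a random-walk prior on a vector takes more than one sampling statement in current Stan (a sketch with made-up names theta, tau, and a data size N):

data {
  int<lower = 2> N;
}
parameters {
  vector[N] theta;
  real<lower = 0> tau;
}
model {
  tau ~ normal(0, 1);
  theta[1] ~ normal(0, 1);
  for (n in 2:N)
    theta[n] ~ normal(theta[n - 1], tau);
}

Under the more ambitious proposal, those three prior statements could sit in the parameters block right next to the declarations of theta and tau.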

Allowing statements in the parameters block makes the transformed parameters and model blocks redundant. So we’ll just have one block that I’ll call “foo” to avoid bike-shedding on the name.

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
foo {
  real alpha ~ normal(0, 10);
  real beta ~ normal(0, 5);
  real<lower = 0> sigma ~ normal(0, 2);
  y ~ normal(alpha + beta * x, sigma);
}

It’d even be possible to just drop the block name and braces; but we’d still need to declare the generated quantities block.

Compound data declare-distribute statements

So far, we’ve only been talking about declare-distribute for parameters. But what about data? If what’s good for the goose is good for the gander, we can allow compound declare-distribute on data variables, too. But there’s a hitch: we can’t bring the sampling statement for the likelihood up into the data block, because the parameters haven’t been declared yet.

The easy workaround here is to mark the data variables right in their declarations instead of grouping them into blocks. Now we can move the data declarations down and put a compound declare-distribute statement on the modeled data y.

real alpha ~ normal(0, 10);
real beta ~ normal(0, 5);
real<lower = 0> sigma ~ normal(0, 2);
data int<lower = 0> N;
data vector[N] x;
data vector[N] y ~ normal(alpha + beta * x, sigma);

This is pretty close to Maria Gorinova’s proposal for SlicStan and to something very much like CiBO Technologies’ ScalaStan. It looks more like PyMC3 or Edward or even BUGS than it looks like existing Stan, but it would still support loops, (recursive) functions, conditionals, etc.

Now what?

Our options seem to be the following:

  1. hold steady
  2. allow compound declare-distribute for parameters
  3. further allow other statements in the parameters block
  4. full ahead toward SlicStan

Stan’s basic design

To frame the discussion, I think it’ll help to let you know my thinking when I originally designed the Stan language. I was very much working backwards from the C++ object I knew I wanted. Matt Hoffman and I already had HMC working with autodiff on models defined directly in C++ (right around the time Daniel Lee joined the coding effort).

I thought of the C++ class definition as the definition of the model—typically in terms of a joint log density (I really regret calling the log probability density function log_prob—it should’ve been lpdf).

The constructor of the model class took the data as an argument. So when you constructed an instance of the class, you instantiated the model with data. That conditions the model on the data, so an instantiated model defines the log density of the posterior.

The data block declaration was essentially a way to code the arguments to the constructor of that class. It’s like the function argument declarations for a function in a language like C or Java. (It works much more indirectly in practice in Stan through generic I/O reader interface callbacks; the declarations specify how to generate the code to read the relevant variables from a reader and define them as member variables in the model class.)

The parameter block declaration was similarly just a way to code the arguments to the log density function. (This also works indirectly in practice through readers, by taking the unconstrained parameters, peeling off the next one based on the size in its declaration, adding any Jacobian adjustment for the change of variables to the log density, and defining a local variable on the constrained scale to be used by the model—that’s also how we can generalize Maria Gorinova’s proposal for functions in SlicStan, which required compile-time unfolding).
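
For example, a parameter declared real<lower = 0> sigma is read off the unconstrained parameter vector as some value u, transformed to sigma = exp(u), and the log Jacobian term log |d sigma / du| = u is added to the log density.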

The model block was just the implementation of the body of the log density function.

That’s it for the basics and it’s still how I like to explain the Stan language to people—it’s just a way of writing down a log density function.

Later, I added the transformed data block as a way to save intermediate values at construction time. That is, member variables in the class that are computed from the data. The transformed parameters block is the same deal, only for parameters. So these blocks define transformed values as variables for the model block to use (and transformed parameters also get saved in the output).
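
In Stan code, the two blocks look like this (a minimal sketch):

data {
  int<lower = 0> N;
  vector[N] x;
}
transformed data {
  vector[N] x_std = (x - mean(x)) / sd(x);  // computed once, when the model is constructed
}
parameters {
  real<lower = 0> tau;
}
transformed parameters {
  real<lower = 0> sigma = inv_sqrt(tau);    // recomputed at every log density evaluation; saved in the output
}
model {
  tau ~ gamma(2, 2);
}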

Next up…

Breaking the target log density into the sum of a log likelihood and log prior (or penalty)…

22 thoughts on “When to add a feature to Stan? The recurring issue of the compound declare-distribute statement”

  1. Bob:

    Matthijs and I discussed this for a few minutes in the Stan meeting after you left, and we have a new suggestion which connects the idea to strong typing (I think that’s the term you’d use) and informative priors.

    It goes like this: Currently, the declaration statement of a parameter defines its transformation and its starting distribution (i.e., the default distribution for the parameter if no other statements are given to supply a prior). For example:

    real a;    // no transformation, uniform prior on a
    real<lower=0> a;  // log(a) transformation, uniform prior on a
    

    But often we want to do something like this: set a parameter to a particular value, for example, ummm, whatever, say 8.5, but with some slack. We could do this:

    real<lower=8.4, upper=8.6> a;  // transformation to logit((a-8.4)/0.2), uniform prior on a
    

    But I don’t want to do this because we can often have problems with hard constraints, and it typically doesn’t make sense to use a hard constraint if you merely want to say that the parameter is close to 8.5. So instead we could do this:

    real a;   // no transformation, uniform prior on a
    . . .
    a ~ normal(8.5, 0.1);   // strong prior keeps a near 8.5
    

    The problem here is that the transformation is all wrong; in particular, we have problems with starting values. Stan will initialize the parameter near zero (by default, uniformly on (-2, 2) on the unconstrained scale), and it can take a while for the parameter to get to 8.5. This sort of thing can cause real trouble.

    So what we want is to (i) transform to be near 8.5, and (ii) put in that normal prior.
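
    (For reference, the usual workaround in current Stan is a manual reparameterization along these lines; just a sketch, reusing the 8.5 and 0.1 from above:)

    parameters {
      real a_std;                  // unconstrained, initialized near 0
    }
    transformed parameters {
      real a = 8.5 + 0.1 * a_std;  // a starts near 8.5
    }
    model {
      a_std ~ normal(0, 1);        // implies a ~ normal(8.5, 0.1)
    }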

    My idea at first was this:

    real a ~ normal(8.5, 0.1);   // transform to (a - 8.5)/0.1, normal prior
    

    The trouble is, how can Stan “know” to use this transformation (or this starting point)?

    So now I’m thinking we should bundle this. Consider the options:

    real a;    // no transformation, uniform prior on a
    real<lower=0> a;  // log(a) transformation, uniform prior on a
    real<normal; mean=8.5, sd=0.1> a;  // transform to (a - 8.5)/0.1, normal prior
    

    That last piece of notation is a bit ugly, but maybe you get the point, that I’d like the prior and the transformation to be bundled. Or at least to have the option of this linear transformation. One could always put in the prior “manually” in the model block of the Stan program, but it strikes me that priors, transformations, and starting values go together. Currently we have to set the prior in the model block, which is awkward enough, and then we have to also specify starting values in these cases where a parameter is far from 0 on the transformed scale.

    There can be other ways of doing this; the point is that, in this way of thinking, the prior and the transformations are aspects of the parameter. This relates to something else we were talking about for awhile, which was the idea of each parameter having its own scale. I think the idea of a “scale” can be subsumed into the larger idea of a “transformation.”

    After the Stan meeting I had the plan to write up the above notes as a document to share. But then I got busy and forgot about it. So I’m glad you wrote your post!

    • The translation and scaling could be rolled into our existing real type with different constraints:

      // transform to (alpha - 8.5) / 0.1
      real<loc = 8.5, scale = 0.1> alpha;
      

      We could use something like “z” or “linear” instead of “real” if the overload would be confusing.

      Do these scales and upper and lower bounds interact? Do you want to start scaling types like simplexes, covariance matrices, etc.? What would that look like?

      This particular case is really just a linear transform, so it could alternatively be coded as

      // transform to alpha * 10 - 85 = (alpha - 8.5) / 0.1 
      linear<intercept = -85, slope = 10> alpha;
      

      That would seem like a more natural type, but it produces too much algebra homework to rescale.

      I wouldn’t think you’d want to make the prior automatically normal(8.5, 0.1) just because you scale the variable linearly to that location and scale.

      • Bob:

        To answer your last question: I guess I was thinking about it the other way, that if my prior is normal(8.5,0.1), then I’d like the transformation to be done automatically.

        So the idea is to think of the default transformation as being attached to the prior:

        – Flat prior on (-infinity, infinity): the identity transformation
        – Flat prior on (0, infinity): the log transformation
        – Flat prior on (0, 1): the logit transformation
        – Normal(8.5, 0.1) prior: the linear transformation, (theta - 8.5) / 0.1 (see the sketch below)
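
        In declaration syntax, that mapping might look roughly like this (just a sketch of the idea; the last line uses the proposed compound form):

        real a;                        // flat on (-inf, inf): identity
        real<lower = 0> a;             // flat on (0, inf): log
        real<lower = 0, upper = 1> a;  // flat on (0, 1): logit
        real a ~ normal(8.5, 0.1);     // proposed: linear, (a - 8.5) / 0.1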

        • Do you think we could extend this to multivariate ad-hoc transforms? We know how to do the nested calculations now, so we could do things like compute the Jacobians using autodiff for most transforms.

        • Bob:

          I’m not sure. I guess we could make a list. The idea would be to think of the model as being set up in terms of the prior, rather than the transformation.

  2. What’s the downside to option 4 (“full ahead toward SlicStan”)? Because that looks beautiful to me. It feels so natural–exactly the way my brain wants to process the model.

  3. Raymond Chen (“The Old New Thing” blog) has often written that at Microsoft, each change or feature proposal starts out with -100 points. Pros and cons add or subtract points. No proposal can be implemented unless it at least gets into positive point territory; the ones with the most points are allowed to be implemented.

    I don’t recall reading how they choose the size of the point additions, and I suppose that calibration would vary between companies and projects. But the general idea seems like it has merit.

    • Exactly! I didn’t see your reply before replying to David J. Harris above, but that’s exactly the point I was trying to make.

      The main negative is that there are now multiple ways to do things and the user has to make a choice. There has to be more doc. The positive is that it can put usage a lot closer to declaration, which is good.

      And I’m not talking about breaking backward compatibility. That’s just adding the intermediate step. Breaking backward compatibility means everyone using Stan models and writing them now has to recode them. We can provide tools to translate, but then that’s another thing to learn/install. Or we can keep supporting both with an automatic tool, but that’s also a headache for us, though better for users. Then there are tutorials, training materials, books, case studies, etc. That’s a lot of investment.

  4. My reaction to the SlicStan proposal is that I’d want to see a few interesting models (i.e., not a simple GLM(M)) coded that way. What does an HMM forward algorithm look like? Still just looping code?

    My experience with complex models is that the blocks help. But I am sure you’ll keep it all backwards compatible, so maybe no reason to worry.

    The SlicStan syntax looks like what I’ve been toying with as intermediate code for the next full version of map2stan. I needed some parsing between the sloppy R side and the strongly typed Stan side, so I started adding those tags for variable types and pushing them into the right Stan blocks from there.

    So if you go SlicStan, you’ll have almost written the next version of map2stan for me. That sounds good!

  5. I appreciate what SlicStan is trying to do, but I would be very hesitant to see such a dramatic transformation of the language. I really like the conceptual distinctions between the current blocks, and I would prefer for Stan to maintain this structure (including a distinct model block).

    I’d love to see SlicStan implemented as a more flexible alternative that automatically translates into block-Stan. This would maintain backwards compatibility while accommodating user choice.

    • This is the problem we face. The current setup is fairly well motivated and the new thing would be a huge change.

      We’d actually write the new language, let’s say Stan++ for now, to compile directly to C++. It’d be much easier to go from Stan to Stan++ than the other way around. It also admits future growth.

      The big obstacle is the same thing I’m worried about with allowing sampling in the parameters block: there are then multiple ways to do things, so choices have to be made. Choices like this are particularly hard for thoughtful beginners. And it means we have to support multiple forms of documentation. Then one person’s way of writing things isn’t going to be very interpretable to others. That’s my big fear: going down the Python 2.x vs. 3.x path.

  6. I’m not sure where I think the language should end up, but I strongly support moving away from the current situation, which I really don’t like because hard constraints on parameter values are separated from the rest of the parameter distribution. Uh, that was probably incomprehensible. Let me give an example. I am not going to try to get the angle brackets right because I know I’ll get it wrong and my comment will be unreadable, so I’m replacing them with parentheses, sorry.

    I often do this:

    parameters {
      real<lower = 0> gamma; 
    }
    model {
     gamma ~ normal(0, 2); // Weak prior on sd of TOD effects (note: half-normal)
    }
    

    If I don’t leave myself that comment, I am for sure going to forget that I constrained gamma to have a lower bound at 0.

    So I really want to be able to say, somewhere,

    gamma<lower = 0> ~ normal(0, 2)
    

    Whether there should be other changes too, and what they should be, I dunno. But I know I want to see this one.

    [edit: escaped code]

  7. I definitely prefer the idea of blocks rather than everything in one place, for two reasons: it’s easier to learn from other people’s code, and it’s somewhat self-documenting.

    When I try to build code, I look at other people’s code. Often I don’t care about the details but rather how they accomplished some particular thing (e.g., how they handled the prior on a variance parameter). With blocks I can easily ignore the pieces that don’t matter to me (such as the data, transformed data, and generated quantities blocks). If it were all mushed together, it would take more brainpower than I’d want to devote just to figure out what was being done.

    Second, blocks are a way of enforcing documentation of intent. It’s easy to see that ‘x’ is supposed to be an input if it’s in the data block, and that alpha is a parameter if it’s in the parameters block. Obviously it sometimes makes refactoring more difficult (for example, when moving a parameter to the data block to make a simpler model for debugging purposes).

    To help both reading other’s code and documenting my own work, it would be useful to be able to put priors along with the parameters (and constraints/transformations on the parameters).

    Along this line, perhaps adding an ‘initial parameters’ block which can assign starting parameters would be helpful.

  8. I really like the approach Elm (http://elm-lang.org/) takes to language features. Nothing is ever added or changed before there are a bunch of examples of actual code people use or want to write, along with the way it would be affected by the change. And I think that made Elm a nice little language. Theoretical arguments are nice, but evidence-based decisions are nicer. So yes, let’s move away from linear regression to more interesting models – how are those affected by the proposed changes? For example, in my splines + differential equation model (https://github.com/cas-bioinf/genexpi/blob/7be629027ca012fd424cc5c282956e2e47bf0263/genexpi-stan/multiple_targets_splines.stan), moving priors to parameter declarations would probably reduce complexity, but only slightly, as the most challenging part is the splines + solver code.

    Also I would like to point out that block membership also determines which values are stored per sample. I have frequently moved stuff from transformed parameters to model block (or to a new scope within transformed parameters) to save memory in large models. If you move towards SlicStan, how do you determine which variables are stored and which are only temporary?
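
    (The trick being described is roughly this; a sketch with made-up names, assuming data vector[N] x and scalar parameters alpha and beta:)

    transformed parameters {
      vector[N] mu;                    // saved in the output for every draw
      {
        vector[N] shifted = x - alpha; // declared in a local scope: not saved
        mu = beta * shifted;
      }
    }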

    • Martin:

      Just to be clear: these suggested improvements to Stan arose from real applications, actual code in real problems we are working on. One issue was models that were becoming hard to read and hard to add to; another was the difficulty, given the current setup of Stan, of including strongly informative priors. As discussed in my comment in the thread above, to introduce a strongly informative prior for a parameter I had to add several lines of code and either introduce an awkward reparameterization or else play around with initial values, which is awkward in its own way because it requires altering the code that calls the Stan program.
