
“Don’t get me started on ‘cut’”

Brendan Rocks writes:

I have a request for a blog post.

I’ve been following the debates about ‘cut’ on the Stan lists over the last few years. Lots of very clever people agree that it’s bad news, which is enough to put me off. However, I’ve never fully groked the reasoning. [I think that should be “grooked”—ed.]

I’ve considered using it for sub-models which use priors to ‘tidy-up’ estimates, before passing them to a “different” model (while propagating uncertainty).

The explanations about why it’s bad news make sense in a general conceptual way (inference should flow both ways in Bayes). However, it’s difficult to think instinctively about how this might affect results, or what the pros and cons of avoiding cut are (beyond the conceptual benefit, and the computational cost).

Can you think of a cute example to illustrate how cut might make a ‘gotcha’ difference to a model?

Are there ever situations where you think it could be conceptually defensible?

My reply: From the modeling side, the problem of imperfect aggregation or transportability arises in many application areas, in particular pharmacology. The general setting is that a model is fit to a dataset y, giving inferences about parameter vector φ, and then it is desired to use φ to make inferences in a new situation with new data y′. The often-voiced concern is that the researcher does not want the model for y′ to “contaminate” inference for φ. There is a desire for the information to flow in one direction, from y to φ to predictions involving the new data, but not backward to φ. Such a restriction is encoded in the cut operator in the Bayesian software Bugs. We do not think this sort of “cutting” makes sense, but it arises from a genuine concern, which we prefer to express by modeling the parameter as varying by group. Hence we believe that the introduction of a shift parameter δ, with an informative prior, should be able to do all that is desired by the cut operator. Rather than a one-way flow of information, there is a two-way flow, with δ available to capture the differences between the two groups, so there is no longer a requirement that a single model fit both datasets.
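The shift-parameter idea can be sketched numerically. Below is a minimal toy version, assuming normal data with known unit scale and a grid approximation standing in for a real sampler; all names and numbers are illustrative, not taken from any paper:

```python
# Toy version of the "shift parameter" alternative to cut:
#   y   ~ Normal(mu, 1)           (original dataset)
#   y'  ~ Normal(mu + delta, 1)   (new dataset)
#   delta ~ Normal(0, 0.5)        (informative prior on the group shift)
# A grid approximation keeps the example dependency-light; in practice
# this would be a few lines in a Stan or BUGS model.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=50)      # original data
y_new = rng.normal(loc=1.5, scale=1.0, size=50)  # shifted new data

mu_grid = np.linspace(-1.0, 3.0, 201)
delta_grid = np.linspace(-2.0, 2.0, 201)
MU, DELTA = np.meshgrid(mu_grid, delta_grid, indexing="ij")

# joint log posterior on the grid (flat prior on mu, up to a constant)
loglik = (-0.5 * ((y[:, None, None] - MU) ** 2).sum(axis=0)
          - 0.5 * ((y_new[:, None, None] - (MU + DELTA)) ** 2).sum(axis=0))
log_prior = -0.5 * (DELTA / 0.5) ** 2
log_post = loglik + log_prior

post = np.exp(log_post - log_post.max())
post /= post.sum()

mu_mean = (MU * post).sum()
delta_mean = (DELTA * post).sum()
print(f"E[mu]    = {mu_mean:.2f}")
print(f"E[delta] = {delta_mean:.2f}")
```

Fit jointly, delta's posterior settles near the between-group difference while the informative prior keeps the groups from drifting arbitrarily far apart: information flows both ways, but the single-model requirement is relaxed without cutting anything.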

Here’s an example from a recent paper with Sebastian and others.

In answer to Brendan’s specific question: Yes, I’ve seen examples where the model for y does not fit the data y′, and something needs to be done. In my experience, instead of “cutting,” it works better to expand the model until it fits both datasets.


  1. Alex D says:

    I feel like it’s easier to see pitfalls here if you state the implicit human-language assumption that goes into using data y to help draw conclusions about the generating process for a different dataset y’ (rather than stating the relationships in terms of mathematical operators, or the “right” way to allow information to flow in Bayesian inference). To claim that y is relevant to y’, you must believe that some part of the generating process is shared between the two datasets; likewise, to claim that y’ gives “contaminating” information about y, you must believe that some other part differs between the two contexts. To share information appropriately, you need to understand and specify which parts of the joint model for y and y’ are shared between contexts and which are context-specific. This is what you are doing when you add context-specific parameters, as Andrew suggested.

    The ‘cut’ operator does not do this — even though you’re cutting a flow of information from y’ -> y, in the y -> y’ direction you are still using the y-specific part of the generating process for y to inform your understanding of the y’ generating process.

  2. Kevin Dick says:

    I actually think it’s “grokked”, not “grooked” or “groked”. That’s how Heinlein originally used it, so that’s how I grok it.

  3. JV says:

    I’ve only ever wanted to use ‘cut’ to generate posterior predictives for use in graphical displays of model (mis)fit. It had never occurred to me to use ‘cut’ to isolate parameter estimates from a second data set. The former use is still practical, I think — but I now end up getting posterior distributions out of the sampler and then generating the predictives with a different program.
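    A minimal sketch of that second step, assuming posterior draws for a simple normal model have already been exported from the sampler (the arrays here are simulated stand-ins, and the names are made up):

    ```python
    # Generate posterior predictive replicates outside the sampler:
    # one replicated dataset per posterior draw, y_rep[s, :] ~ Normal(mu_s, sigma_s).
    import numpy as np

    rng = np.random.default_rng(1)

    # stand-ins for draws read out of a fitted model
    mu_draws = rng.normal(0.0, 0.1, size=1000)
    sigma_draws = np.abs(rng.normal(1.0, 0.05, size=1000))

    n_obs = 30
    y_rep = rng.normal(mu_draws[:, None], sigma_draws[:, None],
                       size=(1000, n_obs))
    print(y_rep.shape)
    ```

    Each row of y_rep can then be plotted against the observed data for a graphical check of model (mis)fit.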
