I (Bob, not Andrew) doubt anyone sets out to do algebra for the fun of it, implement an inefficient algorithm, or write a paper where it’s not clear what the model is. But…

**Why not write it in BUGS or Stan?**

Over on the Stan users group, Robert Grant wrote

Hello everybody, I’ve just been looking at this new paper

[ed. paywalled link, didn't read] which proposes yet another slightly novel model for longitudinal + time-to-event data, and wondering why they bothered with extensive first-principles exposition of the likelihoods and then writing a one-off sampler in R. (Of course, the answer probably is a combination of the authors’ mathematical expertise and the journal’s fondness for algebraic exposition.) Why not just write it in a graphical model style like Stan? They report getting 2000 iterations every 24 hours, and I find it hard to believe this is a sensible way to do it. Is there any reason why this sort of model can’t be done in Stan?

I replied

It’s not the number of iterations that matters, but the time to convergence and then the time per effective sample after that.

Stan’s not just a graphical modeling language — you can write down any expression that’s equal to the log posterior up to an additive constant.

My general experience is that it’s usually easy to fit models that have continuous parameters in Stan. We’ve fit all kinds of models that people have written complicated software for. Most of those models could’ve been fit in BUGS or JAGS, too, which would be easier than writing custom code.
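To make the "up to an additive constant" point concrete, here is a minimal Python sketch (not Stan code, and not the model from the paper Robert mentions). It assumes a one-predictor logistic regression with independent normal priors; the function names, the toy data, and the `prior_sd` value are all made up for illustration. The version that drops the prior's normalizing constant gives the same log density differences between parameter values, and those differences are what a sampler relies on.

```python
import math

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_post_unnormalized(alpha, beta, xs, ys, prior_sd=2.0):
    """Log posterior up to an additive constant.

    Assumed model (illustrative only):
      alpha, beta ~ Normal(0, prior_sd)
      y[i] ~ Bernoulli(inverse_logit(alpha + beta * x[i]))
    """
    # Normal priors with the -log(prior_sd * sqrt(2 * pi)) terms dropped.
    lp = -0.5 * (alpha / prior_sd) ** 2 - 0.5 * (beta / prior_sd) ** 2
    for x, y in zip(xs, ys):
        eta = alpha + beta * x
        # Bernoulli-logit log likelihood: y * eta - log(1 + exp(eta)).
        lp += y * eta - log1p_exp(eta)
    return lp

def log_post_with_prior_constants(alpha, beta, xs, ys, prior_sd=2.0):
    """Same density with the two prior normalizing constants added back."""
    const = -2.0 * math.log(prior_sd * math.sqrt(2.0 * math.pi))
    return log_post_unnormalized(alpha, beta, xs, ys, prior_sd) + const

# Hypothetical toy data.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# Differences between two parameter settings agree for both versions.
diff_unnorm = (log_post_unnormalized(0.5, 1.0, xs, ys)
               - log_post_unnormalized(0.0, 0.0, xs, ys))
diff_const = (log_post_with_prior_constants(0.5, 1.0, xs, ys)
              - log_post_with_prior_constants(0.0, 0.0, xs, ys))
```

Nothing here depends on the model being a directed graph; any expression with the right differences would do.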

Robert replied

… I feel that journal readers are poorly served by pages of algebra and no straightforward BUGS/Stan code. With good checking, it should be ok to present a new model in this way. After all, I don’t have to write down the likelihood and Newton-Raphson every time I publish a logistic regression.

I am always struck by this same issue. Here’s what I think is going on:

1. **What goes in a paper is up to the author**. If the author struggled with a step or found it a bit tricky to think about themselves, then the struggle goes into the paper. Even if it might be obvious to someone with more experience in a field. I was just reading a paper with a very detailed exposition of EM for a latent logistic regression problem with conditional probability derivations, etc. (JMLR paper Learning from Crowds by Raykar et al., which is an awesome paper, even if it suffers from this flaw and number 3.)

2. **What goes in a paper is up to editors**. If the editors don’t understand something, they’ll ask for details, even if they should be obvious to the entire field. This is agreeing with Robert’s point, I think. Editors like to see the author sweat, because of some kind of no-pain, no-gain esthetic that seems to permeate academic journal publishing. It’s so hard to boil something down, then when you do, you get dinged for it.

I find lots of exposition of basic notions in applied papers in natural language processing, where almost every paper that uses a notion like entropy (including one I just wrote) redefines it from scratch. For that same paper, the editors wanted to know about algorithms for fitting a latent logistic regression. For some journals, you have to insert p-values even if they don’t even make sense from a frequentist perspective.

3. **Authors tend to narratives over crisp expositions**. They want to tell you their story of how they got to the conclusions. (Even me, hence the preface of the Stan manual.) One of the things that frustrates me to no end in papers, particularly in statistics, is that the author tells you a narrative of how they meandered to the model they finally use; in computer science, they meander in the same way to models or algorithm variants. What I want to see is a crisp statement of the model with enough details that I can replicate it. Providing working code is the easiest way to do that. It’s why I liked Jennifer and Andrew’s regression book so much — they gave you (mostly) runnable code.

**The way forward**

I think what applied papers need to do is lay out their model crisply. BUGS is a nice language for this if your model is short and easily expressed as a directed acyclic graph. I think the variable declarations in Stan make it even clearer what’s going on, especially as models go beyond a few lines.

The separation of data and parameters in Stan makes it evident how the model is intended to be used for inference — that information is not part of the model specification in BUGS. But this is also a drawback to Stan’s language in that the distinction between data and parameters is not part of a Bayesian model per se, just a statement of how you’re going to use it. One upshot of this is that missing data models in Stan are awkward, to say the least. (Hopefully the addition of functions should clean this up from a Stan user’s perspective, but it’s still not going to provide an easy-to-follow model spec like many models in BUGS. A possibility we’ve been considering is adding a graphical model specification language that could be combined with data to compile a Stan model — the issue is that we need the graphical model plus the identity of variables provided as data in order to compile and then run a Stan model.)
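Here is a rough Python sketch of that last point, under loud assumptions: a toy model in which every `y` is Normal(`mu`, `sigma`), flat priors on `mu` and `sigma`, and made-up names and numbers throughout. The density is the same whether a given `y` is handed in as data or treated as an unknown; only the split of the arguments changes, and that split is a statement about use rather than about the model.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_joint(y_obs, y_mis, mu, sigma):
    """Log joint density of all y values given mu and sigma.

    y_obs: values supplied as data.
    y_mis: missing values, treated as unknowns alongside mu and sigma.
    Assumption for this sketch: flat priors on mu and sigma (sigma > 0).
    """
    if sigma <= 0.0:
        return -math.inf
    return sum(normal_lpdf(y, mu, sigma) for y in list(y_obs) + list(y_mis))

# Hypothetical values: the third measurement is missing, so it is passed
# as an unknown rather than as data.
lp_split = log_joint(y_obs=[1.2, 0.7], y_mis=[0.9], mu=1.0, sigma=0.5)

# The same three numbers, all passed as data, give the same density.
lp_all_data = log_joint(y_obs=[1.2, 0.7, 0.9], y_mis=[], mu=1.0, sigma=0.5)
```

A sampler would update `y_mis` along with `mu` and `sigma`. The bookkeeping needed to split one vector into an observed part and a missing part is what makes such models awkward to write.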

**Goldilocks and the Three Academics**

For some reason, mathematicians seem to be immune to problems (1) and (3). In fact, I feel they often go too far, which is why this is such a Goldilocks issue. What’s an inscrutable step or bit of math to one person (contact manifolds anyone?) is like counting on your fingers to a more advanced mathematician (or statistician). You can’t write an intro to differentiable manifolds (or hierarchical regression) into every one of your papers. So you have to write books and software that provide an infrastructure for writing simple papers, but then you get dinged because it’s “too easy” or alternatively because people don’t read the book and it’s “too hard.”

P.S. I finally met Phil of “this post is by Phil” fame! He said that no matter how he qualifies his posts, many commenters assume they’re written by Andrew!

Bob, thank you for your post. It was very insightful! The issue of what and how much to explain about your statistical approach can be especially complicated if you’re writing for colleagues in your specific field and using more complex models. You can’t expect them to know the models, but you also can’t expect them to be patient about reading lots of technical background. You always have to find some kind of middle ground between explaining the approach thoroughly and not becoming too technical. Even pure methods papers in sociology tend to avoid technicalities like algebra, maybe a little too much. A detailed discussion of how to program it in a specific software package and programming language might also be problematic, as it might exclude people who are not comfortable with that software and language.

P.S. There are people who actually like to do Algebra ;-)

Bob:

Agree with you. Once someone told me that the way to get published is to write something so complex and intimidating reviewers will be shocked and awed. I think it actually works. And is also one reason why publishing DAGs is hard. Editors likely think you are doodling.

PS on slide 20 of your recent presentation (linked in a previous post), under Stan statements. Have you considered having not just “assignment” but “causal assignment”? Not the same thing. I think it can open many possibilities.

In my experience, reviewers will reject things that they don’t understand at all.

What is “causal assignment”? In Stan, variables that can be assigned to are just like programming language variables.

That sounds more a feature than a bug. What use does a review have if a reviewer starts blindly accepting things he does not understand?

Bob:

A causal assignment is a way of encoding a causal diagram in a programming language, because y = x is not the same as y := x, where the latter is interpreted as, say, x causes y.

Once you have that coded in you can use d-separation algorithms on a system of equations to, for example, correctly impute data. But I don’t know enough about Stan to know how this might play out.

PS Put differently, if you give me a text file with a Stan model in it, I cannot tell from the syntax alone whether the model is purely predictive (e.g., ice cream sales predict drowning) or causal (i.e., ice cream sales cause drowning). IMHO this is an important ambiguity.

Sure you could add a comment at the top stating whether the model is causal or predictive, but the idea is to declare it to the environment as such, so the causal structure can be exploited algorithmically.

Re rejection I agree with you in general, my comment was tongue in cheek.

But at times people (myself included) are not aware of whether they understand something or not. Much like “unknown unknowns”. Presumably much of the criticism in this blog about published NHST falls into this latter category. (“understood misunderstandings”, or “misunderstood misunderstandings”?)

Fernando, Bob:

A natural way to implement causal inference in Stan as is would be to formally define the joint distribution of (y^T, y^C), that is, the outcomes under the treatment and the control, given any predictors in the model; Stan should then be able to just crunch through everything. Really this should be part of a larger project of implementing causal models such as instrumental variables as measurement error models in Stan. And a valuable project this would be, I think!
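As a rough illustration of what defining that joint distribution could look like, here is a Python sketch rather than Stan code. Everything in it is an assumption made for the example: the two potential outcomes are taken to be independent normals with a constant treatment effect `tau`, there are no predictors, and the names (`log_joint`, `y_cf`, `mu_c`) are hypothetical. Each unit's unobserved potential outcome is treated as one more unknown.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_joint(y_obs, treated, y_cf, mu_c, tau, sigma):
    """Log joint density of (y^T, y^C) over all units.

    y_obs[i]:   the outcome actually observed for unit i.
    treated[i]: True if unit i received the treatment.
    y_cf[i]:    the unobserved (counterfactual) outcome for unit i,
                treated as a parameter.
    Assumed model (illustrative only):
      y^C ~ Normal(mu_c, sigma), y^T ~ Normal(mu_c + tau, sigma),
      with y^T and y^C independent.
    """
    lp = 0.0
    for y, t, cf in zip(y_obs, treated, y_cf):
        # Sort the observed and counterfactual values into (y^T, y^C).
        y_t, y_c = (y, cf) if t else (cf, y)
        lp += normal_lpdf(y_c, mu_c, sigma)
        lp += normal_lpdf(y_t, mu_c + tau, sigma)
    return lp

# Hypothetical two-unit example: unit 0 treated, unit 1 control.
lp = log_joint(y_obs=[3.0, 1.0], treated=[True, False], y_cf=[1.0, 3.0],
               mu_c=1.0, tau=2.0, sigma=1.0)
```

Even in this toy version `y_cf` adds one unknown per unit, and a realistic model would also need to say something about the dependence between the two outcomes.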

Andrew:

Your suggestion works but the problem is practical.

Potential outcomes proliferate tremendously in any marginally complicated problem.

Fernando:

I agree with you completely about the proliferation of potential outcomes. One reason I think it would be good to program up some of these examples in Stan is so that we can more easily and routinely reach the research limit in these settings, and then this might lead the way to further development in this area.

Just as, by analogy, the routine use of hierarchical models for large and complex problems has motivated lots of research on topics ranging from prior distributions to model checking to measures of model fit.

Andrew:

If I understand you correctly I guess the issue is whether you want to spend more resources going down the potential outcomes path, to see whether problems can be overcome, or switch lanes early and go on a different path.

Ex ante we don’t know with certainty what will work best. Yet my _strong_ prior is to switch lanes (which I have done). However, to get others to switch lanes it might be desirable to (a) show them first that the traffic ahead is not going away soon, and (b) show them the empty lane next door.

I think it is different with hierarchical models. To continue with the analogy this is a case where there is traffic ahead but less traffic than in the other lanes. Thus it makes much sense to push ahead.

Fernando:

I think it makes sense to send vehicles down both routes. It’s just that, in this case, I think someone could make direct progress implementing potential-outcomes and instrumental variables in Stan right away, whereas a lot more work would be required to get your approach going in Stan. Of course in the meantime you and others can continue to work on this stuff using other software, or someone could do the work to extend Stan or build a wrapper for Stan to do what you want.

Yes, I see your point about immediate progress.

Andrew, I don’t know if you already knew this, but David Drukker is currently implementing potential outcome models with the teffects command in Stata.

http://www.stata.com/meeting/nordic-and-baltic13/abstracts/materials/se13_drukker.pdf (Didn’t find the slides from the most recent presentation in Hamburg but they seem to be mostly the same, anyway)

http://www.stata.com/manuals13/te.pdf

He’s most recently working on the implementation of estimation of quantiles of potential-outcome distributions

http://ideas.repec.org/c/boc/bocode/s457854.html

Daniel:

Thanks for letting me know. We’re also planning to implement a Stan link from Stata so it will be good for us to take a look at this at some point.

The Goldilocks metaphor really fits — and there’s no way around the problem. The old saw, “Know your audience” attempts to address it, but most audiences do include people with a wide variety of backgrounds. So we need to make compromises — and be tolerant when people don’t do it just the way we’d like it.

Long live Bayesian priors!