Transitioning to Stan

Kevin Cartier writes:

I’ve been happily using R for a number of years now and recently came across Stan. Looks big and powerful, so I’d like to pick an appropriate project and try it out. I wondered if you could point me to a link or document that goes into the motivation for this tool (aside from the Stan user doc)? What I’d like to understand is, at what point might you look at an emergent R project and advise, “You know, that thing you’re trying to do would be a whole lot easier/simpler/more straightforward to implement with Stan.” (or words to that effect).

My reply: For my collaborators in political science, Stan has been most useful for models where the data set is not huge (e.g., we might have 10,000 data points or 50,000 data points but not 10 million) but where the model is somewhat complex (for example, a model with latent time series structure). The point is that the model has enough parameters and uncertainty that you’ll want to do full Bayes (rather than some sort of point estimate). At that point, Stan is a winner compared to programming one’s own Monte Carlo algorithm.

We (the Stan team) should really prepare a document with a bunch of examples where Stan is a win, in one way or another. But of course preparing such a document takes work, which we’d rather spend on improving Stan (or on blogging…)

24 thoughts on “Transitioning to Stan”

  1. I’m pretty much at the point of forgetting the BUGS/JAGS language, having been using Stan exclusively for about 6 months. It’s not any harder to code and produces a game-changing speed-up in many settings, especially if you want to do some kind of simulation study (http://robertgrantstats.wordpress.com/2014/01/31/need-to-do-a-simulation-study-on-a-bayesian-model-use-stan/), where you can repeatedly run the compiled program for your model. There’s only enough space in my brain for a few languages, and R and Stata constantly jostle for space in data processing, regression models, etc. Once my BUGS/JAGS cortex becomes run down, there’s little point in me trying to preserve it… so I would suggest taking the leap. rstan is a great interface too.

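    To give a flavour of the simulation-study trick: compile once, then reuse the compiled model across simulated data sets via rstan’s fit argument. A minimal sketch, where make_data() is just a stand-in for whatever generates each simulated data set:

        library(rstan)

        # compile the model once (a token run just to get the compiled object)
        fit0 <- stan(file = "model.stan", data = make_data(1), chains = 1, iter = 10)

        # reuse the compiled model for every simulated data set -- no recompilation
        results <- vector("list", 100)
        for (s in 1:100) {
          results[[s]] <- stan(fit = fit0, data = make_data(s), chains = 4, iter = 2000)
        }
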
    If you want a collection of use cases but don’t have the time, you could always organise a Stan conference!

    • Robert: The link is broken, but then I noticed one of your posts that I wanted to comment on, and I’ll do it here, as the issue is often discussed on this blog – _forcing_ things that need to be corrected to actually be corrected.

      It is this link http://robertgrantstats.wordpress.com/2014/04/08/including-trials-reporting-medians-in-meta-analysis/ and specifically the Hozo et al. paper you pointed to.

      In my thesis http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf I made these comments after communicating with one of the authors:

      “In the given reference, the example purported to demonstrate the value of such a method that was based on using medians and ranges of within patient change scores, but these were actually unavailable to the authors. Instead the authors “imputed” values using information on group mean scores over various treatment applications on different days, even when considerable drop out occurred [private communication]. Furthermore, the non-parametric conversions they used were based on inequalities between sample medians and ranges and SAMPLE means and standard deviations which are only indirectly relevant – it is population means and standard deviations that are of inferential relevance.”

      So authors don’t make corrections to their papers, and those papers keep on misleading folks.

      As for how best to include trials that report incomplete stats (or just not the stats you want) in a meta-analysis these days: just do ABC, as in this case the posterior from ABC is actually the one you want (you only have the summaries you have), and if you just want the likelihood, divide the posterior by the prior.

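      To spell out that last step: ABC with summaries s gives draws from p(theta | s) ∝ p(s | theta) p(theta), so the summary-based likelihood is recovered, up to a constant, by p(s | theta) ∝ p(theta | s) / p(theta).
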
        • Bill’s got the link right. I think I forgot to close the href tag with a > and then there’s no going back!

          Keith – I think you are right that the most general approach is to use ABC to get the pseudo-data and hence the stats you do want, but speed and tuning worry me a bit. Also, I think there are a number of scenarios that crop up a lot which could be pre-programmed. I’m happy to consider other options though; I haven’t ruled out ABC and I’d like to come back to it. If you’re going to JSM this year, look me up – I’ll be presenting on it and we can thrash out the different approaches.

          Regarding Hozo et al., that’s an interesting insight. I like their inequality approach, but it doesn’t interest me hugely in practical terms because there’s still a lot of uncertainty floating around – how could there be any less!

  2. I vote that the Stan team spend their time (1) making it easier to specify models with discrete parameters and (2) figuring out a way to handle missing data as easily as JAGS does. With those two changes, Stan would be a no-brainer first Bayesian modeling language to learn. Right now, I consider it a bit of a tossup.

        • There are some examples on the Google discussion group. The two-cent wrap-up: using the built-in distributions via ‘~’ is a little slow; to speed things up, you want to increment the log posterior manually, which requires you to write down the formula and hard-code it in Stan’s syntax.

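          For a concrete picture, a minimal sketch of the two styles for a normal model (assuming data int N and vector[N] y, and parameters real mu and real<lower=0> sigma):

            model {
              // built-in distribution via '~'
              y ~ normal(mu, sigma);
            }

            // versus the hand-coded log density with constants dropped,
            // added directly to the log posterior:
            model {
              increment_log_prob(-0.5 * dot_self((y - mu) / sigma)
                                 - N * log(sigma));
            }
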
    • I’ve been really hoping that the write-your-own-posterior version of discrete missing data would become available soon, but it does look like it will take a while, as they are lacking a sufficiently interested party and the changes required are pretty deep (https://groups.google.com/forum/?fromgroups#!topic/stan-users/HL0TgFxlvLU). To me, the workable alternative is to alternate Stan steps and discrete steps from R. That already works, but it should soon get much better, as the Stan group does seem to be actively working on making it possible to restart a sampler without having to re-do the warm-up phase (https://groups.google.com/forum/?fromgroups#!topic/stan-dev/gnLLwEI4zN4). That alone would make it pretty fast.

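      A rough sketch of that alternation in rstan – sample_z() is a hypothetical function drawing the discrete variable from its full conditional in R, and the Stan model treats z as data. As it stands, every call re-does warm-up, which is exactly the cost mentioned above:

        library(rstan)

        fit <- stan(file = "model.stan", data = c(dat, list(z = z)), chains = 1)
        for (s in 1:n_sweeps) {
          post <- extract(fit)        # continuous draws from this sweep
          z <- sample_z(post, dat)    # hypothetical discrete update in R
          # (a proper sampler would also pass init= built from the last draw; omitted here)
          fit <- stan(fit = fit, data = c(dat, list(z = z)), chains = 1)
        }
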
    • No time frame for either (1) or (2).

      The first problem with (1) is that we don’t see a way to evaluate the probability of a discrete parameter in the posterior conditioned on everything else. We can do it by brute force evaluating the entire log probability function, but that would be very inefficient in most cases (though it wouldn’t need gradients). Options would include somehow marking up the relevant part of the model for the discrete parameter (something we’re thinking about in other contexts, too) or designing a language to provide the conditional, which would still be a lot of work compared to BUGS/JAGS.

      The second problem, which is related, is that we don’t know how to do the conjugacy analysis to do it efficiently. So we’d have to do something like slice sampling everywhere, I think.

      The third problem is that we don’t know how it interacts with HMC or how to adapt the slice sampler.

      Having said all of this about discrete parameters, marginalizing them out where possible leads to much more efficient sampling. I agree that it’s more complex on the modeling side. We’re going to try wrapping up popular use cases in functions, but we’re not sure how far we can get.

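      To give a sense of what marginalizing looks like, here is a minimal sketch of a two-component normal mixture with the discrete indicators summed out (the ordered constraint just breaks the label-switching symmetry):

        data {
          int<lower=1> N;
          vector[N] y;
        }
        parameters {
          real<lower=0,upper=1> theta;
          ordered[2] mu;
          real<lower=0> sigma;
        }
        model {
          for (n in 1:N)
            increment_log_prob(log_sum_exp(
              log(theta) + normal_log(y[n], mu[1], sigma),
              log1m(theta) + normal_log(y[n], mu[2], sigma)));
        }
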
      As to missing data (2), that’s a whole different can of worms. We don’t have anything like N/A in the basic language — it’s just C++ doubles. So it’s more of an R concept than a Stan concept. But assuming we could pass in something like markers for missingness, there are issues with how to do the I/O and define the log probability function with the ad-hoc indexing based on run-time missingness. One alternative might be to define a more limited language that’d make it easier.

      • I’ve just been reading about Stan, and am looking to dive into its code base myself, as this is a really interesting project. I can see the scalability problem is nothing trivial here. And of course, being C++, you have invested a lot of development time, though I would think performance should be rather spectacular. I’m surprised you’re hitting ceilings in the tens of thousands, though I don’t know what you’re benchmarking Stan on or what the processing time even looks like.

        From what I could tell, the work seems to be entirely done in memory – is that correct? What are your ideal goals about what hardware Stan should be optimized for that weren’t evident on the Stan landing site? Looks like a fun project.

  3. “We (the Stan team) *should* really prepare a document with a bunch of examples where Stan is a win, in one way or another. But of course preparing such a document takes work, which we’d rather spend on improving Stan (or on blogging…)”

    Yes, because doing that generates more usage/feedback, which is often the most valuable single thing in helping guide improvements that are truly useful beyond the team. There were numerous examples of this inside Bell Labs in the 1970s, as per the Small is Beautiful talk here – certainly things like C, awk, and, I think, S, among others.

  4. The BUGS software homepage has this warning; what exactly does it mean? Can someone elaborate?

    …MCMC is inherently less robust than analytic statistical methods.

    • It means that you’ve got to check carefully that your MCMC method has worked for your problem (convergence, mixing, etc.). If you have analytic expressions for posteriors, you don’t have this problem (likewise for semi-analytic results, like a posterior with a 1D integral that you need to compute numerically).

      It’s the boilerplate warning on any complicated numerical method for solving difficult problems. (And it’s nice to see it explicitly written down!)

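      In rstan, the basic checks are one-liners (assuming fit is your stanfit object):

        print(fit)      # check that Rhat is near 1 and n_eff is reasonable for each parameter
        traceplot(fit)  # visually check mixing across chains
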
    • Noah:

      Yes, we can already handle such models (at least some of them). Just write down the joint density and go with it. Soon we will have functions implemented and this will make it easier to use such models as building blocks in more complicated systems. But you can already specify an AR model, either by directly writing y[t] given y[t-1] in a loop, or by constructing the covariance matrix and writing the multivariate normal.

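      For example, writing the y[t]-given-y[t-1] loop directly, an AR(1) model is just:

        data {
          int<lower=2> T;
          vector[T] y;
        }
        parameters {
          real alpha;
          real beta;
          real<lower=0> sigma;
        }
        model {
          for (t in 2:T)
            y[t] ~ normal(alpha + beta * y[t-1], sigma);
        }
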
      • But I have no idea how to code when there are missing values in time series… Is it possible? I think I need to specify missing values as parameters, but then how can I keep the serial nature of the time-series data???

        • Chen:

          This is discussed in the chapter on “Missing Data & Partially Known Parameters” in the Stan manual (that’s chapter 7 in the manual for Stan Version 2.4.0).

        • Ah, thank you very much, Andrew! OK, I am going to create a mixed vector that contains known and unknown values in a transformed parameters block, and use it in time series analysis.

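          Something like this, I think – a sketch combining the AR(1) idea above with index arrays, passed in as data, that say which time points were observed and which were missing:

            data {
              int<lower=2> T;
              int<lower=0,upper=T> T_mis;              // number of missing time points
              int<lower=1,upper=T> ii_obs[T - T_mis];  // indices of observed values
              int<lower=1,upper=T> ii_mis[T_mis];      // indices of missing values
              vector[T - T_mis] y_obs;
            }
            parameters {
              vector[T_mis] y_mis;                     // missing values as parameters
              real alpha;
              real beta;
              real<lower=0> sigma;
            }
            transformed parameters {
              vector[T] y;                             // mixed vector of known and unknown values
              for (i in 1:(T - T_mis))
                y[ii_obs[i]] <- y_obs[i];
              for (i in 1:T_mis)
                y[ii_mis[i]] <- y_mis[i];
            }
            model {
              for (t in 2:T)
                y[t] ~ normal(alpha + beta * y[t-1], sigma);
            }
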
  5. Is there a good tutorial specifically for installing Stan and RStan on a Windows PC? I currently use gcc etc. in Cygwin to compile source code, but I doubt that this approach will work for Rtools and Stan. My perception that the installation process for Stan on a PC is a convoluted one is what keeps me using JAGS for the time being. (My work computer is a multicore PC which is powerful enough for most things I wish to do, and is faster than the cluster machine running Linux, which I tend to use only when memory becomes a serious issue.)

    • Rob:

      I just installed Stan on a new PC about a week ago, and it went much more smoothly than a year or so ago, when it took a few emails to the Stan group.

      Suggest you read the online install documents, and if something goes wrong, email for assistance.
