## How to model distributions that have outliers in one direction

Shravan writes:

I have a problem very similar to the one presented chapter 6 of BDA, the speed of light example. You use the distribution of the minimum scores from the posterior predictive distribution, show that it’s not realistic given the data, and suggest that an asymmetric contaminated normal distribution or a symmetric long-tailed distribution would be better.

How does one use such a distribution?

You can actually use a symmetric long-tailed distribution such as t with low degrees of freedom. One striking feature of symmetric long-tailed distributions is that a small random sample from such a distribution can have outliers on one side or the other and look asymmetric.

Just to see this, try the following in R:
``` par (mfrow=c(3,3), mar=c(1,1,1,1)) for (i in 1:9) hist (rt (100, 2), xlab="", ylab="", main="") ```

You’ll see some skewed distributions. So that’s the message (which I learned from an offhand comment of Rubin, actually): if you want to model an asymmetric distribution with outliers, you can use a symmetric long-tailed model.

Of course, if you really want to blow your mind, try this:
``` hist(rcauchy(10000)) ```

If you’re like me, you’ll have to think a little bit about this before you see what’s happening.

1. Here is the model I fit last night after emailing Andrew about it (this refers to p. 143 in the BDA3 book; thanks for the quick answer, Andrew).

One puzzle remains: Martyn Plummer pointed out to me that a t-distribution with 2 degrees of freedom is not going to have finite variance (I didn’t check this yet, I thought that the Cauchy was a t-distribution with df=1 and that that’s the only one with variance undefined), so I should try it out with df=4. However, this doesn’t do the job; the only way I can get the speed of light data reflected nicely in the model is when I use a t-distribution with df=2.

Here is a fully worked example:

If you run the above code and replace df=2 with df=4, the distribution of minimum values from the posterior predictive distribution will not cover the observed minimum of -44. It only works with df=2 (or less, I guess, didn’t try that).

I should say that my real goal was to fit a “linear mixed model” in JAGS, where there were several extreme values. This is a similar problem to the speed of light problem on p. 143 of Gelman et al 2014, except in a hierarchical setting, and except that instead of tracking the distribution of the minimum values I am tracking the distribution of the maximum values.

So, what I did was just the same thing as in the example above: I generate y[i] from a t-distribution with dfs=2, with mu[i], and precision tau estimated using JAGS. The code I used is here:

https://gist.github.com/vasishth/7879095

I think that this illustrates a very cool aspect of bayesian modeling that would be difficult to achieve in a frequentist setting (using lmer, for example): you can flexibly change the underlying distribution to model the relevant properties of your data.

I guess I am suffering from beginner’s over-excitement syndrome.

• Andrew says:

1. There’s no problem using a distribution with infinite variance.

2. Use Stan. That’s the best way, looking forward to more complicated models and bigger datasets you might work with.

• Yes, OK. I’ll make the switch over to Stan during the Christmas break. Although I have to say that JAGS is very, very good for an entry into bayesian data analysis. Congratulations to Martyn Plummer for doing such a fabulous job.

• Andrew says:

I think Jags (and Bugs) are great. Without them, we might never have developed Stan. Stan is just newer, that’s all. I think that starting from scratch with Stan from scratch is as easy as starting with Jags or Bugs from scratch. It’s just that you’re already familiar with Jags (which is fair enough).

• > I think that starting from scratch with Stan from scratch is as easy as starting with Jags or Bugs from scratch

I wouldn’t go quite that far. There are good reasons for Stan to deal with types the way it does (e.g. with formal type declarations and type errors), but newcomers find some of it confusing. The differences between arrays and matrices are especially tricky (matrices can’t hold ints, row versus column major ordering, etc.).

I’m not criticizing Stan at all–there are clear benefits to all of this in terms of checking for errors and in terms of performance–but I do think it adds an extra layer before a brand new user can get their model up and running.

• Andrew says:

David,

I think it’s all about what you’re used to. When I programmed in Bugs/Jags, I’d often end up with bugs in my models having to do with all the looping required, as well as awkwardness such as the parameterizations of the normal distribution that would require many lines of code for a single part of the model. Stan is more direct that way, also Stan allows print statements which makes debugging easier. I agree that Stan has its own peculiarities too. In either Stan or Bugs/Jags, there is a large number of preprogrammed models that you can get running right away; after that it’s all about where you can go from there, and I think that, at first, it’s always seems easier on the system you’ve already learned.

• I’m a good baseline for this issue of usage and switching: if I can do something, then anyone can do it. Even my 7 year old son has formally established that I am an idiot.

I am currently teaching bayesian statistics to a graduate student class in linguistics. By a process of natural selection, I only get the very best and most dedicated (and fearless) students in my class. I doubt, however, whether I could get them to get enthusiastic about Stan. The installation instructions are pretty obscure and will irritate them. I would have appreciated a short version like this one (below) on the opening page of Stan instead of being buried in Appendix B of the manual.

“Installation instructions (Mac): We assume you have g++ version xx and whatever the other thing was version yy. If you don’t, click here to find out how to install these (actually, here, I would lose some people already). Download this tar.gz and put it in some location (any location) in your computer. Then, as super user, install the package inline and (I forgot what the other one was). Finally, install RStan using the following commands (as given).”

I should mention that I managed to install Stan easily, so if I can do it, anyone can. But the instructions should be up front, and brief, on the opening page. IMHO, JAGS has a much easier installation procedure (or maybe I forgot about any pain I suffered during installation)!

The other thing is, most of the good books that provide an entry into practical aspects of bayesian methods, such as Gelman and Hill 2007, and Lunn et al 2012, provide BUGS code, and I think the beginning student has enough to worry about without having to figure out how to translate this published code to Stan.

BUGS seems to be historically entrenched in bayesian courses; pretty much any course out there (including the one I’m doing at Sheffield) uses WinBUGS (which I consider to be a horrible pain; in particular, I find the model files physically painful to look at and work with).

Stan seems to be aimed more at the more experienced intermediate or advanced user, i.e., people hitting a wall with JAGS because their datasets are getting bigger and/or their models are getting more complicated.

I do appreciate the fact that Stan has a huge number of models available for download. One thing that would help is to put up a complete minimum working example, with a link the relevant data right next to it. E.g., I have to first look at the radon model in Stan, then go and download the radon data from the ARM website. It will make new users’ lives much easier if everything is in one place. Maybe I’m just lazy that way (no, I *am* lazy). But I think that I’m a whole lot less lazy than a lot of people I meet in my life.

• Andrew says:

Shravan:

We can definitely do better to make Stan accessible and I appreciate your comments.

One thing, though: You write, “Stan seems to be aimed more at the more experienced intermediate or advanced user, i.e., people hitting a wall with JAGS because their datasets are getting bigger and/or their models are getting more complicated.” I think this just has to do with Bugs/Jags coming first so there are already people used to it. If you were to learn Stan first, I think there’d be an equal if not greater difficulty if you then wanted to switch to Bugs or Jags. To put it another way: Bugs deserves full credit for coming first, and Jags deserves full credit for coming second. Stan is the third program of this type, and it’s only fair that it is compared to what came before. But I do think that many of these ease-of-use issues are coming conditional on people (not just users, also teachers and textbook writers) already being familiar with the earlier packages. I think this will change. For example, BDA has a Stan appendix, and the new version of ARM will also be in Stan.

• Andrew,

Those are three very good points (loops, parameterization, and print statements). Also, thanks for the reminder (in another comment) that BDA3 has a Stan appendix. I should go read through it tonight.

• @Shravan: thanks for the feedback. We’re always looking to understand where our users get stuck so we can make things easier on them.

Our longer-term plan is to update all the books on practical Bayesian modeling to use Stan :-) We’ll probably write our own textbook before we complete upgrading everyone else’s. For example, we’ve been helping some people working with Erik-Jan Wagenmakers update his book on Bayesian cognitive modeling with Stan models (though the text will still use BUGS models, presumably). We’ve already done almost all of the BUGS models. And the models from Gelman and Hill. That doesn’t solve the problem you point out that the texts themselves refer to the BUGS models. We hope it’s helpful in transitioning.

The underlying reason for Stan’s install complexity is that we need a working C++ toolchain for compiling user-specified models. BUGS and JAGS are both interpreted. Also, I think you’re commenting on our modeling-language manual, not on the RStan getting started page, which is where we want people to go to install RStan. The problem we have there is getting people to do the first steps we list and cross-reference under pre-reqs.

Our current plan is to move the command-line instructions out of the modeling language manual into their own manual. At that point, it’ll make much more sense to put them all up front as we’ve done with the Python and R install instructions. The getting started with R page does have the kind of instructions you want, so maybe you were looking in the wrong location?

JAGS didn’t start with a Mac installer in the form of a .dmg file — a user contributed that. We’d be more than happy to host such a contribution. We’ll probably eventually get around to doing one ourselves, but it’s going to be trickier than for JAGS because of the need to install C++ tools for users that don’t have them. The R2jags and R2WinBUGS programs are also tricky to install in R. Rather than Stan’s dependence on a C++ toolchain, they depend on a previous system install of JAGS or BUGS. A year or two before I started working with Andrew, I completely failed at installing R2jags and had to come up to Columbia to beg one of Andrew’s grad students for help.

As far as the typing being more complex, that was a tradeoff we went into with our eyes open. I agree it’s a bit trickier to write than in BUGS/JAGS (assuming you can figure out their use of indexing and when you need brackets — I always have to look that up and I think it varies between JAGS/BUGS). But as Andrew pointed out, our intent was to make the models more explicit and hence more readable and to catch specification bugs at compile time and data sizing issues at data-load time. I’ve already been once around this block back when I was a computational linguist working on ALE — we got rather large amounts of pushback from the LFG and PATR crowd about the typed nature of ALE.

A major design decision where I was expecting substantial pushback that never came was in declaring data vs. parameters. In BUGS/JAGS, this is an implicit run-time decision, whereas in Stan it’s part of the model. I find this is immensely helpful in reading a model and its intent, but it does come at the cost of making a single model less usable for different configurations of known vs. unknown variables.

This means missing data problems are harder in Stan than in BUGS/JAGS, both of which support vectors that are partly data and partly parameters (data is known values, missing parameters given as NA).

Discrete models are also relatively painful in Stan. These require more sophistication about statistics to implement in Stan than in BUGS/JAGS because we don’t have discrete samplers, and so discrete parameters need to be marginalized out. There’s good motivation for doing that in terms of sampling efficiency, especially in the tails, but it puts a heavy burden on novice-to-intermediate math stats ability. One place we feel this keenly is in mixture models and missing discrete data. I think the right thing to do for Stan is to add a higher-level mixture abstraction, but that’s not even on our short-term to-do list (but it could be on yours or someone else’s who wants to contribute).

We’re a bit cagey about republishing people’s data on the Stan web site. I had the same problem when I was working on LingPipe (my open-source project between ALE and Stan), where we didn’t have permission to redistribute most of the data we used in tutorials (blame the LDC, NIH, and academics who are prone to distribute data under “academic use only” licenses). We could probably distribute Andrew’s data for ARM assuming we distributed it with whatever licensing and other caveats that came with it. On the whole, though, we’re shying away from including anything that doesn’t come with clear open-source redistribution terms.

We need more time in the day to do all of these things!

• @Bob

Thanks for the insights into your process. I always find stuff like this fascinating, and I think you have an excellent perspective on the tradeoffs you’ve made. Keep up the great work!

2. Corey says:

You could also do something like use a skew-normal or skew-t error distribution with a prior for the skewness parameter that induces a symmetric marginal data distribution (a.k.a. prior predictive distribution). Working out the right priors looks to be either a pain in the ass, or if you’re like me, fun.

3. Ben Bolker says:

For what it’s worth, AD Model Builder (http://admb-project.org/‎) allows a good deal of (frequentist-style) flexibility in constructing hierarchical (and other) models with arbitrary distributions, although you do have to “roll your own” more than with (say) lme4. Learning curve is probably similar to WinBUGS/JAGS/Stan if starting from scratch.

• ADMB does auto-diff, too. It was among the systems we studied when designing Stan.