Comments on: Justify my love

By: Keith O'Rourke

Keith O'Rourke — Thu, 05 Apr 2018 12:40:54 +0000

> shouldn’t be looking for the one perfect prior
That comment and the overall post have made me less uncertain that it is really all about getting a good enough (for exactly what?) probabilistic representation (model) for how the data came about and ended up being accessible to you (the analyst).

(e.g. “how you should best represent what you plan to act upon prior to acting in the world. The “how one should best represent” to profitably advance inquiry being logic” http://statmodeling.stat.columbia.edu/2017/09/27/value-set-act-represent-possibly-act-upon-aesthetics-ethics-logic/ )

Unfortunately, the bad meta-physics that supports the idea that there must/should be one perfect/true/best model leads many into an endless dizzying spiral towards the black hole of certainty – that can never be reached (as it presupposes direct access or correspondence to reality).

An example that comes to mind is Dennis Lindley’s quest for an axiom system for statistical inference as well as being overly reluctant to break from it https://www.youtube.com/watch?v=cgclGi8yEu4 (e.g. at around 18:45).

By: Daniel Simpson

Daniel Simpson — Thu, 05 Apr 2018 11:28:23 +0000

In reply to Charles Driver. What I meant to say was that a light tail means you can't escape the heavy region around zero without a lot of data, so that region (the bulk of the prior) needs to be wide enough to contain all the reasonable parameter values.

By: Charles Driver

Charles Driver — Thu, 05 Apr 2018 09:43:35 +0000

In reply to Dan Simpson. You're saying the light tail strongly affects inferences around the heavy region towards zero? This is not so intuitive...

By: Dan Simpson

Dan Simpson — Wed, 04 Apr 2018 15:53:19 +0000

In reply to Björn.

I’m not sure if anyone has derived a Jeffreys’ prior for this model. I feel like it would be hard, but I’m genuinely not sure.

As for the case where you’re pretty sure there is over-dispersion, the base model at zero may still do well (in particular it’s useful for the case where your sample doesn’t show much overdispersion). Alternatively, you might want to put the base model somewhere else in the space and build a PC prior off that. An example where this has been done for the correlation parameter in a bivariate normal distribution is here (https://arxiv.org/pdf/1512.06217.pdf).

I think you’ll be fine in this case as long as your tail isn’t too light. Maybe a Student-t-7 would be a good idea. But the end point is you should try a couple of priors and see how they go on some existing data. You should also simulate some data that you think is realistic, but isn’t near the base model and see how the prior performs. If there’s anything that I wish I’d brought out more in the post, it’s the idea that we shouldn’t be looking for the one perfect prior, but rather a set of “good enough” priors that we can compare and check.

By: Dan Simpson

Dan Simpson — Wed, 04 Apr 2018 15:19:11 +0000

In reply to Charles Driver. This is a problem I've often encountered with a half-normal prior. The tails are too light so if the scale is even slightly wrong you are massively penalized for it. The exponential seems to do much better, as does the t with 3-7 degrees of freedom (although lower dof has a higher chance of giving divergences)
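To make the tail comparison above concrete, here is a minimal sketch (not from the thread) of how much log-density each prior gives up, relative to its value at zero, when the parameter sits several prior scales away. The unit scale for every prior and the function names are assumptions chosen for illustration.

```python
import math

# Log-density relative to the density at zero, for priors on x >= 0.
# All priors use the same (assumed) scale so only the tail shape differs.

def log_penalty_half_normal(x, scale=1.0):
    # Half-normal: penalty grows quadratically in x / scale.
    return -0.5 * (x / scale) ** 2

def log_penalty_exponential(x, scale=1.0):
    # Exponential with mean `scale`: penalty grows linearly.
    return -x / scale

def log_penalty_half_t(x, nu, scale=1.0):
    # Half-Student-t with nu degrees of freedom: penalty grows logarithmically.
    return -0.5 * (nu + 1) * math.log1p((x / scale) ** 2 / nu)

if __name__ == "__main__":
    print("x   half-normal  exponential  half-t(7)  half-t(3)")
    for x in (1.0, 3.0, 10.0, 30.0):
        print(
            f"{x:<4g}"
            f"{log_penalty_half_normal(x):>11.2f}"
            f"{log_penalty_exponential(x):>13.2f}"
            f"{log_penalty_half_t(x, 7):>11.2f}"
            f"{log_penalty_half_t(x, 3):>11.2f}"
        )
```

If the scale is misjudged by a factor of ten, the half-normal column shows a far larger log-density penalty than the others, which is one way to read "massively penalized".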

By: Dan Simpson

Dan Simpson — Wed, 04 Apr 2018 15:17:35 +0000

In reply to Ben Goodrich.

Yeah – this is a problem that you get when you transform bounded parameters. Honestly I don’t know how to deal with this intersection of prior specification and computation, but it’s definitely something that needs thought.

By: Dan Simpson

Dan Simpson — Wed, 04 Apr 2018 15:15:45 +0000

In reply to Erin Jonaitis. A boomer?

By: Erin Jonaitis

Erin Jonaitis — Wed, 04 Apr 2018 14:18:49 +0000

In reply to Corey. Personally, I care about the topic but got distracted by wondering, what's the multiplicative inverse of a millennial?

By: Björn

Björn — Wed, 04 Apr 2018 11:24:23 +0000

Great post and very timely, because some colleagues and I were just having a discussion on priors for the over-dispersion parameter in negative binomial regression models. In fact, another thing we contemplated is what the Jeffreys’s prior for such a model is. I know, I know, not a favored concept in this crowd, but I’d be interested if someone knows. One of the appeals to my mind is that the Jeffreys’s prior does behave reasonably well in a number of other situations. Additionally, I suspect it should permit an independent prior on the over-dispersion parameter, because I believe asymptotically the estimates of the rate and over-dispersion parameter are independent.

In any case… We are sort of in a funny situation, where all prior information strongly suggests that the dispersion parameter is >0. Usually the over-dispersion parameter is estimated to be >0 even in relatively large studies, which seems logical to me when we look at medical events happening to patients and we do not put much information on the patients in the model. I suspect our prior clearly should not prevent the model from finding the case where there are no random effects, but one of the worries is really that the prior should definitely not favor it (or values of the over-dispersion parameter near 0) “too much”. Whatever that means, but in a sense there would be an inappropriately precise / insufficiently uncertain estimate of any treatment effect (if we are talking about a randomized controlled clinical trial), if we concentrate too much posterior mass near the value zero for the over-dispersion parameter.

I wonder what kind of prior would work sensibly as a weakly informative prior in this sort of setting… Half-normal (or half-T) on the untransformed dispersion parameter (=quite flat towards zero)?!

By: Charles Driver

Charles Driver — Wed, 04 Apr 2018 10:05:43 +0000

In reply to Ben Goodrich. Yes, I did some limited experimentation with a half normal on the sd for '0 true variance random effects' at some point and it wasn't working as well as frequentist random effects. Then if I go to something 'more frequentist', like a normal(-1,10) on the log of the sd, sampling gets hairy and transitions get divergent.
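One way to see why the normal(-1,10) prior on the log of the sd might be hard to sample from is to check how much prior mass it puts at extreme values of the sd. The sketch below is an illustration only: it assumes the second argument is a standard deviation (the Stan convention), and the unit-scale half-normal used for comparison and the 1e-3 / 1e3 cut-offs are arbitrary choices.

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_mass_below(eps, mu=-1.0, sigma=10.0):
    # P(sd < eps) when log(sd) ~ normal(mu, sigma).
    return normal_cdf((math.log(eps) - mu) / sigma)

def half_normal_mass_below(eps, scale=1.0):
    # P(sd < eps) when sd ~ half-normal(0, scale).
    return math.erf(eps / (scale * math.sqrt(2.0)))

if __name__ == "__main__":
    tiny, huge = 1e-3, 1e3
    print("P(sd < 1e-3), log-scale prior :", lognormal_mass_below(tiny))
    print("P(sd > 1e3),  log-scale prior :", 1.0 - lognormal_mass_below(huge))
    print("P(sd < 1e-3), half-normal     :", half_normal_mass_below(tiny))
    print("P(sd > 1e3),  half-normal     :", 1.0 - half_normal_mass_below(huge))
```

Under these assumptions, a large share of the log-scale prior sits below 1e-3 or above 1e3, so unless the data are informative the sampler has to explore a very wide range on the unconstrained scale.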

By: Ben Goodrich

Ben Goodrich — Tue, 03 Apr 2018 22:13:16 +0000

I think this deserves more thought:

> The first thing is that it should peak at zero and go down as the standard deviation
> increases. Why? Because we need to ensure that our prior doesn’t prevent the model from
> easily finding the case where the random effect^0 should not be in the model. The easiest way
> to ensure this is to have the prior decay away from zero.

That makes some sense in theory. But if there is any posterior mass in a small neighborhood of zero in the constrained space then there is mass out to negative infinity in the unconstrained space, and there will probably be divergent transition warnings from Stan in practice. So, it seems that you have to thread a needle where you are choosing a prior with a peak at zero in order to get a posterior whose mass is bounded away from zero but concentrated enough near zero that you discover that you are better off without that part of the model.
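The point about mass running out to negative infinity can be sketched numerically. The example below assumes a log transform for the positive standard deviation (u = log sigma) and uses a unit-scale half-normal as a stand-in density with finite height at zero; both are illustrative assumptions, and it looks only at the prior, not at any particular posterior.

```python
import math

def half_normal_logpdf(sigma, scale=1.0):
    # Log density of a half-normal on sigma >= 0.
    return 0.5 * math.log(2.0 / math.pi) - math.log(scale) - 0.5 * (sigma / scale) ** 2

def unconstrained_logpdf(u, scale=1.0):
    # Density of u = log(sigma): add the log Jacobian, log|d sigma / d u| = u.
    sigma = math.exp(u)
    return half_normal_logpdf(sigma, scale) + u

def mass_below(u, scale=1.0):
    # P(log sigma < u) = P(sigma < exp(u)); positive for every finite u.
    return math.erf(math.exp(u) / (scale * math.sqrt(2.0)))

if __name__ == "__main__":
    for u in (-20.0, -10.0, -5.0, 0.0, 1.0, 3.0):
        print(f"u = {u:>6.1f}  log p(u) = {unconstrained_logpdf(u):>10.3f}  "
              f"P(U < u) = {mass_below(u):.3e}")
```

On the left the log density falls off only linearly in u (the Jacobian term), so the density has an exponential tail toward negative infinity, while on the right it collapses very quickly. A likelihood that is roughly flat in sigma near zero would leave that left tail in the posterior as well.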

By: Dan Simpson

Dan Simpson — Tue, 03 Apr 2018 21:10:34 +0000

In reply to Keith O'Rourke. Thanks Keith

By: Keith O'Rourke

Keith O'Rourke — Tue, 03 Apr 2018 20:49:41 +0000

Excellent post.

By: Dan Simpson

Dan Simpson — Tue, 03 Apr 2018 20:06:49 +0000

In reply to Ben Goodrich. I think the idea is that there is a lot of similarity between new and old style random effects. So it's an extension of an existing concept rather than a new one. Also these guys are biostatisticians, who are DRILLED on mixed effects models, so they're building from a common ground.

By: Ben Goodrich

Ben Goodrich — Tue, 03 Apr 2018 19:27:29 +0000

In reply to Aki Vehtari. I also came away asking "Why refer to new-style random effects as random effects?" instead of coming up with a new name for them, at least in a non-Bayesian setting. For posterior distributions, I think it is fine to refer to margins that are common to all groups and margins that are group-specific.

By: Aki Vehtari

Aki Vehtari — Tue, 03 Apr 2018 19:20:56 +0000

Excellent post, and I have just a small comment to footnote 0:
Hodges and Clayton may redefine effects, but there is still the problem of “random”. What if the effects are deterministic but unknown? In general I prefer “unknown parameters” instead of “random parameters”.

By: Sameera Daniels

Sameera Daniels — Tue, 03 Apr 2018 19:19:27 +0000

In reply to Dan Simpson. I was kidding. Which songs?

By: Dan Simpson

Dan Simpson — Tue, 03 Apr 2018 18:11:50 +0000

In reply to Corey. Thanks. What can I say, I love a theme and am not organized enough for my footnotes to be integers.

By: Corey

Corey — Tue, 03 Apr 2018 17:54:27 +0000

This is a fantastic post (up to and including the bold choice to use non-integer and negative footnote numbering (but not including the subsection headings (not that my opinion on them matters))).

By: Dan Simpson

Dan Simpson — Tue, 03 Apr 2018 17:34:43 +0000

In reply to Sameera Daniels. Not a girl and largely imaginary. There was an earlier draft before I added more section titles where it looked a lot like I'd just had a bad break up and I'd taken a time machine to 1998. I added some happy songs to balance :p

By: Sameera Daniels

Sameera Daniels — Tue, 03 Apr 2018 17:24:54 +0000

And the Lucky or Unlucky Girl IS?

By: Dan Simpson

Dan Simpson — Tue, 03 Apr 2018 17:19:59 +0000

In reply to Ben Goodrich.

Thanks for the clarification! I’m not surprised that it works (it’s pretty sensible!). My preference for distributing standard deviations directly is based on generalizability to situations where you don’t want the weights to be a priori exchangeable. This is much more interpretable on a standard deviation scale. But that’s really just choosing what to communicate. The prior comes from the same foundation, it’s just a different expression of the idea.

Now, if you were distributing the total precision across the simplex, I would probably feel differently. An example of a model that does this (or something similar) is the Leroux model in spatial statistics, which I am not fond of. (See equation 3 in this paper https://arxiv.org/pdf/1601.01180.pdf)

By: Ben Goodrich

Ben Goodrich — Tue, 03 Apr 2018 16:59:15 +0000

Slight clarification. The decov prior in rstanarm sets the trace of a KxK covariance matrix to be equal to K times the square of a scale parameter, which in turn has a gamma prior but the defaults are unit shape and unit rate so it is really a unit-exponential. In the case that K = 1, this is just unit-exponential on the standard deviation of a normal distribution, although the whole thing gets scaled by the standard deviation of the errors in the outcome in Gaussian models. If K > 1, then the variances are equal to a simplex vector multiplied by the aforementioned trace and you can put a Dirichlet prior on that simplex. There is also a (Cholesky factor of a) correlation matrix that gets an LKJ prior.

Dan seems to prefer a scaled simplex for the vector of K standard deviations, which is almost the same thing. In any event, the decov prior seems to work well and we have had approximately zero questions on Discourse or the old Google Groups site where people were having trouble fitting a model with stan_[g]lmer and the answer was to fiddle with the hyperparameters of the decov prior.
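The variance part of the construction described above can be sketched as a prior draw. This is a rough illustration and not rstanarm's implementation: the function name is hypothetical, the symmetric Dirichlet concentration of 1 is an assumption, and the rescaling by the error standard deviation and the LKJ correlation matrix are both left out.

```python
import math
import random

def sample_decov_sds(K, shape=1.0, rate=1.0, concentration=1.0, rng=random):
    """Draw K standard deviations from a decov-style prior (sketch only)."""
    # Scale parameter: gamma prior; shape = rate = 1 gives a unit exponential.
    tau = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
    # Trace of the K x K covariance matrix is K times the squared scale.
    trace = K * tau ** 2
    # Symmetric Dirichlet draw on the simplex via normalized gamma variates.
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(K)]
    total = sum(gammas)
    simplex = [g / total for g in gammas]
    # Variances are the simplex weights times the trace.
    variances = [p * trace for p in simplex]
    return tau, [math.sqrt(v) for v in variances]

if __name__ == "__main__":
    rng = random.Random(2018)
    for K in (1, 3):
        tau, sds = sample_decov_sds(K, rng=rng)
        print(f"K = {K}: tau = {tau:.3f}, sds = {[round(s, 3) for s in sds]}")
```

With K = 1 the simplex is the single weight 1, so the draw reduces to the unit-exponential standard deviation mentioned in the comment.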

By: Carlos Ungil

Carlos Ungil — Tue, 03 Apr 2018 16:09:11 +0000

https://en.m.wikipedia.org/wiki/Young_adult_(psychology)

By: Anonymous

Anonymous — Tue, 03 Apr 2018 15:46:18 +0000

I’ve been trying to name this writing style. How about “stream of ALL consciousness”?