That comment and the overall post have made me less uncertain that it is really all about getting a good enough (for exactly what?) probabilistic representation (model) of how the data came about and ended up being accessible to you (the analyst).

(e.g. “how you should best represent what you plan to act upon prior to acting in the world. The “how one should best represent” to profitably advance inquiry being logic” http://andrewgelman.com/2017/09/27/value-set-act-represent-possibly-act-upon-aesthetics-ethics-logic/ )

Unfortunately, the bad metaphysics that supports the idea that there must/should be one perfect/true/best model leads many into an endless dizzying spiral towards the black hole of certainty, which can never be reached (as it presupposes direct access or correspondence to reality).

An example that comes to mind is Dennis Lindley’s quest for an axiom system for statistical inference, along with his being overly reluctant to break from it https://www.youtube.com/watch?v=cgclGi8yEu4 (e.g. at around 18:45).

As for the case where you’re pretty sure there is over-dispersion, the base model at zero may still do well (in particular it’s useful for the case where your sample doesn’t show much over-dispersion). Alternatively, you might want to put the base model somewhere else in the space and build a PC prior off that. An example where this has been done for the correlation parameter in a bivariate normal distribution is here (https://arxiv.org/pdf/1512.06217.pdf).

I think you’ll be fine in this case as long as your tail isn’t too light. Maybe a Student-t-7 would be a good idea. But the end point is you should try a couple of priors and see how they go on some existing data. You should also simulate some data that you think is realistic but isn’t near the base model, and see how the prior performs. If there’s anything that I wish I’d brought out more in the post, it’s the idea that we shouldn’t be looking for the one perfect prior, but rather a set of “good enough” priors that we can compare and check.
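To make "as long as your tail isn’t too light" a bit more concrete, here is a minimal sketch that compares how much prior mass three candidate priors leave beyond a given value of the standard deviation. The unit scales and the cut-off of 3 are assumptions picked purely for illustration, not values from the post, and the function names are hypothetical.

```python
import math

def half_normal_tail(x, scale=1.0):
    """P(sigma > x) under a half-normal prior (scale is an assumed value)."""
    return math.erfc(x / (scale * math.sqrt(2.0)))

def exponential_tail(x, rate=1.0):
    """P(sigma > x) under an exponential prior (rate is an assumed value)."""
    return math.exp(-rate * x)

def student_t_pdf(x, df):
    """Density of a standard Student-t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def half_t_tail(x, df, scale=1.0, n=2000):
    """P(sigma > x) under a half-t prior, via the trapezoid rule on [0, x/scale]."""
    upper = x / scale
    h = upper / n
    total = 0.5 * (student_t_pdf(0.0, df) + student_t_pdf(upper, df))
    for i in range(1, n):
        total += student_t_pdf(i * h, df)
    # The half-t folds the t density, so the mass on [0, upper] is doubled.
    return 1.0 - 2.0 * h * total

if __name__ == "__main__":
    cutoff = 3.0  # illustrative threshold, not from the post
    print("half-normal :", half_normal_tail(cutoff))
    print("half-t(7)   :", half_t_tail(cutoff, df=7))
    print("exponential :", exponential_tail(cutoff))
```

With these assumed scales the half-normal leaves the least mass beyond the cut-off and the exponential the most, with the half-t with 7 degrees of freedom in between. This only describes the priors; it is not a substitute for fitting simulated and existing data as suggested above.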

The exponential seems to do much better, as does the t with 3-7 degrees of freedom (although lower dof has a higher chance of giving divergences).

In any case… We are sort of in a funny situation, where all prior information strongly suggests that the dispersion parameter is >0. Usually the over-dispersion parameter is estimated to be >0 even in relatively large studies, which seems logical to me when we look at medical events happening to patients and we do not put much information on the patients in the model. I suspect our prior clearly should not prevent the model from finding the case where there are no random effects, but one of the worries is really that the prior should definitely not favor it (or values of the over-dispersion parameter near 0) “too much”. Whatever that means, the concern is that we would get an inappropriately precise / insufficiently uncertain estimate of any treatment effect (if we are talking about a randomized controlled clinical trial) if we concentrate too much posterior mass near zero for the over-dispersion parameter.

I wonder what kind of prior would work sensibly as a weakly informative prior in this sort of setting… Half-normal (or half-T) on the untransformed dispersion parameter (=quite flat towards zero)?!
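The worry about an "inappropriately precise" treatment effect can be sketched numerically. The comment does not name the likelihood, so this assumes a negative binomial count model with variance mu + phi * mu^2 (the usual NB2 form), where phi = 0 recovers the Poisson. The function names and the values of mu and phi are hypothetical.

```python
import math

def nb2_variance(mu, phi):
    """Variance of a count with mean mu under the assumed NB2 form."""
    return mu + phi * mu ** 2

def se_understatement(mu, phi):
    """Factor by which the true standard error of a mean count exceeds
    the one implied by forcing the over-dispersion parameter to zero."""
    return math.sqrt(nb2_variance(mu, phi) / nb2_variance(mu, 0.0))

if __name__ == "__main__":
    mu = 2.0  # assumed mean event count per patient
    for phi in (0.0, 0.25, 0.5, 1.0):  # assumed over-dispersion values
        print(f"phi={phi:4.2f}  true SE / Poisson SE = {se_understatement(mu, phi):.3f}")
```

Under this assumption the factor is sqrt(1 + phi * mu), so a posterior that piles up near phi = 0 when the data really are over-dispersed gives intervals that are too narrow by roughly that factor.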

> The first thing is that it should peak at zero and go down as the standard deviation increases. Why? Because we need to ensure that our prior doesn’t prevent the model from easily finding the case where the random effect should not be in the model. The easiest way to ensure this is to have the prior decay away from zero.

That makes some sense in theory. But if there is any posterior mass in a small neighborhood of zero in the constrained space then there is mass out to negative infinity in the unconstrained space, and there will probably be divergent transition warnings from Stan in practice. So, it seems that you have to thread a needle where you are choosing a prior with a peak at zero in order to get a posterior whose mass is bounded away from zero but concentrated enough near zero that you discover that you are better off without that part of the model.
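A rough sketch of the point about the unconstrained space: Stan samples a positive-constrained parameter on the log scale, so any mass just above zero for the standard deviation is spread over an interval that runs to negative infinity in log(sigma). The half-normal density below is only an assumed stand-in for whatever distribution has mass near zero, and the function names are hypothetical.

```python
import math

def half_normal_cdf(sigma, scale=1.0):
    """P(sigma' <= sigma) for a half-normal with an assumed scale."""
    return math.erf(sigma / (scale * math.sqrt(2.0)))

def log_sigma_density(u, scale=1.0):
    """Density of u = log(sigma), including the Jacobian term exp(u)."""
    sigma = math.exp(u)
    pdf = math.sqrt(2.0 / math.pi) / scale * math.exp(-0.5 * (sigma / scale) ** 2)
    return pdf * sigma

if __name__ == "__main__":
    # Mass below sigma = exp(u) is small but never zero, however negative u gets.
    for u in (-2.0, -5.0, -10.0, -20.0):
        print(f"u={u:6.1f}  sigma={math.exp(u):.2e}  "
              f"mass below={half_normal_cdf(math.exp(u)):.2e}  "
              f"density in u={log_sigma_density(u):.2e}")
```

The density in u decays only like exp(u) as u goes to negative infinity, so there is a long thin left tail on the log scale, which is the sort of geometry that tends to go along with the divergent transition warnings mentioned above.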

Hodges and Clayton may redefine effects, but there is still the problem of “random”. What if the effects are deterministic but unknown? In general I prefer “unknown parameters” instead of “random parameters”.

Now, if you were distributing the total precision across the simplex, I would probably feel differently. An example of a model that does this (or something similar) is the Leroux model in spatial statistics, which I am not fond of. (See equation 3 in this paper https://arxiv.org/pdf/1601.01180.pdf)

Dan seems to prefer a scaled simplex for the vector of K standard deviations, which is almost the same thing. In any event, the decov prior seems to work well and we have had approximately zero questions on Discourse or the old Google Groups site where people were having trouble fitting a model with stan_[g]lmer and the answer was to fiddle with the hyperparameters of the decov prior.
