Prior choice recommendations wiki!

Here’s the wiki, and here’s the background:

Our statistical models are imperfect compared to the true data generating process and our complete state of knowledge (from an informational-Bayesian perspective) or the set of problems over which we wish to average our inferences (from a population-Bayesian or frequentist perspective).

The practical question here is what model to choose, or what class of models. Advice here is not typically so clear. We have a bunch of existing classes of models such as linear regression, logistic regression, along with newer things such as deep learning, and usual practice is to take a model that has been applied on similar problems and keep using it until it is obviously wrong. That’s not such a bad way to go; I’m just pointing out the informality of this aspect of model choice.

What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior, which is supposed to give good answers for any possible parameter values (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).

We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).

I used to talk about these as weakly informative prior distributions (as in this paper from 2006 on hierarchical variance parameters and this from 2008 on logistic regression coefficients) but now I just call them “prior distributions.” In just about any real problem, there’s no fully informative prior and there’s no noninformative prior; every prior contains some information and does some regularization, while still leaving some information “on the table,” as it were.

That is, all priors are weakly informative; we now must figure out where to go from there. We should be more formal about the benefits of including prior information, and the costs of overconfidence (that is, including information that we don’t really have).

We (Dan Simpson, Michael Betancourt, Aki Vehtari, and various others) have been thinking about the problem in different ways involving theory, applications, and implementation in Stan.

For now, we have this wiki on Prior Choice Recommendations. I encourage you to take a look, and feel free to post your questions, thoughts, and suggested additions in comments below.

50 thoughts on “Prior choice recommendations wiki!”

  1. I think this is a typo in the first section:

    “Specific informative prior: normal(0.4, 0.2) or whatever. Sometimes this can be expressed as a scaling followed by a generic prior: theta = 0.4 + 0.2*z; z ~ normal(0.4, 0.2);”

    should be “… z ~ normal(0, 1);”?

    Also this page looks helpful!

  2. The prior choice wiki is fantastic–and quite enjoyable reading.

    One thing I’ve only recently started getting a handle on is how poorly chosen priors interact with the model, especially for more poorly identified parameters. And so I’ve added a step to my workflow to hopefully catch these issues earlier on. The basic idea is to draw some “known” parameters from the prior, then simulate some data given conditioning information and the “known” parameters, and estimate my model on the simulated data.

    That’s not revolutionary. What is really cool is how easy Stan makes it to do this well even for extremely complex models, where writing the data generating process out in R, and keeping it in line with whatever changes you make to your model, is really difficult.

    All you do is something like this:

    data {
      int N;
      vector[N] Y;
      int sim; // indicator for whether we want to run the model in simulation mode, in which case Y won't affect the parameter estimates. Use this to generate fake data
    }
    parameters {
      real mu;
      real<lower=0> sigma;  // constrained positive, so the cauchy prior below acts as a half-Cauchy
    }
    model {
      mu ~ normal(0, 1);
      sigma ~ cauchy(0, 1);

      if (sim == 0) {
        Y ~ normal(mu, sigma);  // only condition on the data when not in simulation mode
      }
    }
    generated quantities {
      vector[N] y_sim;
      for (n in 1:N) {
        y_sim[n] = normal_rng(mu, sigma);  // fake data drawn from the current parameter values
      }
    }
    

    Now if I run this model with sim = 1, it will generate many fake datasets for me _without conditioning the parameters on the data_. That’s the same as drawing from the priors and simulating some outcomes. I can use any of these fake datasets to check to see if my model is working nicely. I can also check to see whether the outcomes across draws are stupid, in which case my priors or likelihood or structure is stupid. And most importantly, if I make a change to my model, I don’t have to keep track of two sets of code (one data generating process in R, one estimation model in Stan). I have no idea why I’ve not been doing this forever.

    Using this approach I found that for a recent model I’ve been working on, a lognormal prior was insane while a truncated normal prior was sensible. Saved me a lot of grief.

    • Yep, the prior and the data model together imply a prior predictive data distribution… and if you have a better idea of what your data are likely to look like than of what your prior should look like directly, you can tune your priors until the prior predictive data look like something that you really do think your data might look like.

      Similarly, I’ve used a scenario where I have some function f(x, P) where P is a parameter vector, usually a basis expansion weight vector. I don’t know much about how I should move the various components of P around in such a way as to produce the kinds of shapes I want… but I do know something about the kinds of shapes.

      If I put flat priors on components of P, and then weight them all by a likelihood over some observed statistic about the shape of f… such as

      m = integrate(f(x,P),x,-inf,inf);
      m ~ normal(0,10);

      integrate((f(x,P)-m)^2,x,-inf,inf) ~ gamma(5,5/10.0);

      and then generate prior predictive functions f and graph them… I can induce a sufficiently good prior on P as to constrain the functions to be within a certain range of shapes, and then apply data and get updated distribution over the P parameters.

      This is my hack to get around the fact that gaussian process regression on many data points is computationally hugely expensive.
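
      Roughly, the same trick written out in Stan, approximating the integrals on a fixed grid. Everything here (the polynomial basis, the grid, the number of weights) is just a made-up illustration, and Stan will warn about sampling statements on transformed parameters, which is expected since the whole point is to reweight the flat priors on P:

      functions {
        // toy basis expansion: f(x) = sum_k P[k] * x^(k-1), evaluated on a grid
        vector f_on_grid(vector x, vector P) {
          vector[rows(x)] out = rep_vector(0, rows(x));
          for (g in 1:rows(x))
            for (k in 1:rows(P))
              out[g] += P[k] * x[g]^(k - 1);
          return out;
        }
      }
      data {
        int<lower=2> G;        // number of grid points used to approximate the integrals
        vector[G] x_grid;
        int<lower=1> K;        // number of basis weights
      }
      parameters {
        vector[K] P;           // basis weights, left with flat priors
      }
      transformed parameters {
        vector[G] f = f_on_grid(x_grid, P);
        real m = mean(f);      // grid stand-in for integrate(f(x,P), x, -inf, inf)
        real s2 = variance(f); // grid stand-in for integrate((f(x,P)-m)^2, x, -inf, inf)
      }
      model {
        // reweight the flat priors on P through summaries of the shape of f
        m ~ normal(0, 10);
        s2 ~ gamma(5, 5 / 10.0);
      }

      Plotting draws of f from this model is then the prior predictive check over shapes.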

      • Exactly. I’ve been working with high-dimensional discrete choice models lately (into the thousands of choices), where even a moderately wide prior on latent quality/demand can result in one choice having close to 100% probability. Without that simulation step (over not one draw from the priors but many) you really don’t learn to see how implausible some prior scales are.

  3. About the last dot of the “5 levels of priors”. I reckon…

    “theta = 0.4 + 0.2*z; z ~ normal(0.4, 0.2);”

    …should be

    “theta = 0.4 + 0.2*z; z ~ normal(0, 1);”

    Or am I completely lost? I usually am! Ah, well, so what, at least today I had the fortune of getting to have some carrot pancakes.

  4. This is great!

    Will the wiki have a discussion board? I have a question regarding finding a general method for coming up with a set of priors in what I believe must be a very common situation, and I think the wiki would be great place to pose the question.

    Or should I pose the question here? Or is the wiki not taking questions (totally reasonable; everyone has finite bandwidth in this world!)?

  5. If you’re going to say something like “we don’t like entropy,” you ought to at least hint as to why. Think about a user who is a student, or a mathematician.

    • Dan:

      I guess so. Just to clarify: the primary audience of the wiki is us, as we try to establish general principles and recommendations. We’re not so interested in talking about things we’re not going to do. Whatever advice makes its way into a general paper, we will try to be clearer about.

      • Thanks for this great wiki!

        I totally understand your point of view; however, I would love to understand the cons. So please do write that paper, or maybe add an optional, supplementary wiki page about this topic?

        Best wishes
        Jens

  6. I have the impression that you despise Bayes factors, but seeing as Bayesian model selection/hypothesis testing is more sensitive to the prior distribution of a parameter than Bayesian parameter estimation is, perhaps you could write a section or so about how these recommendations fare with regards to Bayes factors? I, as someone who’s working within good ol’ classical p-value psych, and just now learning Bayesian statistics, would find that immensely helpful (my peers can’t fathom something unless it has a p-value, or at the very least a Bayesian equivalent, next to it).

    • Hi Anonymouse, I understand your quandary. Although you can find things that seem like Bayesian equivalents to p-values, they don’t really work the same way, and they’re of dubious usefulness anyhow. I recommend shifting from a mindset of significance testing to one of estimating effects/parameters and their uncertainty. When explained clearly, I’ve had some success with this approach with similarly trained audiences. The key, as we discussed here recently, is to try and undermine the binary yes/no thinking that has come to dominate statistical education…

  7. A mistake I have seen quite sophisticated people make is to focus prior elicitation on aspects about which the data are going to speak volumes, while ignoring aspects that could benefit from careful elicitation.

    A good example is a problem where you are interested in two rate parameters of Bernoulli trials \theta_1 and \theta_2. Say you are comparing a treatment and control.

    I find a lot of people like to elicit the marginal priors p(\theta_1) and p(\theta_2), where IMO it’s usually better to leave these marginals flat and impose a correlation in p(\theta_1, \theta_2), and to carefully elicit the degree of correlation. I have had some frustrating conversations where people have correctly pointed out that I ignored prior belief about the marginal, without realising that the likelihood already speaks volumes about it and it isn’t too important; in contrast, encoding some joint prior information is quite useful.

    I guess I found it hard to succinctly make a case for why I “left some prior information on the table”.

    [Also, I acknowledge that if this were a sub-model in a larger hierarchical model then the marginal likelihood of the sub-model would be relevant and you can’t ignore a careful elicitation of p(\theta_1,\theta_2)]

    While I am here, thanks for Stan – it isn’t an exaggeration to say it speeds up what I can do more than a thousand times – which is easily the difference between being able to do something or not in an applied setting.
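
    (For concreteness, a rough Stan sketch of the joint-prior idea; the data names and the particular prior numbers are just placeholders:)

    data {
      int<lower=0> n_control;
      int<lower=0> y_control;      // successes in the control arm
      int<lower=0> n_treatment;
      int<lower=0> y_treatment;    // successes in the treatment arm
    }
    parameters {
      vector[2] logit_theta;       // log-odds of success for control and treatment
    }
    model {
      // vague marginals but a strong prior correlation tying the two rates together
      matrix[2, 2] Sigma;
      real tau = 1.6;              // marginal sd on the logit scale: fairly vague
      real rho = 0.95;             // the carefully elicited piece: how similar are the two arms?
      Sigma[1, 1] = square(tau);
      Sigma[2, 2] = square(tau);
      Sigma[1, 2] = rho * square(tau);
      Sigma[2, 1] = Sigma[1, 2];
      logit_theta ~ multi_normal(rep_vector(0, 2), Sigma);
      y_control ~ binomial_logit(n_control, logit_theta[1]);
      y_treatment ~ binomial_logit(n_treatment, logit_theta[2]);
    }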

    • David:

      1. In your example, it could also make sense to reparameterize, for example letting theta be the rate for the control, and theta + delta being the rate for the treatment, then putting priors on theta and delta.

      2. I’m glad Stan has been useful to you. We’ve worked so hard on Stan and we love to hear that it helps people.
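
      For what it’s worth, a minimal Stan sketch of the reparameterization in point 1, shifting the additive parameterization to the log-odds scale so the treatment rate stays in (0, 1); all data names and prior scales here are made up:

      data {
        int<lower=0> n_control;
        int<lower=0> y_control;
        int<lower=0> n_treatment;
        int<lower=0> y_treatment;
      }
      parameters {
        real alpha;    // log-odds of success under control
        real delta;    // treatment effect on the log-odds scale
      }
      model {
        alpha ~ normal(0, 1.5);   // weakly informative about the base rate
        delta ~ normal(0, 0.5);   // the prior most worth eliciting carefully
        y_control ~ binomial_logit(n_control, alpha);
        y_treatment ~ binomial_logit(n_treatment, alpha + delta);
      }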

  8. In my opinion, handling weakly-informative priors is a bit more cumbersome than it could be in Stan, especially when dealing with parameters in physical units. It would be much more natural to have scale-full parameters, but tell Stan the typical range of each parameter. This is especially true since priors go in the Stan script but initialization is banished to the driver.

    An example where I have had bugs is “real melting_temperature”, where the melting_temperature could have the weakly-informative prior “melting_temperature ~ normal(320., 40.)”. I have had such models behave pathologically when I fail to initialize such variables. It would centralize the information to specify “real<lower=0, typical_range=(280,360)> melting_temperature”, and Stan could use the typical_range information for initialization. Even nicer if I could specify “real<lower=0, typical_range=(280,360), add_prior_from_typical_range=true> melting_temperature” and have Stan implicitly add a weakly-informative prior to the log likelihood.

    Currently I can achieve a similar effect with an extra line for parameter transformation, but it clutters both the Stan scripts and the automated reporting after the runs. I think a lot of what you want to achieve with weakly-informative priors can be achieved just by having the user tell you what range of each parameter is reasonable or typical (especially when the user is a scientist who is intimidated by the idea of priors but knows exactly the reasonable range of physical quantities). The entire “Generic prior for anything” section could be subsumed by this (apart from choosing whether to use heavy tails).

    • John, I recommend taking the extra effort to convert your problem to dimensionless form. So for example just ask people for the known melting point of some similar compound or use the boiling point of water or recast the model in terms of kinetic energy instead of temperature and compare the kinetic energy to the heat of fusion of some known compound or whatever.

      There are lots of good reasons to do this: one is that it helps stabilize the Stan model by making the mass matrix have unit scale; another is that often you can actually eliminate parameters or eliminate correlations through dimensional analysis. See for example the book “Scaling” by Barenblatt for lots of insight.

    • That is exactly what Andrew’s been saying in the Stan developer meetings. We are considering some alternative syntax exactly to let users indicate scales and locations. It is very tied up with using scale-location types of priors and it has the downside of mixing variable declarations and the density spec together. Still no plans.

      One thing I have wanted to add for years is an init block. Again, mixing this with parameter declarations directly would be messy. This became much lower priority in Stan 2 when Michael improved stability of warmup. If warmup were better we would not need scaling for efficiency (still need noncentered params).

    • You are confusing prior specification with computation.

      A certain class of weakly informative priors require defining a scale separating reasonable values and unreasonable values. For example, “100 C is possible,” whereas “if we saw 1000 C then we really didn’t understand the experiment.” In both cases “0 C” is taken as the base value for which there is no effect. From this perspective, a normal(320, 40) prior would not be considered weakly-informative but rather straight up informative. There are important technical details here that we are still working out, but do not confuse “weakly informative priors” with just “specifying ranges”.

      We advocate rescaling the problem in natural units such that the scale separating reasonable and unreasonable is of order 1 (with 0 still identifying the base model). In that case the user specifies weakly-informative priors that look like defaults, for example N^{+}(0, 1) or C^{+}(0, 1). But, again, they’re not defaults until the user has specified the right units.

      Now here’s the important issue: the scale in a weakly-informative prior need not be related to the length scales in the posterior distribution! The likelihood typically induces nontrivial curvature in the posterior unrelated to the prior scales, and to ensure efficient exploration one has to identify those induced scales. This can almost never be done analytically, which is why we rely on adaptation routines to learn them empirically.
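
      A toy illustration of that rescaling idiom (all names and numbers here are invented): suppose the outcome has already been divided by whatever scale separates reasonable from unreasonable effects, so that 1 is the boundary.

      data {
        int<lower=1> N;
        vector[N] y_scaled;     // outcome, already divided by the chosen scale
        vector[N] x;            // predictor, also rescaled to be order 1
      }
      parameters {
        real beta;              // effect in the rescaled units
        real<lower=0> sigma;    // residual scale
      }
      model {
        beta ~ normal(0, 1);    // looks like a default, but only after the units are right
        sigma ~ normal(0, 1);   // half-normal N+(0, 1) via the lower=0 constraint
        y_scaled ~ normal(beta * x, sigma);
      }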

      • I agree that weakly informative priors are not just ranges; in particular tail behavior matters.

        I just noticed that the blog ate my “lower=0, typical_range=(280,360), add_prior_from_typical_range=true” after the real because I put it in angle brackets. There is no way for Stan to guess any of this, but it could be easier to specify an implicit linear transform in the variable declaration. I am really advocating telling Stan the units of the variable, and then having an easy method to add one or another standard prior based on those units.

        That being said, I am a bit confused by the line between weakly-informative and just informative. Perhaps the issue is with my example. I was thinking in Kelvin, so that [273,373] is the range between freezing and boiling of water, and I am thinking of the unfolding (“melting”) temperature of proteins. In this case, a prior that specified that the unfolding temperature of a protein is typically between the freezing and boiling of water would be considered weakly-informative, right?

        Yes, I make exactly this curvature point in another comment.

        • No, I think a prior specifying that the temperature is less than the surface temperature of the sun is “weakly informative”, a prior specifying that the temperature is probably less than 10^9 kelvin is “uninformative”, and specifying that the parameter is normal(320,40) is actually “informative”: this is real, solid information you’re putting in.

          In my recommendation I wasn’t just referring to rescaling the variables to be order 1 (though I fully agree with this); I was also referring to doing the job of eliminating redundant parameters through dimensional analysis, a somewhat more involved but very useful and informative process.

          For example, suppose you have two proteins and the main question of interest is how they compare. You could specify a model where you give each protein’s melting point a prior, and do inference on each one, and then use posterior samples to compute T2/T1 for example.

          Or, you could specify that you are measuring temperature in units of the melting point of protein 1, and do inference directly on the *single* parameter Rt = the ratio of the two temperatures. To the extent that this is the main thing you care about, the problem is now a uni-variate problem instead of a bivariate problem with strong nonlinear correlated geometry in the 2 dimensions.

        • I think this is just a case where I used an example where I have too much subject knowledge. In proteins, 200 C might as well be the surface temperature of the sun. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC23458/, where 200 C melting temperature is considered an extremely high temperature.

          I am unsure whether taking the ratio of the variables increases efficiency unless the likelihood depends strongly on that ratio rather than on the individual variables. In the case of independent experiments with no shared parameters, I can’t see how the ratio could improve sampling. In cases where the ratio is the controlling quantity, offers variance reduction, or has a more natural prior, it should certainly be used.

        • Also, speaking of standardizing variables, the user cannot always easily do the rescaling in case lower or upper bounds are involved. The proper linear transform should occur in the implicit unconstrained coordinates. In the case “real(lower=0) temperature”, I should do “real(lower=0) std_temperature; temperature = 319 * std_temperature^0.156” so that [-1,1] in the implicit unconstrained coordinates corresponds to [273,373] in the temperature. This is a bit of work and requires an additional jacobian correction since it is nonlinear. It would take even more work if the variable had both lower and upper bounds.

        • Somehow it feels like you’re overthinking things, what’s wrong with:

          data {
             real measuredtemp;            // observed melting temperature, in Kelvin
             real<lower=0> meas_err;       // known measurement error (sd), in Kelvin
          }
          parameters {
             real melttemp_nd;             // melting temperature in nondimensional units
          }
          transformed parameters {
             real melttemp_kelv;
             melttemp_kelv = melttemp_nd * 320;   // back to Kelvin for the likelihood
          }
          model {
             melttemp_nd ~ normal(1, .1);
             measuredtemp ~ normal(melttemp_kelv, meas_err);
          }
          
        • Nothing is necessarily wrong with it, although it takes a bit of math and information from two possibly far away source lines to note that 0.1*320 = 32, increasing the likelihood of a stupid bug that invalidates your run. Also, it is taking you 4 separate source lines in 3 separate blocks just to express “I want a temperature parameter that is probably but not certainly in the range (273,373)”.

        • The dimensionless ratio, melttemp_nd, is really the *true* parameter. Dimensionally you’re always free to measure in whatever unit you want: Kelvin, microKelvin, Fahrenheit, Celsius, decaCelsius… volts output by my thermistor circuit, whatever… This could change the scale of your parameter by 10^36 if you like, whereas the fundamental quantity is unchanged.

          The thing about using dimensional units is that experimentalists can understand them because they read them off their instruments… the issue is thorny for communication purposes, but very clear from a mathematical modeling standpoint: the underlying meaningful quantity must be dimensionless, because the free choice of units can’t affect the physics.

        • Daniel, where is the dimensionless ratio melttemp_nd coming from? It seems to be the temperature of interest divided by the arbitrary value “320K”. You could also divide by 1K and say that’s the *true* parameter, right?

        • Carlos: the problem is that I don’t really know the details so I can’t be more helpful. In all likelihood the “true” parameter of interest is something like the melting temperature of one protein as a fraction of the other, or the melting temperature of the protein times the boltzmann constant as a fraction of some binding energy of a certain molecular bond, or something.

          In my example case, the “true” parameter is the actual temperature as a fraction of the maximum a-priori guess for the temperature, expressed in whatever units you like. The point is it has to be free of some arbitrary unit chosen by a metrology committee in order to be internally consistent.

          In most cases what is needed is a choice of a scale parameter based on some specific outside information, so that the parameter either becomes defined to be 1 (in which case you don’t need to estimate it, but it automatically sets the scale of some other parameter in your model), or becomes O(1) by construction because you chose a reference that is about the same size.

          “Eliminate arbitrary symmetry and make everything O(1)” is the motto in dimensional analysis.

  9. Dimensionless forms can have advantages, but working with dimensionless variables can mask unphysical behavior, because you don’t know if T_scaled = 0.8 is reasonable or not for a protein melting transition. I wrote a protein simulation program that works in “natural” temperature units, and it has been problematic for experimentalists’ intuition when working with the program. When problems show up, it is often because they simulated protein folding at -30 C. While normalizing everything to be on [-1,1] makes sense if you are a consulting statistician, it is not a good idea for a subject expert who is much better off working with the same quantities as they work with on a daily basis.

    I don’t think I agree with you about stabilizing the model in this case, excluding critical or power-law phenomena. The appropriate step size for Bayesian sampling depends on the curvature of the log likelihood. This is covariant with the variable units, but also depends on the number of samples and the particulars of how the model depends on each parameter. Additionally, Stan’s per-parameter adaptation should make the sampling efficiency independent of parameter scale, right? Eliminating correlations is a separate issue from parameter scales.

    If standardizing to unit-ish scale around the origin is such a good idea, why not have a feature so that Stan can do it for me? I would say it is just a bias of Stan to implicitly assume that [-1,1] is the natural scale for all parameters, especially during initialization.

    • I think you meant to say curvature of the log *posterior*; I mention this because the choice of priors can have a big impact on MCMC sampling. Also, my understanding is that yes, the tuning of the kinetic energy distribution in HMC is intended to decorrelate and rescale the target distribution to facilitate exploration. However, there is no perfect method for doing this, and the further the target is from an idealized multivariate Gaussian, the harder this is to accomplish. So it can be worthwhile to reparameterize the model to make this more efficient (in practice, this is often necessary to suppress divergences, for instance).
      But this is a question for the real experts like Andrew or Michael Betancourt so I’ll leave it there…

      • That’s right. The problem is that Stan’s initialization and regularization for estimating the inverse mass matrix is unit scale. We want it to be the posterior covariance, but it is a chicken-and-egg problem to find the typical set, mix in the typical set, and estimate the covariance. By default we will only use a diagonal mass matrix; it is hard to estimate a dense covariance, and that can tank HMC if the posterior curvature varies. Riemannian HMC does what you are thinking about but is costly per iteration. I’m excited by the prospect of using the posterior Hessian to estimate the mass matrix in Euclidean HMC. My next big project is to whip our higher-order autodiff into shape. That will give us Riemannian HMC and let us start exploring better parallel adaptation schemes.

        • Yes, I should have said curvature of the log posterior.

          Riemannian HMC certainly works if you can do it, but seems very hard to apply outside of few parameter cases (I think I remember it being O(N_param^3)). Are there variants of Riemannian HMC that work in high dimensions, maybe by assuming some sort of diagonal plus low rank structure on the mass matrix?

        • Where would you evaluate the Hessian – at the posterior mode? That seems to run counter to the whole concentration of measure problem. By the way, thanks for all the work you (and Michael Betancourt) have put into making case studies and tutorials lately! They are really great- pitched at just the right level for math-curious practitioners like me to get a better idea of what is going on under the hood…

        • Can’t do that—many of the models we work with don’t even have posterior modes. In a multivariate normal, it doesn’t matter where you take the Hessian—the result is always the precision (inverse covariance matrix). If curvature varies around the posterior, then we want an average (and will have to take smaller step sizes to account for not having a good global mass matrix).

          Right now, we take the draws and make a regular covariance estimate to get the inverse mass matrix. There has to be something more we can do if we have the Hessian.

        • You don’t just have the regular samples, you also have the gradient vector at many points. Could you more quickly estimate a good mass matrix by linear regression of the displacement of the sample from the mean against the gradient of the log likelihood at that sample? In the multivariate normal case, this procedure is exact after N steps where N is the number of dimensions.
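
          (The reason this is exact for a multivariate normal: the gradient of the log density is \nabla log p(x) = -\Sigma^{-1} (x - \mu), so x - \mu = -\Sigma \nabla log p(x); regressing displacements on gradients therefore recovers -\Sigma exactly once the gradients span all N directions.)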

        • Yeah, that makes sense; I was indeed thinking of cases where the posterior departs strongly from the ideal MVN scenario. How would using the Hessian (say, over several evaluations where curvature varies) differ from the updating and tuning of the inverse Euclidean metric that is already done in Stan?

  10. Even if you didn’t do the automatic scale-location prior, you could still have the typical_range in the variable declaration. The semantics would be an implicit linear variable transform, just like you already have implicit transforms for hard constraints (lower=0, etc.). It would help both initialization and the mass matrix while keeping it clean for the user. As an aside, I consider (lower,upper) more natural than (location,scale), and the former is less likely to make the user think a prior is added automatically. Another knock-on benefit would be to add a few horizontal lines in the traceplot for the user-specified typical region, so the user can see unphysical behavior at a glance.

    • I think you’re suggesting the same thing as Andrew wanted with a location and scale. And that’s easy to explain to users.

      If you want hard upper and lower bounds, we already have that. But we generally don’t recommend them if it’s not a hard constraint in the model’s logic.

      Stan defaults to a uniform prior over parameter values meeting the constraint, which may be improper, but would be proper with lower and upper bounds.

      How would a soft lower/upper bound translate to a rescaling?

      • I will use parentheses instead of angle brackets in the pseudo-stan code to avoid the blog’s ire.

        I would propose that soft constraints behave in the following way:

        “real(soft_lower=300, soft_upper=400) x” would be equivalent to “real x_std; x = 350 + 50*x_std;”. The x_std would be an implicit variable, exactly the same mechanism you use to handle hard constraints (but obviously this does not change the expressiveness of the model). Then, when you initialize x_std in the normal way, x will get initialized around the proper range. Additionally, the regularization of the mass matrix would behave naturally regardless of variable scales, since x_std is around [-1,1]. Even better if I could say something like “real(soft_lower=300, soft_upper=400, add_weak_prior=true) x”, which should only be valid if soft ranges are given.

        In the case of constrained variables, this should be a linear transformation before the nonlinearity is applied. “real(soft_lower=300, soft_upper=400, lower=0) y” would be equivalent to “real y_std; y = exp( 0.5*(log(300)+log(400)) + 0.5*(log(400)-log(300))*y_std)” with the appropriate Jacobian correction. This similarly puts the variable around the correct range and never violates the constraint. You could still have the add_weak_prior functionality, applying to y_std.
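
        For what it’s worth, the second example can already be written out by hand in current Stan; a rough sketch (no Jacobian term is needed here because the only prior is placed on y_std directly):

        parameters {
          real y_std;   // unconstrained and roughly unit scale, which is what the sampler sees
        }
        transformed parameters {
          // hand-rolled version of the proposed real(soft_lower=300, soft_upper=400, lower=0) y
          real<lower=0> y = exp(0.5 * (log(300) + log(400))
                                + 0.5 * (log(400) - log(300)) * y_std);
        }
        model {
          y_std ~ normal(0, 1);   // plays the role of add_weak_prior=true
        }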

      • Bob, I’d be all over the idea of an “init” block right after the parameters block. One of the main reasons I think this is a great thing is that when you init externally, you have to do so in a way that isn’t connected to the code, so when you change the dimensions of things, or you reparameterize etc you’re outside your stan code trying to remember what’s inside the code, and failing. Also, the fact that you have to give separate inits for each chain is annoying, whereas inside stan this is implicit, each chain is initted by the same mechanism.

        In this block you’d allow the use of the _rng functions, and you could easily init the vars according to the priors. I would suggest though that it be possible to pass in external inits and if this is done, to skip the in-code init step. That would help with initting from one of the samples from a previous run.
