Point summary of posterior simulations?

Luke Miratrix writes:

In the applied stats class I’m teaching on hierarchical models, I’m giving the students (a mix of graduate students, many from the education school, and undergrads) a taste of Stan. I have to give them some “standard” way to turn Stan output into a point estimate (though of course I’ll also explain why credible intervals are better if they have the space to give them). But what I don’t know is what to tell them to do. What is the state of the art?

First, let’s presume we’re talking about a parameter which is more or less well-behaved: clearly identifiable, reasonably unimodal, at least a few steps along the asymptotic path towards symmetry, etc. In that case, the question is mostly just an academic one, because anything we do will come up with more-or-less the same answer. In cases like that, should we use median, mean, trimmed mean, or what?

Second, what happens when things start to break down? We might be close to the edge of parameter space, or there might be some bimodal situation where two (or more) parameters can combine in two (or more) different plausible ways, and the evidence from the data isn’t yet enough to conclusively rule one of those out. In those cases, plotting the distribution should clearly show that something is up, but what do we do about it in terms of giving a point estimate? The best answer might be “don’t”, but sometimes it can be hard to tell that to the ref… In this kind of case, the mean might be the most defensible compromise, though of course it should come with caveats.

Basically, I need to give them some simple answer that is least likely to raise eyebrows when they’re trying to publish. Bonus points if it’s also a good answer​!

My reply: If it’s one parameter at a time, I like the posterior median. Of course, in multiple dimensions the vector (median of parameter 1, median of parameter 2, . . ., median of parameter K) is not necessarily a good joint summary. So that’s a potential concern. No easy answers! There’s also some work on Bayesian point estimates. It turns out that the best prior, if you’re planning to compute a posterior mode, is not necessarily the prior you’d use for full Bayes:
http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf
http://www.stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf

Bonus: these papers have education examples!
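
For what it’s worth, here is a minimal R sketch of the kind of per-parameter summary being discussed, assuming you have a draws matrix (rows = posterior draws, columns = parameters), for example from as.matrix() applied to a stanfit object; the parameter names and numbers below are made up so the snippet runs on its own:

    # stand-in for as.matrix(fit): 4000 draws of two parameters
    set.seed(1)
    draws <- cbind(alpha = rnorm(4000, 1, 0.5),
                   beta  = rexp(4000, 2))

    point_est <- apply(draws, 2, median)   # posterior medians, one parameter at a time
    post_mean <- colMeans(draws)           # posterior means, for comparison
    ci_80 <- apply(draws, 2, quantile, probs = c(0.1, 0.9))   # 80% central intervals

    round(rbind(median = point_est, mean = post_mean, ci_80), 2)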

19 thoughts on “Point summary of posterior simulations?”

  1. Doesn’t decision theory provide a very concrete answer to this?

    In other words, given a loss function (e.g., squared error), the Bayes estimate (a point estimate) has concrete frequentist guarantees (admissibility) under very general conditions.

    The question then becomes whether a particular loss function is justifiable, which is a context-dependent issue, in my opinion.

  2. Following up from Giri’s and Andrew’s comments on issue (1), as estimators,

    • the posterior mean minimizes expected squared error (L2)
    • the posterior median minimizes expected absolute error (L1); a short simulation at the end of this comment illustrates both

    I don’t know of an expected loss function that’s minimized by the posterior mode estimator.

    To answer (2), when things break down, the mean and median maintain the above properties. Posterior modes may not exist even when posterior means and medians exist; an example is beta(theta | 0.5, 2), whose density is unbounded as theta approaches 0.

    But even when posterior modes or means exist, they are often nowhere near the high-mass volume(s) of the posterior, or near each other. For example, consider the posterior mode or mean of a bimodal mixture, or the posterior mean of a beta(0.01, 0.01), which sits at 0.5 where the density is lowest.

    This doesn’t even require asymmetric or non-convex posteriors. The multivariate normal is about as well-behaved as a posterior gets, but the mode and mean aren’t even in the typical set in high dimensions (that is, almost no randomly drawn point will be near the mode). The textbook simulation can be found in a short blog post by Rick Wicklin for SAS on the curse of dimensionality.

    We’re trying to write all this stuff up in several introductions at differing levels of mathematical sophistication.
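
    As a quick empirical check of the L1/L2 claims above (a sketch only, with made-up draws standing in for a posterior):

      # skewed "posterior" draws
      set.seed(123)
      theta <- rgamma(1e4, shape = 2, rate = 1)

      # scan candidate point estimates and compare average losses
      candidates <- seq(0.5, 3, by = 0.001)
      sq_loss  <- sapply(candidates, function(q) mean((theta - q)^2))
      abs_loss <- sapply(candidates, function(q) mean(abs(theta - q)))

      c(argmin_L2 = candidates[which.min(sq_loss)],  post_mean   = mean(theta))
      c(argmin_L1 = candidates[which.min(abs_loss)], post_median = median(theta))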

    • Bob:

      The 0/1 loss function is minimized in expectation by the posterior mode estimator. But I find all these discussions of losses to be unpleasantly theoretical. If someone wants to do a decision analysis, I’d like to see real costs and benefits, not these bogus error losses.

      Regarding your bimodal posterior density: I think the message here is not to try to evaluate different point estimates in bimodal examples, but rather to recognize that point estimates don’t generally make sense in bimodal or multimodal settings.

      Good point about the center of a multivariate normal not being typical. For one thing, it’s at the minimum value of the distance from the center! There’s no way of summarizing a distribution by a point, any more than one can summarize the U.S. population by describing a typical person, whatever that would be.

      That said, point summaries can be useful, as long as we understand their limitations.
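
      A quick sketch of the typical-set point: draw from a 100-dimensional standard normal and look at the distances from the center; essentially every draw lands near sqrt(100) = 10, and none come anywhere near the mode/mean at distance 0.

        set.seed(5)
        d <- 100
        x <- matrix(rnorm(1e4 * d), ncol = d)   # 10,000 draws from N(0, I_100)
        dist_from_center <- sqrt(rowSums(x^2))

        summary(dist_from_center)   # concentrates around sqrt(d) = 10
        min(dist_from_center)       # no draw is close to the center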

    • Formally 0/1 loss isn’t a proper loss function if the parameter space is continuous. You have to define the Ln norm and then take the limit n -> infty, but that limit doesn’t behave particularly well, which is to be expected given its dependence on the choice of parameterization!

      • It should approximately minimize a loss function that looks like 1-exp(-(Error/scale)^2) when scale is small, I’d guess. That’s kind of a regularized continuous version of the 0-1 loss.

        But I agree with Andrew: it’s useful mathematically to be able to think about these loss functions, but they usually don’t correspond to anything real-world, and starting to talk about the “right loss function” for summarizing a multimodal distribution with a point summary is really going off the deep end of theory-without-a-purpose :-)

        I personally think ggplot() + geom_density(aes(x=my_param)) is a pretty good posterior summary.
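
        Something like this, say (just a sketch; my_param and the numbers are placeholders, and with rstan you’d pull the column out of as.matrix(fit) instead of simulating it):

          library(ggplot2)

          # stand-in for one column of a draws matrix
          post <- data.frame(my_param = rnorm(4000, mean = 1.2, sd = 0.3))

          ggplot(post, aes(x = my_param)) +
            geom_density() +
            labs(x = "my_param", y = "posterior density")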

        • Daniel:

          The problem with these density plots is that typically we’re interested in a whole bunch of parameters, not just one or two or three. So it’s good to have point or interval summaries so we can compactly display inferences in a high-dimensional model.

        • I have some ideas on that too. For example, my wife had situations where she was interested in expression levels of several thousand genes. I sorted the quantities by rank order within the wild-type genotype and plotted quantity against rank, which, because of the sorting, produces a smooth curve. Then, for comparison with the other genotype, I plotted its quantities against the ranks from the original genotype, which produces a rough trace over the smooth curve and lets you see both the levels and the differences. (I was actually plotting log(quantity).)

          In general I think for high dimensional parameter vectors you could do something like spaghetti plots of quantity against the rank of the quantity within the mean vector.

        • Daniel:

          Yes, I do think there are various creative options, and the whole problem of displaying posterior inferences has been understudied, in part because there’s this horrible tradition of looking at only one parameter at a time. My point was just that, if you have a lot of parameters, it can make sense to summarize posterior distributions by point estimates and uncertainties, as in coefplot.
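
          For instance, a coefplot-style display is easy to roll by hand from a draws matrix (a sketch with made-up parameters, showing posterior medians with 50% and 90% intervals):

            library(ggplot2)

            # made-up draws matrix: 4000 draws x 8 parameters
            set.seed(2)
            draws <- sapply(1:8, function(k) rnorm(4000, mean = k / 4, sd = 0.5))
            colnames(draws) <- paste0("beta[", 1:8, "]")

            summ <- data.frame(
              param = colnames(draws),
              med  = apply(draws, 2, median),
              lo50 = apply(draws, 2, quantile, 0.25),
              hi50 = apply(draws, 2, quantile, 0.75),
              lo90 = apply(draws, 2, quantile, 0.05),
              hi90 = apply(draws, 2, quantile, 0.95)
            )

            ggplot(summ, aes(x = med, y = param)) +
              geom_linerange(aes(xmin = lo90, xmax = hi90)) +
              geom_linerange(aes(xmin = lo50, xmax = hi50), linewidth = 1.2) +
              geom_point() +
              labs(x = "posterior median, 50% and 90% intervals", y = NULL)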

        • Once I understood what that was doing, I definitely liked it. I’d say again: since the order of the parameters is arbitrary, order them by *descending* rank in the average vector, then treat them as the Fourier series of something and plot the functions.

  3. Thanks for following up on my comment, Bob. Actually, in many cases the posterior mode minimizes the Bayes risk for a 0-1 loss (a loss function that is 0 if and only if the estimator hits the truth exactly, and 1 otherwise). This is not difficult to see in the discrete parameter setting; the loss function nullifies all probability mass except for the estimator, and so if the estimator is chosen to be the parameter value with the largest probability mass (i.e., the posterior mode), this minimizes the Bayes risk, since the sum of the remaining masses is as small as possible.

    • Reading this again, I don’t like my explanation of the discrete parameter case, so perhaps this is clearer. Let L(theta_hat, theta) be a 0-1 loss, meaning it is 0 if theta_hat = theta and 1 otherwise. Then the posterior expectation of L(theta_hat, theta) is sum L(theta_hat, theta) p_theta, where p_theta is the probability mass on theta from the posterior. By the definition of L(.,.) this is the sum of p_theta over values of theta not equal to theta_hat, which equals 1 - p_theta_hat. Hence to minimize this value (the expected loss w.r.t. the posterior), we choose theta_hat such that p_theta_hat is as large as possible, which is the (a) posterior mode by definition.
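
      A tiny numerical check of that argument (a sketch, with a made-up posterior over a discrete parameter):

        theta_vals <- 1:5
        p <- c(0.10, 0.35, 0.25, 0.20, 0.10)   # posterior mass on each value

        # expected 0-1 loss of declaring theta_hat is the mass everywhere else: 1 - p[theta_hat]
        expected_loss <- 1 - p

        rbind(theta_hat = theta_vals, expected_loss = expected_loss)
        theta_vals[which.min(expected_loss)]   # 2, which is also...
        theta_vals[which.max(p)]               # ...the posterior mode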

      • Or, suppose instead of minimizing loss we talk about maximizing gain :-) because it makes things easier (fewer reversals of the sense in which things are good):

        Let G(x,x*) = exp(-((x-x*)/scale)^2), that is the gain is about 1 if x is close to the correct value x* and about 0 for values of x that are more than a few “scale” distances from x*. Suppose “scale” is small.

        Then the expected gain for choosing x* = q is integrate(G(x,q) p(x|Data), dx), which is a function of the choice of point q.

        This is the local weighted average probability density value in the vicinity of q, since G is basically a window weighting function around the point q. To maximize it, place q at the point where the locally weighted probability is highest. When scale is infinitesimal, place q at the point of maximum posterior density.
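
        In R terms (again just a sketch, on a skewed example posterior), the expected gain can be approximated by averaging G over posterior draws, and its maximizer moves toward the posterior mode as scale shrinks:

          set.seed(4)
          x <- rgamma(2e4, shape = 3, rate = 2)   # skewed "posterior" draws; mode = 1

          G <- function(x, q, scale) exp(-((x - q) / scale)^2)

          q_grid <- seq(0, 4, by = 0.01)
          best_q <- function(scale) {
            expected_gain <- sapply(q_grid, function(q) mean(G(x, q, scale)))
            q_grid[which.max(expected_gain)]
          }

          sapply(c(2, 0.5, 0.1), best_q)          # heads toward the mode (1) as scale shrinks
          c(mean = mean(x), median = median(x))   # for comparison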

  4. Just posted an example plot with the specifics of the genetics removed.

    http://models.street-artists.org/2016/05/16/high-dimensional-dataparams-as-functions/

    By plotting in rank order you put a kind of smooth structure onto the data. For parameter vectors you’d plot according to the rank order of the parameter within the *mean vector*. So your reference condition would be the overall mean, and then spaghetti plots over the mean vector would give you a sense of both the variation at a point, and the correlations between variations (a particular “shape” that the curve tends to have).
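
    A rough sketch of that in R (made-up draws; the posterior mean vector supplies the ordering, and each draw becomes one semi-transparent line over the sorted means):

      library(ggplot2)

      # made-up posterior draws: 200 draws x 500 parameters
      set.seed(3)
      K <- 500
      center <- sort(rnorm(K))                    # gives the mean vector some shape
      draws <- t(replicate(200, center + rnorm(K, sd = 0.3)))

      post_mean <- colMeans(draws)
      ord <- order(post_mean)                     # rank order within the mean vector

      df <- data.frame(
        rank  = rep(seq_len(K), times = nrow(draws)),
        value = as.vector(t(draws[, ord])),
        draw  = rep(seq_len(nrow(draws)), each = K)
      )

      ggplot(df, aes(x = rank, y = value, group = draw)) +
        geom_line(alpha = 0.05) +                 # spaghetti of posterior draws
        geom_line(data = data.frame(rank = seq_len(K), value = post_mean[ord], draw = 0),
                  colour = "red") +               # smooth curve of sorted posterior means
        labs(x = "rank of parameter within posterior mean vector", y = "parameter value")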
