Dave Clark writes:

I was hoping for your opinion on a topic related to hierarchical models.

I am an actuary and have generally worked with the concept of hierarchical models in the context of credibility theory. The text by Bühlmann and Gisler (A Course in Credibility Theory; Springer) sets up the mixed models under the idea of empirical Bayes formulas, using linear approximations that require only variances. These variances can be calculated directly, without the need for iteration or MCMC.

Formulas 2.8-2.10 of the paper “Combining generalized linear models and credibility models in practice” by Esbjorn Ohlsson (Scandinavian Actuarial Journal, 2008, 4, 301-314) give this result in the most basic case. In business applications, this seems a practical way to implement the theory.

Have you done work on this direct calculation method, or have any concerns on accuracy of results?
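For readers unfamiliar with the direct calculation Clark is describing: the balanced Bühlmann credibility estimator shrinks each group's mean toward the grand mean, with a weight computed directly from the within-group and between-group variances, so no iteration is needed. A minimal sketch (the function name and toy data are illustrative, not from Clark's application):

```python
import numpy as np

def buhlmann_credibility(groups):
    """Balanced Buhlmann credibility estimates.

    groups: list of equal-length 1-D arrays, one per risk class.
    Returns (credibility-weighted estimates, credibility factor Z).
    """
    X = np.asarray(groups, dtype=float)        # shape (K classes, n obs)
    K, n = X.shape
    group_means = X.mean(axis=1)
    grand_mean = group_means.mean()

    # Expected process variance: average within-class sample variance.
    sigma2 = X.var(axis=1, ddof=1).mean()

    # Variance of the hypothetical means: between-class variance of the
    # observed means, corrected for their own sampling noise.
    tau2 = max(group_means.var(ddof=1) - sigma2 / n, 0.0)

    if tau2 == 0.0:
        Z = 0.0                                # no between-class signal
    else:
        Z = n / (n + sigma2 / tau2)            # credibility factor

    # Simple weighted average: shrink each class mean toward the grand mean.
    return Z * group_means + (1 - Z) * grand_mean, Z
```

The output is exactly the "simple weighted average" format Clark mentions later: each class gets `Z` weight on its own experience and `1 - Z` on the collective.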

I replied: The quick answer is that with modern programs such as Stan, it is not such a big deal to just fit the full model. If you need an approximation for reasons of computing speed, this is fine, but usually I’d just prefer to fit the full Bayesian model directly so I don’t have to worry about these approximations. In addition, point estimates of variance parameters can be noisy; this is another motivation to go full Bayes which will account for all the uncertainty in the model. If you are doing point estimation, it can be useful to do some regularization to avoid some of the biggest problems with noisy estimates; see these two papers:

http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf

http://www.stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf

Clark followed up:

The main attraction of the direct solution, rather than Bayesian simulation or ML iteration, is the convenience in a business environment. We can get a “real time” answer, which is helpful when we are in the middle of a pricing negotiation. Also, the “credibility” format is easy to communicate to non-technical users, since it easily translates into simple weighted averages, and users even know how to overwrite model results when they need to. As the full Bayesian approaches become more accessible, I do hope to move in that direction.

Clark’s response is a common argument against iterative algorithms (although I think iterative algorithms have been generally accepted in industry and the sciences since Fisher, 1925). Restricting attention to models with analytic solutions limits both the class of models considered and the accuracy of their estimators; I think justifying those restrictions is harder to communicate to non-technical users than simply saying that the estimate has no closed form.

Aren’t “real time” and “closed form solution” fundamentally orthogonal concepts these days?

That is: given a well-designed algorithm, appropriate approximations, and a robust IT system, couldn’t a black-box, packaged Bayesian system give answers that feel essentially real-time?

To someone in Clark’s shoes, it may be the real-time responsiveness that counts most. And that ought to be a fixable problem as our methods and computer hardware advance.

Also, real-time need not mean instantaneous. Having seen the archaic legacy IT systems most insurers are using, I suspect a 10-second delay in pricing would feel to most users like business as usual.

I’m interpreting “credibility” here as a closed-form solution, given his description of simple weighted averages. Certainly the ability to produce an answer at any time, one that progressively improves as more time is allowed, favors iterative algorithms. Imagine solving the normal equations, which yield no answer until the very end, after inverting a possibly large matrix; contrast that with gradient descent, which always has an answer ready. There is a classic comparison of iterative vs. direct solvers along exactly these lines in numerical analysis.
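The direct-vs-iterative contrast in the comment above can be made concrete on a toy least-squares problem (the data here are made up for illustration): the direct route via the normal equations produces nothing until the solve completes, while plain gradient descent has a usable, steadily improving estimate after every step.

```python
import numpy as np

# Toy least-squares problem: recover x_true from noisy observations.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
x_true = np.arange(1.0, 6.0)
b = A @ x_true + 0.01 * rng.normal(size=200)

# Direct route: nothing usable until the normal equations are solved.
x_direct = np.linalg.solve(A.T @ A, A.T @ b)

# Iterative route: a usable, improving answer is available at every step.
x_gd = np.zeros(5)
step = 1.0 / np.linalg.norm(A.T @ A, 2)    # step size small enough to converge
residuals = []
for _ in range(500):
    x_gd -= step * (A.T @ (A @ x_gd - b))  # gradient of 0.5 * ||Ax - b||^2
    residuals.append(np.linalg.norm(A @ x_gd - b))
```

Here `residuals` traces the "anytime" behavior: stop the loop early and you still have an answer, just a rougher one. Run it long enough and the iterative estimate matches the direct solve.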

IMO, this question ought to get more attention than it does: when to use simpler empirical Bayes methods vs. hierarchical Bayes and MCMC. Thanks for the post and the references above. Berger’s book “Statistical Decision Theory and Bayesian Analysis” also gets into this a little, where he does some comparisons (see his section 4.6.4). In his examples, and in my own experience, the two methods yield virtually identical results for the “lower level” parameters. And differences in “upper model” variances can be all but eliminated by tweaking one’s hyperprior (in the HB case) and/or adding a hyperprior to the empirical Bayes prior (and calling it semi-empirical?).

Moreover, some of my models for clients run into the millions of parameters. After running such models all night via MCMC, I can’t afford to examine the results in the morning only to say, “Oops! That hyperprior was too loose. Let’s run it again.” Now, to be fair, I’ve not run such large models in Stan; only in SAS’s PROC MCMC and home-grown R code. Hence we often rely on older HB methods in such practical work, and we have no problem communicating the results to business clients.

Stepping back, I’m a huge fan of Stan and its competitors. The tool is a huge advance and, as we all know, permits the estimation of models long thought impossible (including those that HB methods can’t handle). But I’ve been around long enough to see how methodological pendulums can swing in extreme directions, at least in the social sciences. And right now, in some social science fields, MCMC techniques do appear to be the latest shiny object. But I don’t believe they’re ALWAYS the significantly better tool.

Mnl:

I actually don’t think that point estimation of the hyperparameters is simpler than full Bayes. Point estimation can be faster to compute, and that can be important, but in my opinion it is less simple in that you don’t just have to deal with the model, you also have to deal with the approximation. If the size of the problem is small so that full Bayes is fast, I think full Bayes is the way to go, no question.

Also, we’re working on including point estimation of the hyperparameters in Stan (for computational reasons). So you should be able to have the point estimation methods you are comfortable with, in the Stan that you love.

P.S. Bayesian inference is already empirical. It is conditional on data. Using an approximation that involves point estimation does not add any “empirical” content. Historically the term “empirical Bayes” was used in contrast to a simpler Bayes in which the hyperparameters were chosen purely from prior information without reference to the data. But for several decades now we’ve had hierarchical Bayes, which can use as much or as little prior information as you’d like.

Mnl:

P.P.S. When these hierarchical models are huge, I suggest setting the hyperparameters to reasonable fixed values and running the non-hierarchical version. This should be faster and in practice can be just fine.

P.P.P.S. In any case, I expect Stan should be much faster and more stable than SAS’s PROC MCMC. If it is not, please let us know and we will look into it!