Dave Judkins writes:

I would love to see a blog entry on this article, Bayesian Model Selection in High-Dimensional Settings, by Valen Johnson and David Rossell. The simulation results are very encouraging although the choice of colors for some of the graphics is unfortunate. Unless I am colorblind in some way that I am unaware of, they have two thin charcoal lines that are indistinguishable.

When Dave Judkins puts in a request, I’ll respond. Also, I’m always happy to see a new Val Johnson paper. Val and I are contemporaries—he and I got our PhDs at around the same time, with both of us working on Bayesian image reconstruction, then in the early 1990s Val was part of the legendary group at Duke’s Institute of Statistics and Decision Sciences—a veritable ’27 Yankees featuring Mike West, Merlise Clyde, Michael Lavine, Dave Higdon, Peter Mueller, Val, and a bunch of others. I always thought it was too bad they all had to go their separate ways.

Val also wrote two classic papers featuring multilevel modeling, one on adjustment of college grades (leading to a proposal that Duke University famously shot down), and one on primate intelligence.

Anyway, to get to the paper at hand . . . Johnson and Rossell write:

We demonstrate that model selection procedures based on nonlocal prior densities assign a posterior probability of 1 to the true model as the sample size n increases when the number of possible covariates p is bounded by n and certain regularity conditions on the design matrix pertain.

This doesn’t bother me but it doesn’t seem particularly relevant to anything I would study. The true model is never in the set of models I’m fitting. Rather, the true model is always out of reach, a bit more complicated than I ever have the data and technology to fit.

They also write:

In practice, it is usually important to identify not only the most probable model for a given set of data, but also the probability that the identified model is correct.

I take Johnson and Rossell’s word that this describes their practice but it doesn’t describe mine. I know ahead of time that the probability is zero that the identified model is correct.

I’m not trying to be glib here. This is really how I operate. Models, fitting, regularization, prediction, inference: for me, it’s all approximate.

On the practical side, though, the method proposed in the paper might be great. The proposal is for Bayesian regression where each coefficient has a prior distribution that is a mix of a spike at zero and a funny-shaped distribution for the nonzero values. I’d be interested in comparing to a direct Bayesian approach that keeps all the coefficients in the model and just uses a hierarchical prior that partially pools everything to zero.
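To make the contrast concrete, here is a minimal sketch (in Python; the function names and hyperparameters are illustrative, and the slab is shown as a plain normal, whereas the paper’s actual non-local slab is funnier-shaped) of the two prior structures being compared:

```python
import numpy as np
from scipy import stats

def spike_and_slab_density(beta, w=0.5, slab_scale=2.0):
    """Mixture prior: a spike at zero plus a wide slab for nonzero values.
    The point mass is approximated by a very narrow normal so the density
    can be evaluated; w is the prior probability that the coefficient is
    (essentially) zero."""
    spike = stats.norm.pdf(beta, loc=0.0, scale=0.01)
    slab = stats.norm.pdf(beta, loc=0.0, scale=slab_scale)
    return w * spike + (1 - w) * slab

def hierarchical_shrinkage_density(beta, tau=0.5):
    """A single continuous prior (here, a small-scale Student-t) that
    partially pools every coefficient toward zero without an exact point
    mass -- the 'direct Bayesian' alternative mentioned in the post."""
    return stats.t.pdf(beta, df=3, loc=0.0, scale=tau)

# Both priors concentrate mass near zero, but only the mixture puts
# (approximately) a lump of probability exactly at zero.
print(spike_and_slab_density(0.0))
print(hierarchical_shrinkage_density(0.0))
```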

P.S. To answer Dave’s implicit question: I think Figure 1 would’ve worked better as three small graphs on a common scale. It would be more readable and actually take up less space. Also, set the y-axis to go to zero at zero, and remove the box (in R talk, use plot(..., bty="l")). Figures 2 through 4 would be better as denser grids of plots; that is, use more graphs and fewer lines per graph. Also, label the lines directly rather than with that legend, and for chrissake don’t have a probability scale that goes below 0 and above 1. Actually, what’s with those y-axes? 0, .002, .039, .5, .961, .998, 1. Huh?
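For what it’s worth, the same advice translates directly to Python/matplotlib; this is just a sketch with placeholder data (not a reconstruction of the paper’s figures): small multiples on a common scale, y-axis down to zero, no box, and direct line labels instead of a legend.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
series = [np.exp(-x / s) for s in (1.0, 2.0, 4.0)]  # placeholder curves

# Three small panels on a common scale.
fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharey=True)
for ax, y, title in zip(axes, series, ("panel A", "panel B", "panel C")):
    ax.plot(x, y)
    ax.set_title(title)
    ax.set_ylim(bottom=0)                   # y-axis goes to zero at zero
    ax.spines["top"].set_visible(False)     # drop the box, like bty="l" in R
    ax.spines["right"].set_visible(False)
    ax.annotate(title, xy=(x[-1], y[-1]))   # label the line directly

fig.savefig("small_multiples.png")
```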

I guess wacky is in the mind of the beholder (are the priors wacky for their purposefulness?)

I like Figure 5, except for the spike – I believe only the smartest among us can avoid stepping on one at some point, and infinite points are really sharp ;-)

By the way, there is a video on some of that history here: http://stat.duke.edu/ (25th anniversary)

> I know ahead of time that the probability is zero that the identified model is correct. I’m not trying to be glib here. This is really how I operate. Models, fitting, regularization, prediction, inference: for me, it’s all approximate.

Same here. I had a stat mech professor who used to say, “The difference between physicists and physical chemists is that physical chemists understand how to approximate.” At the risk of stating the obvious, he’s a physical chemist.

In my work my goal has typically been to model effects with sufficient fidelity that there are no glaring systematic residuals when the effect is present – systematic residuals should be small relative to the noise inherent in the data. The principal constraint in developing signal models is usually that they be computationally tractable, i.e., that the code which follows from the H0 and H1 hypotheses be able to render "accept H1"/"reject H1" decisions on a timescale relevant to the user. For most of the applications I’ve worked on, getting reasonably accurate answers now (say, within 100 ms) is much more useful than getting extremely accurate answers tomorrow or next week. This generally requires compromises in model fidelity; the challenge is identifying which compromises you can live with. What matters is whether the models yield operationally useful decisions, not whether they are rigorously correct from first principles.

I’m not sure what you mean by wacky priors, but I’ll take a stab at it. I’m familiar with Val’s non-local prior work. A key idea is that the prior for an alternative hypothesis needs to really be a distinct alternative, i.e. not overlap with the null. This allows for much faster convergence of the Bayes factors.
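To illustrate the non-local idea, here is a minimal sketch assuming a first-order moment prior in the Johnson–Rossell style (the scale tau and the function name are illustrative): a normal density multiplied by theta-squared and renormalized, so the prior density under the alternative is exactly zero at the null value.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def moment_prior_pdf(theta, tau=1.0):
    """First-order moment (non-local) prior: a N(0, tau) density times
    theta^2, renormalized by E[theta^2] = tau. The density vanishes at
    theta = 0, so the alternative does not overlap the null."""
    return theta**2 * stats.norm.pdf(theta, scale=np.sqrt(tau)) / tau

# Zero at the null value, and still a proper density.
print(moment_prior_pdf(0.0))                  # 0.0
total, _ = quad(moment_prior_pdf, -np.inf, np.inf)
print(round(total, 6))                        # ~1.0
```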

Now you could say that the two priors are unbelievable because each assigns zero probability to some region. Neither prior captures anyone’s subjective prior belief – at least not anyone who follows Cromwell’s rule. But this makes sense in hypothesis testing: you don’t have one coherent view of the world; you’re deciding between two competing views.

This took me a while to appreciate when I was working with Val: I was so steeped in the estimation mindset, where his priors would be inappropriate, that I was slow to see his point.

John:

Your comment is consistent with the title of my post: the prior can be “wacky” in the sense of not being a reasonable probability model, but still “work well” in the sense of delivering good statistical procedures. What you say makes sense, that this all might be possible in a context where the “good statistical procedures being delivered” are hypothesis tests rather than estimates.

In that sense, this is all somewhat related to our recent work on boundary-avoiding priors for point estimates in hierarchical models. These are priors that don’t make sense as priors but perform well if the goal is point (rather than fully Bayes) estimation.