Comparing freq. and Bayesian model selection

Gustaf Granath writes:

I am an ecologist. I have been struggling with a problem for some time now and even asked some statisticians about this. It would be interesting for me (and maybe other people reading your blog) to hear your opinion. So far, I have not received a satisfying answer from anyone.

I am doing a meta-analysis (in ecology, with normally distributed data) using two different approaches. My first approach is a frequentist mixed model, assuming independence of each sample. The second approach is a hierarchical Bayesian model, modelling the dependence structure in the data set (e.g. multiple outcomes from each study). I want to investigate whether my covariates are important, and since I have many candidate covariates, I need to do some kind of model selection. My question is then: is there a model selection tool that can be applied to both approaches?

For the frequentist approach I would use AICc for model selection. But can I use AICc for the Bayesian models as well? I guess not (p. 525 in “Data analysis in..”) but some people have recommended it.

Another way to compare the two approaches would be to use p-values and backward/forward selection. This is very straightforward (but criticized) for the frequentist approach, and some kind of multiple-test adjustment should probably be used, although I don't think I have seen a method to correct for multiple tests in forward/backward selection (FDR?). P-values can be calculated for the Bayesian approach as well (using posterior probabilities), but is it acceptable (I am not saying it is the best method) to use those p-values for model selection?

You might think that this is a stupid problem (I agree); you either do a Bayesian or a frequentist approach. However, since hierarchical meta-analysis models perform poorly in a frequentist framework, a Bayesian approach is a must, but people still want to see the classic frequentist mixed model for comparison.

My reply:

1. You can use AIC in a Bayesian setting. You just have to account for uncertainty. One way of doing this is DIC, which is not perfect (see, for example, here), but the basic idea is sound: it’s to evaluate models based on some assessment of their cross-validated predictive accuracy.
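
For reference, the standard DIC construction (Spiegelhalter et al. 2002), written in terms of the deviance $D(\theta) = -2\log p(y \mid \theta)$:

$$
\bar{D} = \mathrm{E}_{\theta \mid y}\!\left[D(\theta)\right], \qquad
p_D = \bar{D} - D(\bar{\theta}), \qquad
\mathrm{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2\,p_D,
$$

where $\bar{\theta}$ is the posterior mean and $p_D$ is the effective number of parameters; the $2\,p_D$ term plays the same role as the $2k$ penalty in AIC, with the uncertainty in the fit folded into $p_D$.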

2. You write, “since I have many candidate covariates, I need to do some kind of model selection.” Maybe this is true in practice, but ideally I’d prefer keeping all the candidate covariates and just partially pooling them toward a model. To put it another way, there’s no law that you have to estimate all these coefficients by least squares.
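
A minimal sketch of what “keep all the candidate covariates and partially pool” could look like, for a meta-analysis with studies $s$ and candidate covariates $x_1, \dots, x_J$ (the notation and grouping here are illustrative, not Gustaf's actual data structure):

$$
y_{is} = \alpha_s + \sum_{j=1}^{J} \beta_j x_{ijs} + \varepsilon_{is}, \qquad
\alpha_s \sim \mathrm{N}(\mu_\alpha, \tau_\alpha^2), \qquad
\beta_j \sim \mathrm{N}(0, \tau_\beta^2),
$$

so all $J$ coefficients stay in the model but are shrunk toward zero by an amount ($\tau_\beta$) estimated from the data, instead of being kept or dropped by a selection rule.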

3. I don’t worry about multiple tests. The real issue to me is not the inclusion of predictors but rather the implicit rule that their coefficients will be estimated in a non-regularized or barely-regularized way, which gives noisy estimates.

8 thoughts on “Comparing freq. and Bayesian model selection”

  1. 2 and 3 seem to suggest fitting multilevel models while also including covariates whose coefficients are, say, L1-regularized. I wonder what good tools there are for doing this. It's so easy to do either (e.g. with a number of R packages), but perhaps not so easy to do both? I've wanted to do this before.

    I'd also be interested in reading something that relates the regularization of coefficients for categorical predictors via multilevel models to other approaches to regularization. I've read Gelman and Hill and Hastie et al.'s Elements of Statistical Learning, but nonetheless could have a better understanding of this, especially in practice.
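
    A minimal sketch of what “doing both” would mean as an objective function, for a Gaussian response with random study effects $u$ and an L1 penalty on the fixed-effect coefficients $\beta$ (this is only the target criterion, not a claim about how any particular package implements it):

    $$
    \hat{\beta},\,\hat{u} \;=\; \arg\min_{\beta,\,u}\;
    \frac{1}{2\sigma^2}\,\lVert y - X\beta - Zu\rVert^2
    \;+\; \frac{1}{2\tau^2}\,\lVert u\rVert^2
    \;+\; \lambda\,\lVert \beta\rVert_1,
    $$

    i.e. the usual mixed-model joint criterion with a lasso penalty added on the fixed effects: multilevel shrinkage acts on $u$ through $\tau$, while the L1 term acts on $\beta$ through $\lambda$.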

  2. There is one important technical impediment to using AIC in a Bayesian setting: for most non-trivial models (e.g. hierarchical models), there is no unambiguous way to count the number of parameters, since shrinkage makes the effective dimension smaller than the nominal one. In fact, one component of DIC is an estimate of the "effective" number of parameters.

  3. Is it relevant to what your writer wants to point out that frequentist shrinkage (e.g. lasso, elastic net), which does a kind of model selection, is also a Bayesian procedure, taking the MAP estimate under particular priors? Googling "Penalized Regression, Standard Errors, and Bayesian Lassos" brings up an unpublished article which reviews the connections.

    Is there a generalization of LASSO which treats categorical predictors in a fair way?
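
    To spell out the connection (a standard result, independent of the article mentioned above): with a Gaussian likelihood and independent Laplace (double-exponential) priors on the coefficients,

    $$
    y \sim \mathrm{N}(X\beta, \sigma^2 I), \qquad
    p(\beta_j) \propto \exp\!\left(-\lvert \beta_j \rvert / b\right),
    $$

    the posterior mode is

    $$
    \hat{\beta}_{\mathrm{MAP}} \;=\; \arg\min_{\beta}\;
    \lVert y - X\beta \rVert^2 + \lambda\,\lVert \beta \rVert_1,
    \qquad \lambda = 2\sigma^2 / b,
    $$

    which is exactly the lasso; the elastic net corresponds in the same way to a prior whose log density combines Laplace and Gaussian terms.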

  4. Bayesian variable selection? I wrote a paper about that, and included the code. I don't believe it's the right thing to do, but there are times it can be useful (e.g. QTL analysis).

    If Gustaf has piles of covariates, and is using BUGS anyway (hey, and who doesn't? :-)), it's easy to plug in the extra code, and it avoids lots of messing around with different model outputs.
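
    For readers who haven't seen it, one common indicator-variable construction for this in BUGS (in the style of Kuo and Mallick 1998; offered here as a sketch rather than as the exact formulation used in the paper mentioned above) is

    $$
    \beta_j \;=\; \gamma_j\,\theta_j, \qquad
    \gamma_j \sim \mathrm{Bernoulli}(0.5), \qquad
    \theta_j \sim \mathrm{N}(0, \tau^2),
    $$

    so the posterior mean of $\gamma_j$ is an inclusion probability for covariate $j$, and the rest of the model is left untouched.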

  5. Not a common problem?? Unless you have the raw data and the same covariates in all the studies, it will be a real problem to define common (i.e., more technically, exchangeable) parameters to work with across the studies.

    For instance, if you are interested in the effect of X1, the effect of X1 adjusted for c1 is not the same effect as X1 adjusted for c2 – i.e. the adjusted X1 effect is not exchangeable here.
    (Efron and Tibshirani ran into these problems in their work on "Prevalidation", which averaged an effect over different model adjustments.)

    And if you are interested in confounding as opposed to covariate adjustment, there is little to suggest that confounding is exchangeable over different studies ( i.e. some are less bad natural experiments than others ).

    This may not apply to your situation, but it is the usual one I have run into: only study-level summaries, and different covariates available and adjusted for in different studies.

    Keith

  6. @Chris

    I think the answer to that question depends on the "level of focus", as described in Spiegelhalter et al. 2002 (the original DIC paper) and in [Vaida, F., and S. Blanchard. 2005. Conditional Akaike information for mixed-effects models. Biometrika 92:351-370. doi: 10.1093/biomet/92.2.351]. My interpretation (someone please feel free to correct me if I'm wrong!) is that if you're interested in population- rather than individual-level prediction, the "correct" answer is that a random effect is worth approx. 1 df (or perhaps slightly less than 1 df, accounting for boundary problems).

  7. Ben

    I have been looking at papers describing the difference in focus. It made me wonder: do the packages in R (nlme, lme4) give the marginal likelihood (so you get a marginal AIC), or do they give the conditional likelihood?
