Nick Firoozye writes:
I had a question about BMA [Bayesian model averaging] and model combination in general, and I’m directing it to you since these are a basic form of hierarchical model, albeit in the simplest of forms. I wanted to ask what the underlying assumptions are that could lead to BMA improving on a larger model.
I know model combination is a topic of interest in the (frequentist) econometrics community (e.g., Bates & Granger, http://www.jstor.org/discover/10.2307/3008764?uid=3738032&uid=2&uid=4&sid=21101948653381), but at the time it was considered a bit of a puzzle. Perhaps small models combined outperform a big model due to standard errors, insufficient data, etc. But I haven’t seen much in the way of Bayesian justification.
In simplest terms, you might have a joint density P(Y,theta_1,theta_2) from which you could use the two marginals P(Y,theta_1) and P(Y,theta_2) to derive two separate forecasts. A BMA-er, having previously specified a prior over models, would take a weighted average of the two forecast densities. A large-scale modeler would instead work with the larger joint model.
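Concretely, the BMA forecast is just a mixture of the submodel forecasts. Here is a toy sketch, with made-up model weights and normal forecast densities standing in for the two marginals:

```python
# Toy sketch of a BMA forecast density: a weighted mixture of two
# submodel forecast densities. All numbers are invented for illustration.
import numpy as np
from scipy.stats import norm

w1, w2 = 0.6, 0.4  # posterior model probabilities P(M_1 | data), P(M_2 | data)

def p_y_m1(y): return norm.pdf(y, loc=1.0, scale=0.5)  # P(y | M_1, data)
def p_y_m2(y): return norm.pdf(y, loc=1.5, scale=0.8)  # P(y | M_2, data)

def p_y_bma(y):  # P(y | data) = sum_k P(M_k | data) * P(y | M_k, data)
    return w1 * p_y_m1(y) + w2 * p_y_m2(y)

y = np.linspace(-3.0, 6.0, 1801)
dy = y[1] - y[0]
print((p_y_bma(y) * dy).sum())  # ~1.0: the mixture is a proper density
```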
It is as though the BMA-er thought P(Y,theta_1,theta_2) were separable (i.e., that the effects of the two parameters were independent?) and could adequately be represented by some weighted average of the two marginals P(Y,theta_1) and P(Y,theta_2).
In the simplest of cases, why would anyone want to do that? Is it an inability to impose proper priors on the larger parameter space? Would collinearity be an issue? (I know this is less of an issue for a Bayesian than for a frequentist.)
Of course I’m thinking in terms of simple, easily combined models (e.g., a regression on two variables), whereas a BMA-er could easily combine far more challenging models that don’t naturally form a supermodel.
My reply: Conditional on being required to use noninformative priors on each submodel, the strategy of model averaging or model selection can be better than using the larger model. But I agree that, if you’re thinking of fitting the small model or the large model, it makes more sense to use an informative prior that allows for shrinkage directly. As to your other question, about combining incompatible models, I think it best to create a supermodel that continuously expands the original options. Not that I always (or even usually) do this, but I think it’s the right way to go.
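To make that concrete (a schematic example, not from any particular application): if the small model is a regression that fixes beta_2 = 0 and the large model leaves beta_2 free under a flat prior, then a prior such as

$$\beta_2 \sim \mathrm{N}(0, \tau^2)$$

connects the two continuously: tau -> 0 recovers the small model, tau -> infinity the flat-prior large model, and intermediate values give partial shrinkage, which is the sense in which the supermodel continuously expands the original options rather than averaging over the two endpoints.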
Firoozye follows up:
This makes sense, in that one could easily run into problems of ill-posedness when there is insufficient data and the priors are uninformative.
I believe some of the early examples of BMA involved running 2^k regressions and averaging, rather than running a single k-dimensional regression with shrinkage (so much simpler to do a lasso… er… shrinkage estimator, to be honest). And the BMA was meant to be preferable to frequentist sequential model selection, or to using a criterion that involves the 2^k regressions anyway.
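In code, that contrast looks something like the following rough sketch, with BIC weights standing in for marginal likelihoods and invented data:

```python
# Rough sketch of early-style BMA: run all 2^k OLS regressions, weight
# each by exp(-BIC/2) (a crude stand-in for the marginal likelihood),
# and average the coefficients. Then contrast with a single ridge fit,
# i.e., a normal shrinkage prior on the full model.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 4
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

betas, bics = [], []
for mask in itertools.product([0, 1], repeat=k):
    idx = [j for j in range(k) if mask[j]]
    if not idx:
        continue  # skip the empty model for simplicity
    Xs = X[:, idx]
    b = np.linalg.lstsq(Xs, y, rcond=None)[0]
    rss = ((y - Xs @ b) ** 2).sum()
    bics.append(n * np.log(rss / n) + len(idx) * np.log(n))
    full = np.zeros(k)
    full[idx] = b
    betas.append(full)

bics = np.array(bics)
w = np.exp(-0.5 * (bics - bics.min()))  # stabilize before exponentiating
w /= w.sum()
beta_bma = (w[:, None] * np.array(betas)).sum(axis=0)

lam = 1.0  # ridge penalty, equivalent to a normal prior on the coefficients
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
print(beta_bma.round(2), beta_ridge.round(2))
```

Both routes end up shrinking the weak predictors’ coefficients toward zero; the ridge fit just gets there in one regression instead of 2^k.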
Isn’t BMA a method for forming hierarchical models? And if so, aren’t you saying it is better to have a single large non-hierarchical model than a hierarchical one?
If we extend the question to continuous parameters (not just discrete model choices as in BMA), we usually write P(Data | theta) for the likelihood, P(theta | phi) for the prior, and P(phi) for the hyperprior. But couldn’t this all have been done with P(Data | theta, phi) and a joint prior P(theta, phi)? And if we had informative priors on both the parameters and the hyperparameters, we might have a better model yet?
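Formally, of course, the two specifications agree, since any joint prior factors by the chain rule:

$$P(\theta, \phi) = P(\theta \mid \phi)\, P(\phi),$$

so the hierarchy restricts nothing; it is just a convenient way of writing down the joint prior.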
If so, why should we be doing hierarchical models at all, other than that they can be far more intuitive than supermodels?
It’s a genuine interest, because I am very interested in second-order probability as a way of capturing uncertainty together with smooth ambiguity aversion, something I find far more tractable, and even more subtle, than all the very complex imprecise-probability settings, which for all their complications only lead to the min-max decision rules that would come out of belief/plausibility functions, etc. While Dempster and Shafer do extend Bayes’ theorem to their special case, their theory just seems so much more complex than using a hierarchical model. Second-order probability is merely a hierarchical model, with weights put on a family of probability measures, and it is so much more intuitive than belief functions.
Indeed, discrete model averaging can be seen as a sort of implementation of continuous model expansion in which the probability of setting a coefficient to zero is a way to get some shrinkage. I just don’t see it as a model that makes much sense in the applications I work on. For similar reasons, I have no particular interest in seeing which sets of predictors the fitted model wants me to include.
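For concreteness, one standard way to write that down for a regression coefficient (not a form I’m endorsing, just the usual notation) is the spike-and-slab prior

$$\beta_j \sim (1-\pi)\,\delta_0 + \pi\,\mathrm{N}(0, \tau^2),$$

a mixture of a point mass at zero and a continuous slab, so that averaging over whether a predictor is in or out is literally a shrinkage prior.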
On your larger question: yes, I think hierarchical priors can be useful in specifying dependent uncertainty. One of my favorite examples is the two-level prior distribution in our toxicology paper (an example we also discuss in BDA). As for belief functions, they just mystify me (see example here).