Lots of good statistical methods make use of two models. For example:

– Classical statistics: estimates and standard errors using the likelihood function; tests and p-values using the sampling distribution. (The sampling distribution is *not* equivalent to the likelihood, as has been much discussed, for example in sequential stopping problems.)

– Bayesian data analysis: inference using the posterior distribution; model checking using the predictive distribution (which, again, depends on the data-generating process in a way that the likelihood does not).

– Machine learning: estimation using the data; evaluation using cross-validation (which requires some rule for partitioning the data, a rule that stands outside of the data themselves).

– Bootstrap, jackknife, etc.: estimation using an “estimator” (which, I would argue, is based in some sense on a model for the data), uncertainties using resampling (which, I would argue, is close to the idea of a “sampling distribution” in that it requires a model for alternative data that could arise).

This commonality across these very different statistical procedures suggests to me that thinking on parallel tracks is an important and fundamental property of statistics. Perhaps, rather than trying to systematize all statistical learning into a single inferential framework (whether it be Neyman-Pearson hypothesis testing, Bayesian inference over graphical models, or some purely predictive behavioristic approach), we would be better off embracing our twoishness.

This relates to my philosophizing with Shalizi on falsification, Popper, Kuhn, and statistics as normal and revolutionary science.

Twoishness also has relevance to statistical practice in focusing one’s attention on both parts of the model. To see this, step back for a moment and consider the transition from optimization problems such as “least squares” to model-based inference such as “maximum likelihood under the normal distribution.” Moving from the procedure to the model was a step forward in that models can be understood, checked, and generalized, in a way that is more difficult with mere procedures. Or maybe I will take a slightly more cautious and thus defensible position and say that, if the goal is to understand, check, and generalize a learning algorithm (such as least squares), it can help to understand its expression as model-based inference.
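To make the least-squares example concrete, here is the standard textbook derivation, sketched under the usual assumptions. The notation ($y_i$, $x_i$, $\beta$, $\sigma$) is introduced here for illustration and does not come from the discussion above.

```latex
\begin{align*}
% Assumed data model: independent normal errors with constant variance
y_i \mid x_i &\sim \mathrm{N}\!\left(x_i^\top \beta,\; \sigma^2\right),
  \qquad i = 1, \dots, n \\
% Log-likelihood under that model
\log p(y \mid \beta, \sigma) &= -\frac{n}{2} \log\!\left(2\pi\sigma^2\right)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 \\
% For any fixed sigma, maximizing over beta is minimizing the sum of squares
\hat{\beta}_{\mathrm{ML}} &= \arg\max_{\beta} \log p(y \mid \beta, \sigma)
  = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2
\end{align*}
```

Written this way, the procedure's hidden assumptions (independence, constant variance, normal errors) are out in the open, where each can be checked against data or relaxed.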

Now back to the two levels of models. Once we recognize, for example, that bootstrap inference has two models (the implicit data model underlying the estimator, and the sampling model for the bootstrapping), we can ask questions such as:

– Are the two models coherent? Can we learn anything from the data model that will help with the sampling model, and vice-versa?

– What sampling model should we use? This is often treated as automatic or as somewhat of a technical problem (for example, how to bootstrap time series data), but ultimately, as with any sampling problem, it should depend on the problem context.

Recognizing the bootstrapping step as a model (rather than simply a computational trick), the user is on the way to choosing the model rather than automatically taking the default.
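A minimal sketch of what choosing the sampling model can look like, assuming a simulated autocorrelated series and the sample mean as the estimator. Everything here is illustrative: the AR(1) setup, the function names, the block length of 20, and the number of bootstrap draws are all assumptions made for the example. The same estimator is paired with two different resampling models, independent draws and a moving-block scheme.

```python
import random
import statistics


def simulate_ar1(n, phi, rng):
    """Simulate an AR(1) series x_t = phi * x_{t-1} + e_t with standard normal e_t."""
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def iid_resample(series, rng):
    """Sampling model 1: observations are exchangeable, so draw them independently."""
    return rng.choices(series, k=len(series))


def block_resample(series, block_len, rng):
    """Sampling model 2: keep short-range dependence by drawing contiguous blocks."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]


def bootstrap_se(series, resample, n_boot, rng):
    """Standard error of the sample mean under a given resampling model."""
    means = [statistics.fmean(resample(series, rng)) for _ in range(n_boot)]
    return statistics.stdev(means)


rng = random.Random(0)
series = simulate_ar1(400, 0.8, rng)

se_iid = bootstrap_se(series, iid_resample, 500, rng)
se_block = bootstrap_se(series, lambda s, r: block_resample(s, 20, r), 500, rng)

print(f"SE of mean, iid bootstrap:   {se_iid:.3f}")
print(f"SE of mean, block bootstrap: {se_block:.3f}")
```

With positively autocorrelated data, the independent-draw model typically reports a noticeably smaller standard error than the block model, even though the estimator is identical. Which answer is appropriate depends on how the data could plausibly have come out differently, which is a modeling question and not a computational one.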

Where does the twoishness come from? That’s something we can discuss. There are aspects of sampling distributions (for example, sequential design) that don’t arise in the data at hand, and there are aspects of inference (for example, regularization) that don’t come from the sampling distribution. So it makes sense to me that two models are needed.
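The sequential-design point can be illustrated with a classic textbook coin-tossing setup (the specific numbers are an assumption for the example, not something from the discussion above): 9 heads and 3 tails are observed, and the null hypothesis is a fair coin. A minimal sketch compares two stopping rules that yield the same data and the same likelihood, up to a constant, but different one-sided p-values.

```python
from math import comb

heads, tails = 9, 3
n = heads + tails

# Sampling model A: the number of tosses (12) was fixed in advance (binomial).
# p-value: probability of 9 or more heads in 12 fair tosses.
p_fixed_n = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

# Sampling model B: toss until the 3rd tail appears (negative binomial).
# "9 or more heads before the 3rd tail" is the same event as
# "at most 2 tails in the first 11 tosses".
m = n - 1
p_stop_at_tails = sum(comb(m, k) for k in range(tails)) / 2**m


def lik_fixed_n(theta):
    """Binomial likelihood of 9 heads in 12 tosses, heads probability theta."""
    return comb(n, heads) * theta**heads * (1 - theta) ** tails


def lik_stop_at_tails(theta):
    """Negative binomial likelihood of 9 heads before the 3rd tail."""
    return comb(n - 1, heads) * theta**heads * (1 - theta) ** tails


# The two likelihoods differ only by a constant factor that is free of theta.
ratios = [lik_fixed_n(t) / lik_stop_at_tails(t) for t in (0.2, 0.5, 0.8)]

print(f"p-value, fixed n = 12:      {p_fixed_n:.4f}")        # 299/4096, about 0.0730
print(f"p-value, stop at 3rd tail:  {p_stop_at_tails:.4f}")  # 67/2048, about 0.0327
print("likelihood ratios:", [round(r, 6) for r in ratios])   # all equal to 4
```

Any inference that uses only the likelihood treats the two designs identically, while the p-value changes with the stopping rule. The stopping rule is information about the sampling distribution that is not contained in the data at hand.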

Great post! I've always felt classical and Bayesian methods solve different problems, and are not perfect substitutes. Glad to see this articulated so well.

The source of twoishness is the essence of testing. I use some method to draw some conclusion and I use some other method to see if the first method makes sense. There's really no way to use only one method, because you'll never figure out whether the model doesn't fit at all or the data are just noisy (and of course that isn't really a binary choice, but it is, it seems to me, where the twoishness comes from). The second method helps sort that out.

Can you suggest some refs discussing the difference between likelihood and sampling distributions, assuming it's more than simply adding a prior term?

This is an important conversation to have. I am always dissatisfied when cross-validation is "explained" to me by describing the procedure. That's not an explanation…

Should these procedures be thought of as approaches to model selection, without appealing to the theoretical assumptions/pitfalls underlying AIC/BIC/Bayes factors? Could these approaches be considered to be "empirical Bayes model selection" of some sort?

Joshua:

See exercise 6.6 of Bayesian Data Analysis for an example.

> Where does the twoishness come from? That's something we can discuss. There are aspects of sampling distributions (for example, sequential design) that don't arise in the data at hand, and there are aspects of inference (for example, regularization) that don't come from the sampling distribution. So it makes sense to me that two models are needed.

I think I understand. Is "twoishness" this?: [1] a model of reality, possibly with informed changes over time, [2] a cohesive model of possible or realized experimental interventions and data collected, over time or as a single event, [1+2] how both relate to each other.

The point being that [1] we have, situationally, a preferred family of models of reality, [2] we have a workable model of possible experiments, and [1+2A] we expect our current best model of reality to inform our experiments and [1+2B] we expect our latest experimental data to update our best model of reality. The demands of [1] [2] [1+2A] [1+2B] are so rigorous that any choice of one part forces a type of choice for the rest of the parts.

Your blog post got me off my duff to talk about some related topics on my blog. The issue of where the heck the likelihood comes from is one I think about a lot, as our group works on statistical methods for dynamical systems and physics-type models:

http://models.street-artists.org/?p=1010

Specifying a statistical model for dynamics experiments is a pretty hard problem by itself, and there are often not a lot of "standard" methods that make any sense in this domain.

An interesting application in medicine is also topical due to the rise of pertussis; see here:

http://www.ncbi.nlm.nih.gov/pubmed/19876392