Lots of good statistical methods make use of two models. For example:
– Classical statistics: estimates and standard errors using the likelihood function; tests and p-values using the sampling distribution. (The sampling distribution is not equivalent to the likelihood, as has been much discussed, for example in sequential stopping problems.)
– Bayesian data analysis: inference using the posterior distribution; model checking using the predictive distribution (which, again, depends on the data-generating process in a way that the likelihood does not).
– Machine learning: estimation using the data; evaluation using cross-validation (which requires some rule for partitioning the data, a rule that stands outside of the data themselves).
– Bootstrap, jackknife, etc: estimation using an “estimator” (which, I would argue, is based in some sense on a model for the data), uncertainties using resampling (which, I would argue, is close to the idea of a “sampling distribution” in that it requires a model for alternative data that could arise).
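To make the bootstrap case concrete, here's a minimal sketch in Python (made-up normal data; the estimator and the resampling rule are just the simplest choices, not a recommendation): the sample mean plays the role of the implicit data model, and i.i.d. resampling with replacement plays the role of the sampling model.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100)  # hypothetical data

# Model 1 (implicit data model): the estimator, here the sample mean
estimate = y.mean()

# Model 2 (sampling model): i.i.d. resampling with replacement,
# recomputing the estimator on each replicated dataset
boot = np.array([rng.choice(y, size=y.size, replace=True).mean()
                 for _ in range(2000)])
se = boot.std()  # bootstrap standard error of the mean
```

The two models are separate choices: you could keep the estimator and swap in a different resampling scheme, or vice versa.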
This commonality across very different statistical procedures suggests to me that thinking on parallel tracks is an important and fundamental property of statistics. Perhaps, rather than trying to systematize all statistical learning into a single inferential framework (whether it be Neyman-Pearson hypothesis testing, Bayesian inference over graphical models, or some purely predictive behavioristic approach), we would be better off embracing our twoishness.
This relates to my philosophizing with Shalizi on falsification, Popper, Kuhn, and statistics as normal and revolutionary science.
Twoishness also has relevance to statistical practice by focusing one’s attention on both parts of the model. To see this, step back for a moment and consider the transition from optimization problems such as “least squares” to model-based inference such as “maximum likelihood under the normal distribution.” Moving from the procedure to the model was a step forward in that models can be understood, checked, and generalized, in a way that is more difficult with mere procedures. Or maybe I will take a slightly more cautious and thus defensible position and say that, if the goal is to understand, check, and generalize a learning algorithm (such as least squares), it can help to understand its expression as model-based inference.
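Here's a quick numerical illustration of that equivalence (toy regression data of my own construction): the least-squares coefficients from the normal equations coincide with the coefficients that maximize the normal log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 0.8 * x + rng.normal(0, 1, size=50)  # hypothetical data
X = np.column_stack([np.ones_like(x), x])

# The procedure: least squares via the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The model: maximize the normal log-likelihood over (beta, log sigma)
def neg_loglik(theta):
    beta, log_sigma = theta[:2], theta[2]
    resid = y - X @ beta
    return resid.size * log_sigma + 0.5 * np.sum(resid**2) / np.exp(2 * log_sigma)

beta_mle = minimize(neg_loglik, x0=np.zeros(3)).x[:2]
# beta_mle agrees with beta_ols (up to optimizer tolerance)
```

The payoff of the model formulation is that each piece (the linearity, the normality, the constant variance) is now an explicit assumption that can be checked or generalized.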
Now back to the two levels of models. Once we recognize, for example, that bootstrap inference has two models (the implicit data model underlying the estimator, and the sampling model for the bootstrapping), we can ask questions such as:
– Are the two models coherent? Can we learn anything from the data model that will help with the sampling model, and vice-versa?
– What sampling model should we use? This is often treated as automatic or as somewhat of a technical problem (for example, how do you bootstrap time series data), but ultimately, as with any sampling problem, it should depend on the problem context.
Recognizing the bootstrapping step as a model (rather than simply a computational trick) puts the user on the way to choosing the model deliberately rather than automatically taking the default.
Where does the twoishness come from? That’s something we can discuss. There are aspects of sampling distributions (for example, sequential design) that don’t arise in the data at hand, and there are aspects of inference (for example, regularization) that don’t come from the sampling distribution. So it makes sense to me that two models are needed.