
Useful models, model checking, and external validation: a mini-discussion

I sent a copy of my paper (coauthored with Cosma Shalizi) on Philosophy and the practice of Bayesian statistics in the social sciences to Richard Berk, who wrote:

I read your paper this morning. I think we are pretty much on the same page about all models being wrong. I like very much the way you handle this in the paper. Yes, Newton’s work is wrong, but surely useful. I also like your twist on Bayesian methods. Makes good sense to me. Perhaps most important, your paper raises some difficult issues I have been trying to think more carefully about.

1. If the goal of a model is to be useful, surely we need to explore what “useful” means. At the very least, usefulness will depend on use. So a model that is useful for forecasting may or may not be useful for causal inference.

2. Usefulness will be a matter of degree. So for each use we will need one or more metrics to represent how useful the model is. In what looks at first to be a simple example, if the use is forecasting, forecasting accuracy by something like MSE may be a place to start. But that will depend on one’s forecasting loss function, which might not be quadratic or even symmetric. This is a problem I have actually been working on and have some applications appearing. Other kinds of use imply a very different set of metrics — what is a good usefulness metric for causal inference, for instance?

3. It seems to me that your Bayesian approach is one of several good ways (and not mutually exclusive ways) of doing data analysis. Taking a little liberty with what you say, you try a form of description and if it does not capture well what is in the data, you alter the description. But like use, it will be multidimensional and a matter of degree. There are these days so many interesting ways that statisticians have been thinking about description that I suspect it will be a while (if ever) before we have a compelling and systematic way to think about the process. And it goes to the heart of doing science.

4. I guess I am uneasy with your approach when it uses the same data to build and evaluate a model. I think we would agree that out-of-sample evaluation is required.

5. There are also some issues about statistical inference after models are revised and re-estimated using the same data. I have attached a recent paper written for criminologists, co-authored with Larry Brown and Linda Zhao, that appeared in Quantitative Criminology. It is frequentist in perspective. Larry and Ed George are working on a Bayesian version. Along with Andreas Buja and Larry Shepp, we are working on appropriate methods for post-model-selection inference, given that current practice is just plain wrong and often very misleading. Bottom line: what does one make of Bayesian output when the model involved has been tuned to the data?

My reply:

I agree with your points #1 and #2. We always talk about a model being “useful” but the concept is hard to quantify.
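To make Berk's point #2 concrete, here is a toy sketch (the distribution and the loss weights are made up for illustration): under squared-error loss the optimal point forecast is the predictive mean, but under an asymmetric "linlin" loss, where under-forecasting costs three times as much as over-forecasting, the optimal forecast shifts to a quantile. The same model can therefore look better or worse depending on the loss function.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed future outcomes, drawn from a lognormal
outcomes = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)

def expected_loss(forecast, loss):
    """Average loss of a point forecast over the simulated outcomes."""
    return loss(outcomes - forecast).mean()

quadratic = lambda e: e**2
# "linlin" loss: under-forecasting (e > 0) costs 3x as much as over-forecasting
linlin = lambda e: np.where(e > 0, 3.0 * e, -1.0 * e)

grid = np.linspace(0.01, 5.0, 500)
best_sq = grid[np.argmin([expected_loss(f, quadratic) for f in grid])]
best_ll = grid[np.argmin([expected_loss(f, linlin) for f in grid])]

print(best_sq)  # near the mean, exp(0.5) ~ 1.65
print(best_ll)  # near the 75th percentile, exp(0.674) ~ 1.96
```

Under linlin loss with weights a and b, the optimal forecast is the a/(a+b) quantile — here the 0.75 quantile — so two forecasters using different loss functions will rank the same model differently.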

I also agree with #3. Bayes has worked well for me but I’m sure that other methods could work fine also.

Regarding point #4, the use of the same data to build and evaluate the model is not particularly Bayesian. I see what we do as an extension of non-Bayesian ideas such as chi^2 tests, residual plots, and exploratory data analysis, all of which, in different ways, are methods for assessing model fit using the data that were used to fit the model. In any case, I agree that out-of-sample checks are vital to true statistical understanding.

To put it another way: I think you’re imagining that I’m proposing within-sample checks as an alternative to out-of-sample checking. But that’s not what I’m saying. What I’m proposing is to do within-sample checks as an alternative to doing no checking at all, which unfortunately is the standard in much of the Bayesian world (abetted by the subjective-Bayes theory/ideology). When a model passes a within-sample check, it doesn’t mean the model is correct. But in many many cases, I’ve learned a lot from seeing a model fail a within-sample check.
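As a toy illustration of the kind of within-sample check I have in mind (this example is invented here, not taken from the paper): fit a Poisson model to count data that in fact have excess zeros, simulate replicated datasets from the fitted model, and compare the observed number of zeros with the replicated ones. Passing such a check would not make the model correct, but failing it is informative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with excess zeros: a plain Poisson model will be wrong here
n = 200
y = np.where(rng.random(n) < 0.3, 0, rng.poisson(5.0, size=n))

# Fit Poisson(lambda) with a Gamma(1, 1) prior; by conjugacy the
# posterior is Gamma(1 + sum(y), 1 + n).
a_post, b_post = 1 + y.sum(), 1 + n

# Within-sample check: simulate replicated datasets from the fitted
# model and compare the number of zeros, T(y) = #{i : y_i = 0}.
T_obs = (y == 0).sum()
T_rep = []
for _ in range(1000):
    lam = rng.gamma(a_post, 1 / b_post)   # draw lambda from the posterior
    y_rep = rng.poisson(lam, size=n)      # replicated dataset
    T_rep.append((y_rep == 0).sum())

p = np.mean(np.array(T_rep) >= T_obs)     # posterior predictive p-value
print(p)  # essentially 0: the model cannot reproduce the excess zeros
```

Note that this check uses the same data twice, once to fit and once to compare, yet it still reveals the misfit; that is the sense in which within-sample checking beats no checking at all.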

Regarding your very last point, there is some classic work on Bayesian inference accounting for estimation of the prior from data. This is the work of various people in the 1960s and 1970s on hierarchical Bayes, when it was realized that “empirical Bayes” or “estimating the prior from data” could be subsumed into a larger hierarchical framework. My guess is that such ideas could be generalized to a higher level of the modeling hierarchy.


  1. Bill Jefferys says:

    Nice paper, Andrew.

    There's a small typo on p. 3 where you wrote: "This can does not fit our own experiences of learning by finding that a model doesn't fit and needing to expand beyond the existing class of models to fix the problem." Delete 'can'.

  2. J.J. Hayes says:

What I found interesting in your paper, vis-a-vis Popper, is that Popper actually made much of the notion that when we test theories we are usually developing tests to choose between two theories. It is with Kuhn that I associate the notion that revolutions occur in science when the number of facts that don't fit the existing paradigm reaches a crisis point. In your paper you embrace the notion of testing without the necessity of an alternative hypothesis, i.e. for each model intentionally putting oneself in that crisis position of knowing things aren't quite right but without a proposed alternative to turn to. It is Popper's notion of self-criticism, testing, and falsification for sure, but without the net of an alternative hypothesis. It is like some Kuhnian crisis instigation at every step of the way. Which is very cool.

    I can therefore understand the temptation to think of scientific revolutions fractally, but we may be running the risk of confusing hypothesis and theory with paradigm. I think the crises that Kuhn talks about have to be viewed as challenges to a "normal" science that an entire generation of scientists has been taught, and within whose worldview they conduct their research, something in which many people have invested entire lifetimes. Kuhn is talking, after all, about a sociological phenomenon as it were, whereas Popper is talking about an epistemological approach.

  3. manuelg says:

    quoting Gelman:

    "We always talk about a model being "useful" but the concept is hard to quantify."

    Simply build a model of costs and gains and methods of comparison between models! If a model is good enough for your work, a model must be good enough as a working definition of "useful"!

    Sometimes the best answer to "Why" is "Just because". Sometimes the best mechanism for rating different models is another model. The Skeptics will always howl, so you simply have to demonstrate that their own behavior is consistent with putting undue confidence in their own model, whether a conscious model or unconscious. (And, it must simply terminate with a model, because of the limits of the tools available to the human brain. Only a model, probably over-simplified, can be manipulated with the agility needed to predict future outcomes of the universe from actions considered now, in real-time.)

    Just keep asking the Skeptic "Why" with regards to their own personal actions, and when they hit the "Just because" point, they probably have described a model of utility, assumed true without proof, as an answer to "Why" in the previous step.

    If the Skeptic refuses ultimate responsibility over their personal actions, and tries to plead pure capriciousness or mystery, then their model is simply statistical, based on stimulus and internal states (like stress) that can be approximately discovered with objective external measures (like galvanic skin response). Of course, it is easier to plead pure capriciousness or mystery than demonstrate it – if their behavior is well predicted by a deterministic model suggested by another, the Skeptic is shut up. Most times the reason for behaviors is gross and banal, no matter how elevated the sophistry of the Skeptic.

  4. edkupfer says:

    The link in Berk's reply point #5 is broken. I'm not sure if it intended to go anywhere, but I am interested in taking a look.

  5. Kaiser says:

This is a great applied paper; it is good to have some discipline in model-building. I hope some economists read it, as it seems very relevant to their model-building.
    I think some sort of systematic way to query the model space, kind of like stepwise regression (in the "old world"), should be developed. Stepwise has its problems, but at least it is a fully fledged process that one can follow with reasonable results.
    When I read the example in your paper, I wondered: do you have to "backtest" the new model developed for 2008 on the 2004 data to see if it is stable? Does the check falsify the model for the 2008 data or for all years? More generally, how do we accept or reject candidate models on a large scale?

  6. Koala says:

What does everyone think of this Wired article:

To me, words like "end of scientific process" indicate a state of mind that says "forgo priors, forgo models," which will lead to overfitting and ultimate failure.

  7. J.J. Hayes says:

I could nitpick this article, starting with the notion that Google doesn't use semantic or causal analysis, which I believe is simply wrong since Google has, I think, had to constantly change its algorithm to stop from being gamed by phony links etc., or that I suspect there is no such thing as even a correlation without a prior model. But really the whole idea is misconceived and infected with a sort of essentialism. It relies upon the buzzwords "science" or "biology" and the like, as if they are anything but labels for things people do of a certain nature. SOMETHING advances with just finding correlations without causation, it may even be part of what we call "science", and there may be tons of people interested in finding correlations or letting computers find correlations, but there is something else which has traditionally been labelled as "science" in which people endeavor precisely to find the causes of things, and which often relies on correlation as a tool to that end. So yeah we can use a metaphorical cloud (there is no cloud, just lots and lots of hardware and software) to generate connections and correlations. That's cool. But it's just not what a lot of us are interested in. But keep the correlations coming and we'll keep trying to think of models that might explain what CAUSES them to be that way.

  8. K? O'Rourke says:

    Having only read the first paragraph of the Wired article, my model of it suggests it would not be worth reading.

Any representation or sign (something that stands to someone for something in some sense) is a model. Our representations of ourselves to ourselves (consciousness) are models (and as we all know and fear, they are all wrong).

    And it would be just _wired_ to have a model of “not having a model”.


  9. Otto says:

A useful clarification in

    "The ironic thing is that even the article's author, Chris Anderson, doesn't believe the idea. I saw him later that summer at Google and asked him about the article, and he said "I was going for a reaction." That is, he was being provocative, presenting a caricature of an idea, even though he knew the idea was not really true."

  10. Tom Moertel says:

    I'm jumping into the conversation late (ah, the perils of vacationing away from the Internet), but I'm interested in the following claim:

The main point where we disagree with many Bayesians is that we do not think that Bayesian methods are generally useful for giving the posterior probability that a model is true, or the probability for preferring model A over model B, or whatever. Bayesian inference is good for deductive inference within a model, but for evaluating a model, we prefer to compare it to data … without requiring that a new model be there to beat it.

    Since you seem to prefer graphical methods for model-checking, are your model checks not inherently an application of (subconscious) Bayesian inference? When you look at graphs and see that the data "fit" (or not), are you not actually comparing two models — the model you want to check vs. the model your brain's marvelous visual machinery infers from the data? Doesn't "falsification" occur precisely when your brain believes that the second model has beaten the first?


  11. Andrew Gelman says:


    No. First, the alternative I'm considering is not a single model. Second, if the second model "has beaten" the first (as you put it), this is a statement about fit, not about posterior probability of either model being correct. As I've discussed in various places, the posterior probability of a model is typically not well defined, as it tends to depend strongly on aspects of the prior distribution that have essentially no impact on the posterior distribution of the parameters within the model.

  12. K? O'Rourke says:

    Andrew: if you would be so kind – an example to flesh out the meaning of your last clause "as it tends to depend …" might be helpful for many.

    (I did search your pdf in the link for some matching phrases, and none were found)


  13. K? O'Rourke says:

    And for "about fit, not about posterior" this reminds me of the old quandry for formalizing preferences.

    Would you prefer lemon, apple or cherry pie?


    Oh, we also have banana cream.

    Given you have mentioned that, I'll have cherry pie.


  14. Andrew Gelman says:


    See chapter 6 of BDA, especially section 6.7, in particular the second example on page 185.
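    In miniature (a toy normal-mean calculation with made-up numbers, not the BDA example itself): widen the prior on the mean from sd 10 to sd 1000 and the within-model posterior for the parameter is essentially unchanged, while the marginal likelihood, and hence any posterior model probability built from it, drops by roughly a factor of 100.

```python
from math import sqrt, pi, exp

# Model: y_i ~ N(theta, 1), prior theta ~ N(0, tau^2).
# Hypothetical data summary: n = 100 observations with mean 0.3.
n, ybar = 100, 0.3

def summaries(tau):
    # Posterior of theta | y (conjugate normal-normal update)
    prec = n + 1 / tau**2
    post_mean = n * ybar / prec
    post_sd = 1 / sqrt(prec)
    # Marginal likelihood of the sufficient statistic:
    # ybar ~ N(0, tau^2 + 1/n) under the model
    v = tau**2 + 1 / n
    marg = exp(-ybar**2 / (2 * v)) / sqrt(2 * pi * v)
    return post_mean, post_sd, marg

for tau in (10.0, 1000.0):
    m, s, ml = summaries(tau)
    print(f"tau={tau:7.1f}  posterior {m:.4f} +/- {s:.4f}  marginal {ml:.2e}")
```

    The prior scale tau is nearly irrelevant to inference about theta within the model, but it multiplies straight through the marginal likelihood, which is why posterior model probabilities are so poorly defined when the prior is only weakly specified.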