## I (almost and inadvertently) followed Dan Kahan’s principles in my class today, and that was a good thing (would’ve even been more of a good thing had I realized what I was doing and done it better, but I think I will do better in the future, which has already happened by the time you read this; remember, the blog is on a nearly 2-month lag)

As you might recall, the Elizabeth K. Dollard Professor says that to explain a concept to an unbeliever, explain it conditionally. For example, if you want to talk evolution with a religious fundamentalist, don’t try to convince him or her that evolution is true; instead preface each explanation with, “According to the theory of evolution . . .” Your student can then learn evolution in a comfortable manner, as a set of logical rules that explain certain facts about the world. There’s no need for the student to believe or accept the idea that evolution is a universal principle; he or she can still learn the concepts. Similarly with climate science or, for that matter, rational choice or various other political-science models.

Anyway, in my Bayesian data analysis class, I was teaching chapter 6 on model checking, and one student asked me about criticisms of posterior predictive check as having low power, and another asked why we don’t just do cross-validation. I started to argue with them and give reasons, and then I paused and gave the conditional explanation, something like this:

It’s important in some way or another to check your model. If you don’t, you can run into trouble. Posterior predictive check is one way to do it. Another way is to compare inferences and predictions from your model to prior information that you have (i.e., “do the predictions from your model make sense?”); this is another method discussed in chapter 6. Yet another approach is sensitivity analysis, and there’s also model expansion, and also cross-validation (discussed in chapter 7). And all sorts of other tools that I have not personally found useful but others have.

Posterior predictive checking is one such tool. It’s not the only tool out there for model checking, but a lot of people use it, and a lot of people (including me) think it makes a lot of sense. So it behooves you to learn it, and also to learn why people like me use this method, despite various criticisms.

So what I’m going to do is give you the internal view of posterior predictive checking. If you’re going to try this method out, you’re gonna want to know how it works and why it makes sense to people coming from my perspective.

I didn’t quite say it like that, and in fact I only thought about the connection to Kahan’s perspective later, but I think this is the right way to teach it, indeed the right way to teach any method. You don’t have to “believe” in posterior predictive checks to find them useful. Indeed, I don’t “believe” in lasso but I think it has a lot to offer.

The point here is partly to reach those students who might otherwise be resistant to this material, but also more generally to present ideas more conditionally, to give some separation between the internal logic of a method and its justification.
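For readers who have not seen one, here is a minimal sketch of what a posterior predictive check might look like in code. It is an illustration under stated assumptions, not the procedure from chapter 6: the data are simulated, the model (normal with known unit variance and a flat prior on the mean) is chosen only because its posterior has a closed form, and the test statistic (the sample maximum) is one arbitrary choice among many.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: mostly standard normal, plus one gross outlier
# that the assumed model should have trouble reproducing.
y = np.concatenate([rng.normal(0.0, 1.0, size=49), [8.0]])
n = y.size

# Assumed model: y_i ~ Normal(mu, 1) with a flat prior on mu,
# so the posterior is mu | y ~ Normal(mean(y), 1/n).
n_sims = 4000
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=n_sims)

# One replicated data set of size n for each posterior draw of mu.
y_rep = rng.normal(mu_draws[:, None], 1.0, size=(n_sims, n))

# Test statistic T(y) = max(y): compare the observed value
# with its distribution over the replicated data sets.
t_obs = y.max()
t_rep = y_rep.max(axis=1)
p_value = np.mean(t_rep >= t_obs)

print(f"T(y) = {t_obs:.2f}, posterior predictive p-value = {p_value:.4f}")
```

A p-value near 0 or 1 says the replicated data rarely look like the observed data in the respect measured by the test statistic; it points to an aspect of the model worth a closer look rather than delivering an accept/reject verdict.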

1. Jeremy Fox says:

Hmmm…does the analogy to teaching evolution to a fundamentalist really work here? In that case one really can just focus on the “internal logic” of evolution as a field of science and set to one side the quite distinct issue of its religious implications or lack thereof. In contrast, can different approaches to model checking be separated so neatly? Rather, aren’t they all different ways of trying to do the same thing–namely, check one’s model?

Of course, different model checking approaches surely have their own advantages and disadvantages, which might well be context-dependent. So that choosing which one to use comes down to making informed professional judgements on which reasonable people might disagree. And so that even if you choose method X, you can appreciate why others choose method Y (and as you say, you *do* want to appreciate why your fellow professionals make the choices they do, even if you make different ones). Indeed, it sounds like that’s exactly the way you teach model checking? But all that seems rather different than keeping each method separate from the others, deliberately avoiding any attempt to compare and contrast them.

Or maybe I’m misunderstanding, and you see different methods of model checking as trying to do quite different and perhaps incomparable things that all happen to come under the very broad heading of ‘model checking’? (Perhaps like how evolutionary biology and fundamentalism are trying to do quite different things that all happen to come under the very broad heading of ‘making sense of the world’?) So that one can and perhaps should focus on what each model checking method is trying to do and how it does it, without getting into the larger issue of which of those different things are “best” to do?

There’s also the question of whether the personal identities of statistics students are likely to be tied to their views on model checking, thereby possibly causing them to reject any information that seems like a threat to their identities. Your students’ questions sound to me more like questions about the relative advantages and disadvantages of different model checking approaches. Even if they’re initially skeptical of why one might want to do, say, posterior predictive checks, that seems like a different sort of thing than a fundamentalist’s resistance to anything that seems to threaten his or her personal identity.

• I’ve found lots of people hold onto the way they were taught how to do something very strongly, even in the halls of academe.

Personally, I was brought up by wolves, specifically machine learning researchers concentrating on natural language processing and speech. Wolf law dictates that the only acceptable form of fighting for pack dominance is with 0/1 loss on held out data. Cross-validation is an acceptable proxy if someone at some point didn’t lay down an “official” training/test split of the data. Wolf custom involves ignoring estimation uncertainty and prediction variance, and believing in the one true train/test split (what Tversky and Kahneman call the “law of small numbers”).

I was taught that you can’t do anything with the training data other than train (i.e., estimate) a model to evaluate based on some held out piece of test data. Anything else was cheating, and wolves don’t cheat.

Wolves have a laser-like focus on prediction. They have utter disregard for estimated parameter values. Maybe that’s because wolves cut their canines on language data and models, where it’s not uncommon to have 1M predictors for a classifier, or with 20K Gaussian mixture models for acoustic speech recognition, each with 40 dimensions and 16 mixture components.

As I grew older, I found myself living among sheep, specifically Bayesian statisticians and scientists in physical and biological and social sciences. Sheep are not afraid of navel gazing, that is, looking at how well the model fit the training data. In fact, it turns out sheep spend most of their days navel gazing.

How are the sheep ever going to lead a pack of wolves with that kind of behavior?

(I really should make this a separate post as comment bait.)

• Rahul says:

I kinda like the wolf attitude more than the sheep. It is hard to cheat or bluff or pretend or exaggerate when forced to demonstrate the raw strength of your approach by a clear-cut predictive test.

OTOH, the sheep thrive on discussing and convincing each other with wordy stories & explanations as to which are the better models. It’s often not the best model that wins but the best model story teller.

• On the yet-another-hand, if you are interested in the temperature distribution of the Corona of the sun, this is not a directly observable quantity, so we’re going to have to get some data (maybe spectral data or something) and then we’re going to have to fit a model to the data, and we’re not going to care even a LITTLE bit about predicting future data, we’re ONLY going to be interested in what we have learned about the parameter (Temperature and its distribution in the Corona).

a LOT of good science has this character.

• After some off-blog e-mail, I realized my little story never directly expressed my hypothesis that the hard line many machine learning researchers take on not using the training data for anything other than estimation may explain why Andrew gets resistance to posterior predictive checks in a class with a high concentration of machine learning students.

Please let me clarify that I don’t see anything wrong with “sheepish” (to extend my analogy) behavior (i.e., analyzing the training data and model fit on the training data without holding out test data). I’m usually more concerned with prediction, but evaluating how well a model fits the data used to estimate can be useful. It still drives most of the scientific applications of statistics, in fact. You don’t see CERN running held-out predictive tests to detect the Higgs boson. You don’t typically see bioinformaticians using held out data to test gene expression levels or do genome wide associations.

I also find that a lot of the machine learning folks cheat by their own standards. It’s absolutely ridiculous to me that every paper on part-of-speech tagging or parsing must train on a given section of the Penn Treebank and then test on another specified section. Eric Ringger sums it up nicely in a 2004 paper on parser evaluation:

The former has been divided by the parsing community into a standard training set (sections 02-21), a development test set (section 24), a blind test set (section 23), and some remainder sections. Charniak (2000), Collins (1997), and others use this standard division.

The problem is that this division of the data into training, held-out development, and testing folds was done way back in the 1990s and has been used ever since. It’s hardly like section 23 is “blind” any more. I rather suspect it’s been overfit, because when you run cross-validation on the sections, there’s almost an order of magnitude more variation across sections than there is improvement from the latest overfit model. Yet if you don’t use that split, your paper will be summarily rejected for improper methodology.

I actually suspect that many people choose as their optimization criterion the performance on the test set; it’s usually not very clear in the papers. I know that lots of people optimize cross-validation performance on their folds of choice and then report that without a nod to variability. You don’t need to just focus on the training data to detect fit variance, but it’s one thing the statisticians know how to do that the machine learning people seem to make a methodology out of getting wrong. (The other skeleton in the closet is that test items are rarely i.i.d., so many of the significance tests for improvements are grossly overestimated due to correlations among the test items that won’t hold up on out-of-sample data.)

• I think that the poor practices of the Natural Language Processing community are too much of a low-hanging fruit to attack. Are there any examples of machine learning fields that do overfitting on training data?

One could defend the NLP people by the way, by saying that their methods have worked out OK in the real world of applications (happy to be corrected, since Bob is the real expert there).

• As Andrew’s made a theme of in his blog of late, poor statistical practice permeates science. I don’t think NLP is unusual in this regard.

As far as general machine learning goes, the field has a focus on establishing benchmark problems and then comparing methods on them. A typical JMLR paper has the following form: problem description, new algorithm and convergence proof, experimental evaluation on benchmark data sets.

And just to be clear, I’m not saying what the machine learning people are doing is wrong; just that it’s limiting if used to the exclusion of other techniques. I like benchmark problems (when not overzealously overfit).

If you look at Aleks Jakulin’s paper with Andrew on default priors, they fruitfully combine a machine learning and stats perspective. It’s like Bayesian stats in its consideration of populations of problems and informative priors and it’s like machine learning in its evaluation on benchmark problems (lots of UCI classification problems). I’d say you see the same kind of synthesis in Aki Vehtari’s papers with Andrew on information criteria and cross validation (now folded into a chapter of Bayesian Data Analysis, 3rd edition).

I think there’s a lot to be learned at the intersection of the fields.

• Anonymous says:

Bob, this comment was inadvertently funny:

“Wolves have a laser-like focus on prediction. They have utter disregard for estimated parameter values.”

Jaynes in his book’s bibliography mentioned a paper by Efron on the bootstrap, jackknife and cross-validation and had this to say:

“Orthodox statisticians have continued trying to deal with problems of inference by inventing arbitrary ad hoc procedures instead of applying probability theory. Three recent examples are explained and advocated here. Of course, they all violate our desiderata of rationality and consistency; the reader will find it interesting and instructive to demonstrate this and compare their results with those of Bayesian alternatives”

So let me take Jaynes’s advice here. Let t be a parameter (whether we’re thinking of fitting this parameter inside a model or choosing a parameter which indexes a family of models is irrelevant). Let the data D be split by us into a “past” data set and a “future” data set, D = Dp, Df, for cross-validation purposes.

Then Bayes theorem gives:

P(t|D) = P(D|t)P(t)/P(D)

but using the division of D=Dp,Df this can be rewritten:

P(t|D) = P(Df|t)P(t|Dp)/P(Df|Dp)
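The step between the two forms can be spelled out as follows. This is a sketch that relies on one assumption the comment does not state explicitly: that the “future” data are conditionally independent of the “past” data given t, so that P(Df|t,Dp) = P(Df|t). Without that assumption the first factor in the numerator would have to stay as P(Df|t,Dp).

```latex
\begin{align*}
P(t \mid D_p, D_f)
  &= \frac{P(D_f \mid t, D_p)\, P(t \mid D_p)}{P(D_f \mid D_p)}
     && \text{(Bayes' theorem, conditioning throughout on } D_p\text{)} \\
  &= \frac{P(D_f \mid t)\, P(t \mid D_p)}{P(D_f \mid D_p)}
     && \text{(assuming } D_f \perp D_p \mid t\text{)}
\end{align*}
```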

From this we can see that fitting a model (i.e. finding a t that makes P(t|Dp) large) and then checking to see its predictive accuracy (i.e. seeing if P(Df|t) is large) is nothing other than A POOR MAN’S BAYESIAN POSTERIOR PARAMETER ESTIMATE (i.e. finding a t which makes P(t|D) large).

It’s left as an exercise for the reader to think up instances when these two approaches start to diverge and check to see that the Bayesian posterior is superior.

Looks like Jaynes and Bayes win again.

• Anonymous says:

Let me restate a little more clearly. According to Bayes Theorem, fitting/finding a model on a training data set and checking its predictive accuracy on an out of sample data set is basically the same as looking at the Bayesian posterior conditional on the combined training + out of sample data.
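The algebraic identity behind this restatement can be checked numerically on a grid. The sketch below is only an illustration under assumptions of my choosing (simulated Bernoulli data, a uniform prior on a grid of t values, an arbitrary 25/15 split, observations independent given t). It confirms that updating on the past and then reweighting by the likelihood of the future gives the same posterior as conditioning on everything at once; it does not by itself address whether that posterior answers the question cross-validation is usually asked.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Bernoulli data, split arbitrarily into "past" and "future".
data = rng.binomial(1, 0.3, size=40)
d_past, d_future = data[:25], data[25:]

# Grid over the parameter t with a uniform prior (assumed for illustration).
t = np.linspace(0.001, 0.999, 999)
prior = np.full(t.size, 1.0 / t.size)


def likelihood(d, t):
    """Bernoulli likelihood P(d | t) evaluated at every grid value of t."""
    k = d.sum()
    return t**k * (1.0 - t) ** (d.size - k)


def normalize(w):
    """Rescale nonnegative weights so they sum to one."""
    return w / w.sum()


# Route 1: condition on all the data at once, P(t | D).
post_all = normalize(likelihood(data, t) * prior)

# Route 2: P(t | Dp) first, then reweight by P(Df | t).
post_past = normalize(likelihood(d_past, t) * prior)
post_seq = normalize(likelihood(d_future, t) * post_past)

# The two routes should agree up to floating-point rounding.
print(np.allclose(post_all, post_seq))
```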

• D.O. says:

Eh, but the whole idea of training/test split is that you basically don’t care about fitting data well, never mind the parameters of the model. You want to know whether the method that you use works (that is, will fit future data well). No?

• Anonymous says:

Whether the method works well in your sense is basically something like P(t|D) having a sharply peaked distribution so that P(t|D) is very high for the t chosen. The Bayesian posterior is providing everything that the cross-validation provides plus a good deal more information. That’s in part why if you take cross-validation to extremes you get nonsense, but posterior still works, or at least warns you by spreading P(t|D) out that there’s a problem.

It’s irrelevant whether we’re considering “fitting” a model by finding a t within a given model, or “finding” a model by choosing a t each of which represents a different model.

• D.O. says:

I am not sure I understand you completely, but how do you know that your posterior distribution has a sharp peak if you don’t care about the parameters themselves? You might want to compare, say, a neural network model to some multiple regression. Or some other completely different approaches where you care about predictive power and nothing else. I personally find such questions uninteresting, but others do it for a living.

This comment confused me at first. It would have been clearer (to me) if you restricted to the case where t is a model index rather than a parameter in the usual sense and mentioned that you have in mind the case where cross-validation is used for model selection. That would make sense because, when t is a model index, P(t|D) really is the quantity of interest in a common type of cross-validation study (but not so much when t is an ordinary parameter).

The other use of cross-validation is to get an estimate of predictive performance (before rolling out the system for real-world use). I take it you are not criticising that use here (if you are, do you have an alternative to propose)?

• Andrew says:

Yes, in BDA we distinguish between inference, model checking, and estimation of predictive performance. These are 3 different tasks, and I think lots of confusion arises because people somehow think the same tool should work for all 3. This leads to abominations such as BIC.

• Anonymous says:

It’s not a question of how we’d like to think about it, or which way is less confusing, or trying to use one tool for everything. It’s not a question about us at all. It’s purely a question about where the mathematics leads.

Bayes theorem leads to the following when written in the form P(t|D) ∝ P(Df|t)P(t|Dp). Choosing or finding a good model warranted by a training set and then only accepting it if it tends to have good predictive performance out of sample is (basically/morally) equivalent to finding a t that makes P(t | training + out of sample data) high. Again: it’s irrelevant what any of us *think* about that. The numbers work out that way.

There is perhaps a lot more that needs to be said, but this is a blog comment, so I can’t explain everything. It requires some effort on the part of the reader to think through the equations and see their implications themselves. To take just two examples.

1) notice in the version of Bayes theorem given originally the division D=Dp, Df is arbitrary. Any division of the data will work in that statement. It doesn’t even have to be split into literal “past” and “future”. It somehow includes every “leave one data point out” for all data points.

2) P(t|D) contains in a sense all the information about the predictive accuracy for different divisions of the data D = Dp, Df. If Konrad wants an estimate of predictive performance (in the cross-validation sense) it should be derivable from P(t|D) (which is proportional to P(Df|t) for any Df after all).

So you all can think (or not) whatever you want. I was talking about Bayes theorem which still holds true regardless.

• Andrew says:

Anonymous:

We discuss this all in chapters 6-8 of BDA. In short: (a) there are aspects of the joint distribution (that is, the probability model) that are relevant for model checking but not for posterior inference conditional on the model, and (b) predictive performance is defined relative to particular predictions of interest, and, when comparing two models, model A can have higher marginal posterior probability while model B has better predictive performance.

More generally, all our models are wrong, yet all our theorems are conditional on our models. Which is one reason I am uncomfortable with methodological absolutism of any kind, whether it be the anti-Bayesian ideology I encountered at work in the early 1990s or various Bayesian attitudes I’ve seen over the years. I continue to stand by footnote 1 of this article.

• Anonymous says:

Once again this isn’t about our methodological absolutism or *us* in any way whatsoever. It’s about Bayes theorem which continues to hold true regardless of whether you want to look at all its implications or not.

Cross-validation is an intuitive ad-hoc device derived from nothing. It mimics what Bayes theorem is doing in some instances. There’s much more contained in Bayes theorem however than was seen intuitively by the inventors of cross-validation. So by studying Bayes theorem harder it’s an opportunity to educate and improve our intuition.

In particular, the division of our data into training and prediction is arbitrary. We only have one giant/combined data set. You can possibly have a t* which seems to be predictively accurate for one arbitrary division, but doesn’t make P(t* | all data) particularly high. But that division isn’t unique or god given. Even if you stick to a time ordering of the data there are many ways to divide the data. The value of t that makes P(t|all data) high is one that tends to work well across all those.

Faced with that observation you can do one of two things. Thank Bayes theorem for improving your intuition and helping understand the limits of cross-validation better. Or you can reject Bayes theorem.

P.S. that “all our models are wrong” is a crock. I have a model for your weight. I calculated the 99.9% Credibility interval from it and it said your weight was between 0 and 2000lbs. Is that model wrong? do you weigh more than a ton Andrew? If so, you’re a fatso.

Or better yet I have a family of models indexed by T from 0-10,000lbs each of which places your weight between (T, T+10 lbs). Are all those models wrong? Not one of them gets your weight right? If I use Bayesian model selection on that family am I hosed from the get-go because every one of those models is “wrong”?

• Andrew says:

Anonymous:

As we discuss in chapter 7 of BDA3, cross-validation can be seen as a method for estimating out-of-sample prediction error.

P.S. For convenience, I will restrict my “all models are wrong” statement to: “All models are wrong in the fields in which I have worked, including political science, sociology, laboratory assays, toxicology, and environmental risks and exposure.”

• Anonymous says:

I understood the purpose of cross-validation from the beginning. It changes (or contradicts) nothing I said.

That’s a reasonable restriction Andrew. Just be aware that when a carpenter builds a crooked house people don’t blame Euclid for dealing in un-realities. They blame the carpenter.

• Anonymous says:

I had no idea it would be this difficult to get Bayesians to look at the implications of their own equations.

• Anonymous: this stuff is what I miss from the loss of your blog.

Also: all models are wrong: one of your weight models surely accounts for the proper weight of Andrew, if only we had an actual precise description of what “Andrew” is (hint: at what point do the Oxygen molecules inhaled into his lungs actually become part of Andrew? when they diffuse into the bloodstream, when they enter the mouth? Also, how to account for perspiration, haircuts, and dandruff??)

;-)

Estimating predictive performance has nothing to do with P(t|D). The problem is to evaluate P(f(Df)|D) for a known function f() of some Df that has not been observed. (This is reminiscent of a frequentist analysis where one works with an expectation over imaginary data.) Given a model (or set of models with associated priors), Bayes tells us how to calculate this (in principle, not in practice). But the problem is that the answer will be _very_ sensitive to model misspecification: what we really want is to calculate P(f(Df)|D) _while taking account of the possibility that all of the models in the set may be poor_. I do not see how Bayes theorem provides a solution to this.

ps. All members of the family of models of Andrew’s weight are wrong because none of them account for the fact that his weight varies over time. (Presumably Andrew weighs more after dinner than he does before.) Of course you can fix this loophole, then we can just move on to another – there’s no end to such loopholes. The single broad model is also wrong because it does not accurately describe our knowledge of Andrew’s weight, assigning (I assume; you didn’t actually say what the model is) an overly high probability to the hypothesis that Andrew weighs over 1000lbs.

2. Rahul says:

If you are vague & hedge you will face less opposition than if you take up a firm position? :)

3. D.O. says:

I am all for “separation between the internal logic of a method and its justification”. Moreover, I am not sure how it can work otherwise. If I doubt the relevancy of calculating p-level for some particular application, should I do the calculation any differently? But I think, the approach “you should learn it because other smart people use it” is a good motivation, but poor justification. It won’t take a lot of time to explain in what situations the method you want to demonstrate is the best approach (and maybe remind the students of the Perl motto). I guess, by the time your students reach BDA class, they understand enough how science and college education work to believe that you are explaining your perspective to them (and teaching the skills) rather than trying to indoctrinate them.

4. Dustin Tran says:

As I am actually taking the Bayesian data analysis course at Harvard right now, I think the analogy to classic hypothesis testing won me over. Jun Liu jokingly described cross validation as the “computer science” way of thinking.

• Andrew says:

Dustin:

The funny thing is, when Xiao-Li, Hal, and I wrote our paper on posterior predictive checks (published in 1996, but the first version was written in 1993 or 1994), we thought of classical statisticians as being our main audience. From my perspective, the Bayesian p-value was a natural way to get a classical p-value in the setting where the point estimate was noisy and the test statistic was not based on a pivotal quantity. But we soon found that (a) the non-Bayesians typically had zero interest in anything that averaged over a prior distribution, and (b) Bayesians found the ideas very useful, in part I think because our framework gave them permission to check their models. So sometimes you just have to reach the audiences you can, and not worry about the rest.

• Keith O'Rourke says:

> (b) Bayesians found the ideas very useful
Some did and others varied from minorly to highly critical.

When you are teaching a not (currently) widely accepted approach, even if you think it should be, presenting it conditionally (while noting some lack of current wide acceptance) makes good sense. For instance, if they later run into the cross-fire (as I did once), they likely will understand the issues and realities better.

• @Dustin Tran: Much more economically put than my extended allegory (see above).

5. Giri says:

I am not sure I see posterior predictive checking and cross-validation as necessarily mutually exclusive.

Assessing the quality of a model by comparing its predictive implications against observed data seems in line with the “Statistics as Science” view I am fond of. However I am wary/skeptical about comparing data that was used to update the prior to draws from the posterior, since this seems inherently circular. (It seems like “double-dipping” the data set: once to fit the model, and again to validate it.) The apparent advantage of cross-validation is that a model is tested on data that was not used in fitting the model so this “double-dipping” is avoided.

How about a posterior predictive check embedded within a cross-validation approach: update the posterior using a subset of the data and then compare draws from the posterior predictive distribution to the remainder of the data? (And then iterate over partitions…)
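One way the suggestion above might be sketched in code is shown below. Everything specific here is an assumption made for illustration: the simulated data, the Normal(mu, 1) model with a flat prior on mu, the five random folds, and the choice of the held-out block's standard deviation as the test statistic. For each fold, the posterior is computed from the training part only, replicated held-out blocks are drawn from the resulting posterior predictive distribution, and the observed held-out statistic is compared with the replicated ones.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data; the Normal(mu, 1) model with a flat prior is assumed.
y = rng.normal(1.0, 1.0, size=60)
n_folds, n_sims = 5, 2000

# Random partition of the indices into folds.
folds = np.array_split(rng.permutation(y.size), n_folds)

p_values = []
for held_out in folds:
    train = np.setdiff1d(np.arange(y.size), held_out)
    y_train, y_test = y[train], y[held_out]

    # Posterior for mu given the training part only: Normal(mean, 1/n_train).
    mu = rng.normal(y_train.mean(), 1.0 / np.sqrt(y_train.size), size=n_sims)

    # Posterior predictive replicates of the held-out block.
    y_rep = rng.normal(mu[:, None], 1.0, size=(n_sims, y_test.size))

    # Test statistic: sample standard deviation of the held-out block.
    t_obs = y_test.std(ddof=1)
    t_rep = y_rep.std(axis=1, ddof=1)
    p_values.append(np.mean(t_rep >= t_obs))

# One held-out posterior predictive p-value per fold.
print(np.round(p_values, 3))
```

Because each block is compared with predictions from a fit that never saw it, this avoids the "double-dipping" concern, at the cost of fitting the model once per fold and of checking a model fit to less than the full data.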

• Andrew says:

Giri:

Posterior predictive checks are in chapter 6 of BDA3; cross-validation is in chapter 7. Both can be useful.

• Giri says:

Hi Andrew,

It looks like my previous suggestion was actually pointed out here by a professor in the comments, which I noticed the other day when I suggested that someone in the Data Science course at Harvard take a look. The point that had come up was whether Bayesians perform tests: it had been argued that testing was a “frequentist” concept when in fact this is not at all true, since a test can be one reasonable way to assess whether or not a probability model is any good, “frequentist” or not.