Yes, checking calibration of probability forecasts is part of Bayesian statistics. At the end of this post are three figures from Chapter 1 of Bayesian Data Analysis illustrating empirical evaluation of forecasts.

But first the background. Why am I bringing this up now? It’s because of something Larry Wasserman wrote the other day:

One of the striking facts about [baseball/political forecaster Nate Silver’s recent] book is the emphasis that Silver places on frequency calibration. . . . Have no doubt about it: Nate Silver is a frequentist. For example, he says:

One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated.

I had some discussion with Larry in the comments section of his blog and raised the following point: There is such a thing as Bayesian calibration of probability forecasts. If you are predicting a binary outcome y.new using a Bayesian prediction p.hat (where p.hat is the posterior expectation E(y.new|y)), then Bayesian calibration requires that E(y.new|p.hat) = p.hat for any p.hat. This isn’t the whole story (as always, calibration matters but so does precision).
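To make the condition E(y.new|p.hat) = p.hat concrete, here is a minimal simulation sketch. Everything specific in it is an assumption chosen for illustration: a uniform Beta(1, 1) prior on a success probability theta, 10 Bernoulli observations per forecast, and the resulting posterior mean as p.hat. When the model is true, the observed frequency of y.new within each group of identical forecasts should land close to the forecast itself.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sims = 200_000   # number of simulated forecasting problems (assumption)
n_obs = 10         # observations behind each forecast (assumption)

# Draw theta from the prior, then the data and the new outcome given theta.
theta = rng.uniform(0.0, 1.0, size=n_sims)   # Beta(1, 1) prior
successes = rng.binomial(n_obs, theta)       # sufficient statistic of y
y_new = rng.binomial(1, theta)               # outcome to be predicted

# Posterior mean under the Beta(1, 1) prior: E(y.new | y) = (s + 1) / (n + 2).
p_hat = (successes + 1) / (n_obs + 2)

# For each distinct forecast value, record the observed frequency of y.new.
calibration = {}
for s in range(n_obs + 1):
    mask = successes == s
    calibration[(s + 1) / (n_obs + 2)] = y_new[mask].mean()

for forecast, freq in calibration.items():
    print(f"p.hat = {forecast:.3f}   observed frequency = {freq:.3f}")
```

The check conditions only on the forecast, never on theta, which is the sense in which it is a data-based calibration.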

The last time I took (or taught) a theoretical statistics course was almost thirty years ago, but I recall frequentist coverage to be defined with the expectation taken conditional on the value of the unknown parameters theta in the model. The calibration Larry describes above (for another example, see here and scroll down) is unconditional on theta, thus Bayesian.

I haven’t read Nate’s book so I’m not sure what he does. But I expect his calibration is Bayesian. Just about any purely data-based calibration will be Bayesian, as we never know theta.

Larry responded in the comments. I don’t completely understand his reply, but I think he says that unconditional coverage calculations are frequentist also.

In that case, maybe we can divide up the coverage calculations as follows: Unconditional coverage (E(y.new) = E(p.hat)) is both a Bayesian and frequentist property. (For both modes of inference, unconditional coverage will occur if all the assumptions are true.) Coverage conditional on data (E(y.new|p.hat) = p.hat for any p.hat) is Bayesian. Nate is looking for Bayesian coverage. Coverage conditional on theta (E(y.new|theta) = E(p.hat|theta)) is frequentist.
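The difference between the unconditional and the theta-conditional versions can be checked with a few lines of arithmetic. The sketch below assumes a toy setup that is not from the discussion itself: s successes in n Bernoulli(theta) trials, a uniform prior on theta, the Bayesian forecast p.hat = (s + 1) / (n + 2), and the sample proportion s / n as a non-Bayesian comparison.

```python
import numpy as np

n_obs = 10  # hypothetical number of observations behind each forecast

def mean_bayes_forecast(theta, n):
    """E(p.hat | theta) when p.hat = (s + 1) / (n + 2) and s ~ Binomial(n, theta)."""
    return (n * theta + 1) / (n + 2)

def mean_mle_forecast(theta, n):
    """E(s / n | theta): the sample proportion is unbiased for theta."""
    return n * theta / n

# Conditional on theta, E(y.new | theta) = theta. The sample proportion matches
# it for every theta; the posterior mean is pulled toward 1/2.
for theta in (0.1, 0.5, 0.9):
    print(theta, mean_bayes_forecast(theta, n_obs), mean_mle_forecast(theta, n_obs))

# Unconditionally (averaging theta over the uniform prior on a midpoint grid),
# the Bayesian forecast and the outcome have the same mean.
grid = (np.arange(1000) + 0.5) / 1000
mean_outcome = grid.mean()                              # E(y.new)
mean_forecast = mean_bayes_forecast(grid, n_obs).mean() # E(p.hat)
print(mean_outcome, mean_forecast)
```

Under these assumptions the posterior mean satisfies the unconditional property but not the theta-conditional one, while the sample proportion satisfies the theta-conditional one.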

What is different about Bayesian inference? In Bayesian inference you make more assumptions and then can make more claims (hence the Bayesian quote, “With great power comes great responsibility”). Frequentists such as Larry are wary (perhaps justifiably so) of making too many assumptions. They’d rather develop methods with good average coverage properties under minimal assumptions. These statistical methods have a long tradition, have solved many important applied problems, and are the subject of research right now. I have no problem with these non-Bayesian methods even if I do not often use them myself. When it comes to frequency evaluation, the point is that Bayesian inference is supposed to be calibrated conditional on any aspect of the data.

To return to the title of this post, yes, checking calibration of probability forecasts is part of Bayesian statistics. We have two examples of this calibration in *the very first chapter* of Bayesian Data Analysis. Calibration is also a central topic in treatments of Bayesian decision theory such as the books by J. Q. Smith and Bob Clemen. I think it’s fair enough to agree with Larry that these are frequency calculations. But they are *Bayesian* frequency calculations by virtue of being conditional on data, not on unknown parameters. For Bayesians such as myself (and, perhaps, for the tens of thousands of readers of our book), probabilities are empirical quantities, to be measured, modeled, and evaluated in prediction. I think this should make Larry happy, that frequency evaluation (albeit conditional on y, not theta) is central to modern Bayesian statistics.

**Not f or b**

Thus, it’s not Frequentist *or* Bayesian. Frequency evaluations are (or should be) important for all statisticians but they can be done in different ways. I think Nate’s doing them in the Bayesian way but I’ll accept Larry’s statement that Nate and I and other applied Bayesians are frequentists too (of the sort that perform our frequency evaluations conditional on observed data rather than unknown parameters). And I do see the conceptual (and, at times, practical) appeal of frequentist methods that allow fewer probability statements but make correspondingly fewer assumptions, even if I don’t usually go that way myself.

**The calibrations from chapter 1**

Chapter 1 of Bayesian Data Analysis. You can’t get much more Bayesian than that.

Andrew

I agree with you on much of this, especially that both Bayesians and frequentists can (and should) be concerned about calibration.

A few questions.

1. You say: “But I expect his calibration is Bayesian.” Why do you think this?

2. “A 90 percent prediction interval, for instance, is supposed to cover 90 percent of the possible real-world outcomes,

If the economists’ forecasts were as accurate as they claimed, we’d expect the actual value for GDP to fall within their prediction interval nine times out of ten”

Don’t you agree that this sounds frequentist?

3. I don’t agree that the calibration I describe is averaging over theta. There is no theta here!

Every day it rains or doesn’t rain. We observe that. Every day the weatherman gives a number. We observe that.

No models, or parameters. We’re just comparing long run frequencies to numbers.

This is prediction not estimation.

4. The usual definition of frequentist coverage (for confidence intervals)

Coverage = infimum P_theta(theta in C(X))

for a set of distributions P_theta indexed by theta.

There is no conditioning involved.
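As a rough illustration of this definition, the sketch below evaluates P_theta(theta in C(X)) exactly for a binomial X and takes the minimum over a grid of theta values. The interval procedure (the textbook Wald interval), the sample size, and the grid are assumptions picked for the example, and a minimum over a finite grid only approximates the infimum.

```python
from math import comb, sqrt

def wald_interval(x, n, z=1.96):
    """Textbook Wald interval for a binomial proportion (illustrative choice of C(X))."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def coverage(theta, n, interval=wald_interval):
    """Exact P_theta(theta in C(X)) for X ~ Binomial(n, theta)."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n)
        if lo <= theta <= hi:
            total += comb(n, x) * theta**x * (1 - theta) ** (n - x)
    return total

n = 20                                      # hypothetical sample size
grid = [i / 1000 for i in range(1, 1000)]   # theta values strictly inside (0, 1)
coverages = [coverage(t, n) for t in grid]
worst = min(coverages)
worst_theta = grid[coverages.index(worst)]

print(f"coverage at theta = 0.5: {coverage(0.5, n):.3f}")
print(f"minimum over the grid: {worst:.3f} at theta = {worst_theta}")
```

Each coverage value is computed for one fixed theta, and the worst case is then taken over theta; no prior distribution enters anywhere.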

That’s all.

Thanks for the interesting post

Best wishes

Larry

Hi, Larry. In quick response:

1. I expect Nate’s calibration is Bayesian because it is conditional on the data used to make the predictions. I imagine, for example, that he would find it legitimate to examine his calibration just for southern states, or just for states that Obama won in 2008, or any other condition chosen without reference to the new data.

2. Yes, I agree this is frequentist. It’s also Bayesian. The concept of calibration is a frequentist concept that is important to Bayesians (as indicated by its prominent position in the very first chapter of Bayesian Data Analysis).

3. There are some thetas in Nate’s analysis, for example the adjustment factors for different survey organizations. In this case, however, I assume the thetas are estimated so accurately that we’re in the asymptotic scenario where the conditioning on these parameters doesn’t matter.

4. Thanks. I think it’s good that we can have these discussions in a better spirit than some of the statistical debates of the past decades!

Yes indeed. The more discussions the better.

Best wishes

Larry

I think Andrew has the most clear, sensible, and accurate comments on Larry’s blog.

It could be that Larry is also correct, but it’s harder for me to decipher. For example, I could use a clarification with respect to Larry’s second question above. Based on his response to a question I raised on his post on “What Is Bayesian/Frequentist Inference”, it seems like what precisely can be achieved by prediction depends on whether the method is Bayesian or Frequentist. The Bayesian’s prediction only passes this calibration assessment if it happens that his prior over the parameters was reasonably accurate, while the Frequentist achieves this (or at least, is supposed to) regardless of the value of the (fixed) parameter.

The ambiguous nature of the classification stems from the fact that Andrew is speaking of Gelman-Bayes, and Gelman-Bayes, at least at times, calls for frequentist “performance” or error probability assessments and checks. But it confuses the issue to mix Gelman-Bayes with what is meant in the Bayesian/frequentist discussion that Larry has in mind. If the underlying justifications are going to turn on frequentist performance or the like, then it is confusing to say it has Bayesian foundations, without qualifying that one is using terms in a special way. There may be technical methods from all kinds of sources that serve those ends. Standard frequentist statistical methods are expressly designed to do so, but methods from other schools could be found in fact to serve those ends, at least in various cases.

Mayo:

If you (or Larry) want to exclude calibration from the permissible set of Bayesian operations, you’ll have to not just exclude Nate, me, and everyone who’s used my book as a template. You’ll also have to exclude the entire Bayesian decision analysis literature, as I noted above, where ideas such as calibration and proper scoring rules are central to completing the circle of empirical evaluation of probabilities.

Larry, recall, defined Bayesian inference by its aim, which he describes as updating degrees of belief. That, I take it, is not your description.

I think the philosophy of statistics is much more productive when attention is placed on the state of the art (such as Gelman-Bayes), rather than on past (as in abandoned, or at least soon to be) schools of thought. If this removes some of the controversies from the discussion, that’s progress!

I am confused by point 3. Is Larry saying that weather prediction (and by implication Silver’s voting predictions) is done in a model-free way? Surely not?

Also, what is this distinction between prediction and estimation? As far as I can tell the distinction only exists if you accept a distinction between parameters and random variables. Which of course is not the case in a Bayesian context.

Konrad:

That’s right: in frequentist statistics the distinction between parameters and predictions is very important: All frequency properties are defined conditional on parameters, but not on predictive quantities. For Bayesian there is often a conceptual difference—parameters generalize to new contexts, predictions typically do not—but this difference just comes up in the way that these different quantities are used in a model, there is no formal distinction.

To me, the unity of parameters and predictive quantities is a strength of Bayesian statistics, a strength that shows up even more in hierarchical modeling, where frequentist notions can break down. For example, in the 8 schools, are the school effects “parameters” or “predictive quantities”? If the former, what happens if we only care about their distribution and not the individual schools? If the latter, what happens if we only have 3 schools? 2 schools? Frequentist inferences for the school effects can change dramatically (depending on what frequency principles are being applied) based on these choices.

On the other hand, to a frequentist, this distinction between parameters and predictive quantities (that is, between estimation and prediction) can be viewed as a strength in that it is another “degree of freedom” in inference and allows one to adapt general principles to specific cases. Where the Bayesian sees coherence, the frequentist sees a too-strong set of assumptions. And what the Bayesian perceives as incoherence, the frequentist sees as realism.

I actually have a full post on this in the queue.

Don’t you still need a distinction between parameters and hyperparameters? And how can you be sure the prior distribution of the parameters generalizes to different contexts? One can always take the generalization problem one level up as it were.

But instead you can draw the distinction between observables and non-observables! We estimate non-observables, but predict observables.

Having not read The Signal and the Noise, I have to say that in a forecasting setting, it seems to me that unless one is adhering to a doctrinaire de-Finettian line, the question of whether one’s methods are frequentist or Bayesian becomes extremely murky.

All this talk of calibration makes me think that Silver’s views and methods might be best categorized as prequential. As far as I understand it (not very far!), the prequential framework is based on the principle that any criterion by which a prediction method is judged ought to ignore everything other than the actual predictions issued and the actual outcomes realized. This principle permits some “frequentist” criteria for judging prediction methods but disallows others; and Bayesian prediction can be shown to optimize some prequentially permitted criteria but not others.

I think that a major problem here is that both terms “frequentist” and “Bayesian” are used by many people in a rather ambiguous fashion. I’d use “frequentism” in order to refer to an interpretation of probability that is based on long run relative frequencies and therefore locates the concept of probability in the outside world (this does not claim that “objective probabilities really exist” but rather that a frequentist will think about the world *as if* they did, knowing that this is an idealisation the usefulness of which has to be demonstrated). “The opposite side” (although there is more than one other side really) in the foundations of probability then is “epistemic probability”, which locates probabilities in human thinking (which is an idealisation, too).

I think that this is broadly consistent with Larry Wasserman’s view.

Many people discussing Bayesianism really discuss epistemic concepts of probability, according to this terminology. Somebody using Bayesian methods can indeed be a frequentist, although it is usually hard to give parameter distributions a frequentist meaning (note that the main example in Bayes’s original paper suggests such a meaning), and traditionally most Bayesians (de Finetti, Savage, Jaynes) have adhered to epistemic interpretations of probability.

Whether one’s probability interpretation should be epistemic or frequentist is a different discussion from whether one should rather use Neyman/Pearson/Fisher-type or Bayesian methods for inference in science, but it is not totally independent, because, as said before, a frequentist may find it hard to make sense of distributions on the parameter space.

A major problem for me is that in Bayesian journal papers (some books are a bit better), there is a tendency to avoid specifying what the prior is supposed to mean and it cannot be seen whether the probability concept used is rather frequentist, epistemic or mixed up, and consequently it is unclear how to interpret the results.

Well put Christian

No disagreement, but it’s worth taking it one step further. Sure Bayesian probabilities are epistemic, but they reference events or data in the real world. One example might be “Obama wins in 2012”. Another example might be “3 out of the next 5 coin flips will be heads”. If a Bayesian uses events like the latter example either as data inputs or as something to predict then there are necessarily going to be some similarities between their methods and those of Frequentists.

Even in those cases though, Frequentists and Bayesians aren’t really doing the same thing. A Frequentist will imagine that P(heads)=.5 for the probability of heads with a fair coin because that is approximately the percentage of times heads appears. A Bayesian will say, given an appropriate k1, that P(heads|k1)=.5. It’s possible though that for the exact same coin flip a physicist has taken some additional measurements and has a state of knowledge k2 about the initial conditions of the coin flip. In that case the physicist may say P(heads|k2)=.98. This probability is no longer equal to the long range percentage of heads, but that’s ok for a Bayesian. In a sense both P(heads|k1)=.5 and P(heads|k2)=.98 are correct, it’s just that the physicist’s answer is more useful for predicting the next coin flip. I stress here that the Frequentist, Bayesian, and Bayesian Physicist are all referencing the exact same physical event using the same fair coin.

I suppose a Frequentist would try to save themselves by saying that P(heads|k2)=.98 is the approximate frequency of heads whenever k2 is true. Maybe, maybe not. In general this won’t be true and in many cases it won’t even be possible to repeat k2 even in principle. In those cases a Frequentist will just give up and say Probability can’t be applied at all. But all this Frequentist rigmarole is completely unnecessary. P(heads|k2) = .98 is roughly saying “my knowledge k2 implies that the true initial conditions of the coin when tossed have been narrowed down to a set of initial conditions which have the property that 98% of them lead to heads”. This is a perfectly meaningful statement even if “k2” can never be repeated in practice or in principle. Although it’s difficult to explain in a blog comment, this is very much analogous to the situation faced by a Bayesian predicting the outcome of the 2012 presidential election.

Bottom line: If Bayesians use, or are trying to predict, frequencies, then there will necessarily be similarities between them and Frequentists. But they are not the same. What a Bayesian is doing is more general, flexible, and useful in more situations. If you disagree, then consider this: have you ever seen a Frequentist claim the probability of heads for a fair coin was something other than .5?

Which begs the question, if k2 never happened where did you get the knowledge to state P(h|k2)=0.98? Theory, perhaps, but even that is typically based on observation. I like Bayes but frequentists also know how to extrapolate.

What is the probability that aliens from Nebula 1523, should they exist, yawn? I see no reasons why both Bayesians and frequentists cannot answer this question given some assumptions.

In practice you would get it from partial measurements of the initial conditions. Those measurements would not however narrow the initial conditions down to an exact answer (i.e. you would know the initial point in phase space to a small subset, but that subset would still have more than one element). You then count the fraction of initial conditions that lead to heads (using the Classical Mechanics for rigid body motion).
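A toy version of this counting exercise might look like the sketch below. It is not the calculation the commenter has in mind: it assumes a heavily simplified toss (the coin starts heads up, spins about a horizontal axis at a constant angular velocity, and is caught at launch height with no bounce), and the two "states of knowledge" are made-up boxes of initial conditions.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2

def lands_heads(v, omega):
    """Simplified toss: launched heads up, caught at launch height after time 2v/g.

    Heads shows when the total rotation angle omega * t has positive cosine.
    """
    flight_time = 2.0 * v / G
    return np.cos(omega * flight_time) > 0.0

def fraction_heads(v_range, omega_range, points=400):
    """Fraction of an evenly spaced grid of initial conditions that lead to heads."""
    v = np.linspace(*v_range, points)
    omega = np.linspace(*omega_range, points)
    vv, ww = np.meshgrid(v, omega)
    return float(lands_heads(vv, ww).mean())

# Hypothetical state of knowledge k1: only a rough idea of the toss
# (upward speed in m/s, spin in rad/s).
wide = fraction_heads((1.5, 2.5), (200.0, 400.0))

# Hypothetical state of knowledge k2: both quantities measured to within 1%,
# centred on a toss that makes exactly 20 full turns.
v0 = 2.0
omega0 = 40.0 * np.pi * G / (2.0 * v0)
narrow = fraction_heads((0.99 * v0, 1.01 * v0), (0.99 * omega0, 1.01 * omega0))

print(f"fraction of heads given k1: {wide:.3f}")
print(f"fraction of heads given k2: {narrow:.3f}")
```

Under these assumptions the wide box gives a fraction close to one half, while the narrow box gives a fraction well away from one half, even though both describe the same toss.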

This kind of thing is far from hypothetical. Such measurements have been used in real life to predict sequences of cards from light shuffling and for beating roulette wheels.

I agree, my point was subjective probability does not typically arise from a tabula rasa. And if you can measure it and observe it you can count it. Frequentists ain’t so challenged.

Actually if a Frequentist thought hard about what I was saying they would find it meaningless. Indeed it has nothing at all to do with Frequencies and is perfectly meaningful for singular events which are never repeated. Nor is it in any way “subjective” if we really do know what subset the true initial conditions lie in.

Saying that 98% of the initial conditions leads to heads is NOT the same as saying “in a repeated experiment that 98% of the time you’d get heads”. The latter requires a much stronger assumption. In other analogous instances it wouldn’t be possible to repeat at all even in principle (as in the 2012 election example).

But the real point of this example was the following three things:

(1) probabilities depend on a state of knowledge (or model assumptions if you like),

(2) two different, but equally true, states of knowledge can lead to very different probability assignments,

(3) it’s quite possible for the best assignment (best in terms of predictive ability or in terms of which used more information), to be very different from actual long range frequencies.

Collectively these are a challenge for Frequentists.

Anonymous: A frequentist would want to estimate the probability of heads. It’s not frequentist to claim without observations that it’s 0.5, 0.98 or whatever. (Of course I can say that a *fair* coin has a probability of 0.5 of heads if that’s my *definition* of “fair”.) The frequentist has no problems with having a model that implies that there *is* a true probability for heads without having seen observations yet, and the frequentist is perfectly happy with some physical information that suggests that this may be about 0.98. However, the frequentist would still want to test what the physicist is saying (or compute confidence intervals etc.).

However, this is just the good old Bayesian vs. frequentist debate, isn’t it? It seems that Andrew is indeed a frequentist in the sense that he’d give the probabilities in the sampling model a frequentist meaning, not an epistemic one, despite being a Bayesian in terms of the methods that he likes to apply.

Christian,

As I wrote above, I think that the calibration I do is both Bayesian and frequentist. “Bayesian” and “frequentist” are not mutually exclusive categories. My calibration is Bayesian in that it is based on posterior probabilities averaging over the posterior distribution of the model conditional on observed data (rather than trying to get coverage conditional on unknown parameters as in non-Bayesian inference). My calibration is frequentist in that it averages over future replications.

Regarding the interpretation I give to probabilities, see Chapter 1 of BDA.

The true frequency of heads is .5. Given this knowledge and nothing else a Bayesian will say P(heads|k1)=.5 for the next coin flip. Frequentists will look at this and say “see Bayesians are secretly Frequentists”. Not so, because a more informed Bayesian will take additional measurements k2 and determine P(heads|k2)=.98 which is very different from .5 (the true value for a Frequentist).

The point is that a lightly informed Bayesian whose analytical inputs and outputs consist of information about real frequencies is bound to produce an analysis that resembles a Frequentist analysis to some extent. If they have other information, their analysis will no longer resemble a Frequentist’s.

How is this possible? Well consider the true sequence of coin flip outcomes O_1,…,O_n. This true sequence will have the property that there are about as many heads as there are tails. But this true sequence is only one in a set of possible outcomes with 2^n elements. Any probability distribution P on this 2^n set will be considered correct by a Bayesian if the true sequence lies in the high probability manifold of P. This is the essence of Gelman’s posterior probability checking (i.e. simulate sequences from P many times and see if the true sequence looks like the typical simulated sequence).
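The "simulate from P and compare" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure from Bayesian Data Analysis: P is taken to be independent fair flips, the "true" sequence is a made-up perfectly alternating one, and the two summary statistics are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_flips = 100      # length of the sequence (hypothetical)
n_sims = 5000      # number of sequences simulated from P (hypothetical)

def n_heads(seq):
    """Number of heads (coded as 1) in each sequence."""
    return seq.sum(axis=-1)

def n_switches(seq):
    """Number of places where consecutive flips differ."""
    return (np.diff(seq, axis=-1) != 0).sum(axis=-1)

def tail_fraction(observed, simulated, statistic):
    """Share of simulated sequences whose statistic is at least the observed one."""
    return float((statistic(simulated) >= statistic(observed)).mean())

# Candidate distribution P: independent fair flips, i.e. uniform on all 2^n sequences.
simulated = rng.integers(0, 2, size=(n_sims, n_flips))

# A made-up "true" sequence that alternates perfectly: T, H, T, H, ...
observed = np.arange(n_flips) % 2

p_heads = tail_fraction(observed, simulated, n_heads)
p_switches = tail_fraction(observed, simulated, n_switches)

print(f"share of simulations with at least as many heads:    {p_heads:.3f}")
print(f"share of simulations with at least as many switches: {p_switches:.3f}")
```

The alternating sequence looks typical when judged by its number of heads and wildly atypical when judged by its number of switches, so whether a sequence "looks like the typical simulated sequence" depends on which features are compared.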

But there are lots of ‘true’ probability distributions by this definition, which can differ considerably in the size of their high probability regions. The lightly informed Bayesian will use a uniform distribution and derive marginal distributions of the form P(heads|k1)=.5.

A more informed Bayesian will use a P which is sharply concentrated about the true sequence O_1,…,O_n (assuming their information is accurate). This distribution will have marginals P(heads|k2) that differ considerably from .5 for individual coin flips.

The Frequentist rejecting the Bayesian definition for a correct P will then jump through completely irrelevant, made up, hoops trying to interpret P itself as some kind of Frequency. If they can’t do so even with some asinine made up model (such as imagining multiple replications of our universe) they will simply deny that P is meaningful and proceed to invent some specialized ad-hoc method for doing the same thing.

Incidentally, for the 2012 election there were about 100m voters. Each “outcome” for the election can be thought of as a sequence of votes {obama, romney}. So any election model can be thought of as a probability distribution on a space of size 2^{10^8}. If you use the uniform distribution, Obama will have a 50% chance of winning and the high probability region will consist of the entire space.

Now at some point Nate had information which led him to believe the probability of an Obama win was 75%. So by how much did “Nate’s information” effectively shrink the high probability manifold? It turns out this can be calculated approximately pretty easily if you neglect the electoral college (answer: the effective size of Nate’s high probability manifold is about 5.7 million times smaller than for the uniform distribution). That’s a quantitative estimate for how much “information” Nate used.

Christian: I think it is a big mistake to equate frequentist statistical methods with the use of frequentist probability. (That is why I prefer to talk of methods of statistical inference based on sampling distributions (sampling theory), in talking of frequentist inference.) A Bayesian could call for direct frequentist assignments to hypotheses, as Reichenbach tried, and they would not be frequentist statisticians.

Frequentist statisticians use sampling distributions to obtain error probabilities (Larry, performance or coverage probs), and the point of frequentist sampling theory—as a theory of inference–is that probability qualifies inference by means of error probabilities. This is NOT true for frequentist Reichenbach.

In addition, merely using probability to model phenomena, games of chance or the like, is distinguishable from inference (either to models or parameters within them). Larry is talking about frequentist statistical inference (sampling theory or, as I prefer, error statistics), otherwise, just for one thing, he wouldn’t be distinguishing using “Bayes theorem” from doing Bayesian inference.

I also think it is a huge confusion to suppose that using probability for inference/knowledge/evidence—i.e., using it for epistemological ends—is tantamount to using an epistemic notion of probability as you are using it (a degree of belief).

Mayo and Andrew: No disagreement here. At best terminological: Mayo is of course right that frequentist probability and the frequentist statistical inference I assume she is talking about are not the same. I’d prefer to use a different term for the latter (probably “error statistics” works nicely), reserving the term “frequentist” for the former. As far as I got it from Larry’s posting, Nate Silver quite clearly used the former, but (as Andrew noted) not necessarily the latter.

I’ve been following this discussion on Larry and Andrew’s blogs with interest. I used to do probabilistic combinatorics research and float happily above these issues, content that probability is anything that satisfies the right axioms. But now that I’m doing health research, it seems much more important for me to pin down. I think I’m using Bayesian methods, but I think I’m very interested in calibration.

Can some of the experts involved in this thread help me understand how this connects to the calibration/reliability approach Dawid used in _The Well-Calibrated Bayesian_ (http://www.jstor.org/stable/2287720) and related work? My impression is that he provides a different (non-frequentist, non-Bayesian) approach that could be more compatible with the culture of machine learning.

Abraham

The field of calibration has changed a lot since the time when Phil Dawid wrote those papers.

I suggest doing a Google search on papers by Dean Foster, Jacob Abernethy, and Elad Hazan to get a modern view.

Larry

And, if you’re interested in a modern Bayesian view, I recommend Chapter 1 of Bayesian Data Analysis.
