Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4,” was just fine. In his view, “this is the very meaning of ‘confidence.’”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).

None of this is an argument over statistical *practice*. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any *particular* interval that you might see.
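To make the gap between B and C concrete, here is a minimal sketch using a standard textbook-style example that is not part of the discussion above; the model, the value of `theta`, and the function name are all assumptions chosen for illustration. Draw two observations from a uniform distribution on [θ − 0.5, θ + 0.5] and report the interval from the smaller to the larger. Under repeated sampling this interval covers θ half the time, so it is a valid 50% confidence interval. Once you look at the realized numbers, though, the width of the interval carries extra information.

```python
import random

def simulate(theta=10.0, n_reps=100_000, seed=1):
    """Two draws from Uniform(theta - 0.5, theta + 0.5); interval = [min, max]."""
    rng = random.Random(seed)
    covered = {"all": [], "wide": [], "narrow": []}
    for _ in range(n_reps):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        lo, hi = min(x1, x2), max(x1, x2)
        hit = lo < theta < hi
        covered["all"].append(hit)
        # Conditioning on what was actually observed: the realized width.
        covered["wide" if hi - lo > 0.5 else "narrow"].append(hit)
    # Fraction of intervals containing theta, overall and by realized width.
    return {k: sum(v) / len(v) for k, v in covered.items()}

coverage = simulate()
print(coverage)
```

Averaged over all samples, coverage is close to one half, as advertised. But every realized interval wider than 0.5 must contain θ, and the narrower ones contain it only about a third of the time. "50% confident" is a fair summary of the procedure and a poor summary of any particular interval you are holding.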

**P.S.** More here.

The true mean lies between 0.1 and 0.4 with either probability = 0 or probability = 1.0. To assign any possible value besides zero or unity to this probability is nonsense.

Bruce:

All sorts of nonsense can be useful. Imaginary numbers are nonsense—what could it possibly mean to be the square root of -1?—but they can help solve lots of real problems. In psychometrics we speak of “ability” parameters: that’s nonsense too but it’s useful nonsense. Constructs such as quarks and electrons can be considered as latent constructs as well. I can talk about Pr(Spurs win the 2017 championship) and that’s a perfectly reasonable prediction. After the playoffs are over, this probability becomes 0 or 1—but only if I know the outcome. In the example of the “true mean,” the probability similarly becomes 0 or 1 once the true value is revealed. But that may never happen. I don’t want to live in a world in which probabilistic forecasts, or imaginary numbers, or ability parameters are disallowed because of some philosophical objection.

I think you’re both talking about the same thing. Try out the following re-statement of the issue:

A. The pre-experimental probability that the confidence interval will contain the true parameter value is 95%.

B. The pre-experimental probability that [0.1, 0.4] will contain the true parameter value is 95%.

Statement A is the usual statement about confidence intervals and expresses the probability that a given experiment (before it is run) will generate a confidence interval that contains the true value. Statement B is false. [0.1, 0.4] contains the true parameter value with probability 0 or 1. It’s either in the interval or it’s not. We just don’t know what state is true (I feel like there should be a Schrödinger’s cat analogy here…for another cat reference!).

+1 to Jim — a crisp way to clarify a common confusion.

Andrew,

I used the word “probability” in my statement, not confidence, and perhaps I should have clarified that my position (in this case) is frequentist and I am necessarily invoking the law of the excluded middle; either the true mean is in the interval or it is not.

Imaginary numbers are not nonsense. Think of phase shifts. For more on the not-nonsense of imaginary numbers, Paul Nahin’s fine book is a great read. :)

Regards,

Bruce

Bruce:

Imaginary numbers are mathematical constructs that allow us to solve real problems. Similarly, probability statements for parameter values are mathematical constructs that allow us to solve real problems. As Efron/Morris, Rubin, Agresti/Coull, and many others have demonstrated, Bayesian inference has been used to develop statistical methods with good frequency properties.

Somebody, I can’t remember who, once said something to the effect of “God created the natural numbers; man invented the rest.”

I think it is fair to describe the history of mathematics as the incremental creation of abstract nonsense that solves problems that are otherwise unsolvable. Negative numbers and fractions are “nonsense,” but they were constructed to extend the basic properties of natural numbers and permit the solution of equations that have none within the natural numbers themselves. And on and on it goes.

It was Kronecker who said “The integers alone are created by God; all else is the work of Man,”

Man invented the pie, but pi is the work of God.

(actually, more likely woman)

Bruce, you might want to look into this whole “Bayesian probability and statistics” thing…

Ouch. :) Above I confess to the error of not clearly asserting a frequentist posture _for this example_ !

Bruce: Your statement is obviously the standard frequentist interpretation. That being said, we often treat non-random events as being random. For instance, when tossing a coin I can say that the probability of heads, ex ante, is equal to 1/2. But of course, given the location of the coin in my hand, the force I exert on the coin, etc., the outcome is possibly deterministic [insert long-winded discussion on quantum mechanics here]. As such, even before tossing the coin, the outcome is not random. But for practical purposes we can treat it as random, not because it is necessarily true but because it is useful.

In the same way, even though the population parameter is fixed, I am lacking some information which means that I may treat the parameter as random. And it is not clear to me that it is not useful to do so even if I am using frequentist methods. Because as a researcher, I am not interested in what would happen if I continually sample in different realities. I am only interested in the population parameter for the only sample I have available. And it seems important that I should be able to express my uncertainty about the parameter given my sample.

When I was teaching I had a freshman/sophomore honors college course, taught as a seminar, and on the first or second day I did the following experiment with the class:

I drew a coin from my pocket and asked, “what’s the probability that it will come up heads when I flip it?” The class discusses this and everyone says 50%.

I flip the coin so that it falls on the floor and before seeing what came up I put my foot on the coin. So none of us knows how the coin came up. I ask again, “what’s the probability that the coin is showing heads?” More discussion. Most of the class will say 50%, but a few (usually students who have taken AP stats in high school) will say it’s either 0 or 1 but I don’t know which. This shows a divergence between a Bayesian and a frequentist interpretation of probability, which I exploit briefly by asking about betting on whether it’s heads or tails…even the ones who said that the probability is 0 or 1 but I don’t know which are still willing to take an even-money bet on its being heads.

I then (out of sight of the students, the coin is on the floor and they can’t see it) peek at the coin and determine how it came up. I then ask again, what’s the probability that it’s heads? Again a dichotomy, although not everyone who said 50% after the second round is sure that that’s the right answer now. But again, everyone is willing to take the even-money bet. (You can always allow them to bet on tails at even money as well, of course).

I then tell them (truthfully) what I saw, and I ask again, what’s the probability that it’s heads. This causes something of a conundrum since the class now has to guess whether I’m telling the truth or not. Regardless of whether I’ve seen heads or tails, few are confident enough to say 0% or 100% probability (whichever is appropriate), but no one is willing to say 50% either! Nor is anyone willing to offer or take an even-money bet.

I then invite one student to look at the coin and say what that student saw. With only one exception, the student told the truth and agreed with me, and then most of the class (not all!) were willing to go with 0% or 100%, though some were still cautious and only moved their estimate in the appropriate direction.

I do this at the end of the class, and the students are invited to look at the coin themselves as they leave.

As I said, there was one exception, a student who said that the coin was showing the opposite of what I had said and what was in fact showing. This happened the very first time I gave this class, and I was in fact very pleased that this student did this because the discussion that ensued was very interesting. That student went on to turn down a Rhodes fellowship, to study stats at Cambridge under a Marshall scholarship, earned a PhD in stats here in the states, and was awarded tenure a few years ago. I am enormously proud of him.

In any case, this experiment gives students a sense of the distinction between Bayesian and frequentist interpretations of probability, and allows me to start out this course on Bayesian decision theory with a concrete example that helps them sort it all out. Discussing probability in terms of bets on this real-life example also introduces that approach to defining probability. [Note that this was a non-calculus course that has been taken by students with all sorts of majors…most of them are not mathematicians, there are usually a few pre-med students and since some of the examples I give later in the course are from medicine they find useful things there…also pre-law, but many other majors as well. I even had one dance major early on, and she did just fine.]

No, this is absolutely not true. The true mean either lies between 0.1 and 0.4 or it doesn’t. This is a true fact. But *probability* is a statement about a *state of information*. So, if your information is “the true mean is .25” then “the probability that the true mean lies between 0.1 and 0.4” is exactly 1 under this state of information.

On the other hand, if your state of information is “the true mean definitely lies between 0 and 0.5 and this is all I know” then you can construct a probability distribution uniform(0,0.5) for the true mean, and “the true mean lies between 0.1 and 0.4” has probability 0.3/0.5 under *this* state of information.
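The two states of information above can be written out as a tiny calculation. This is only a sketch; the helper name `uniform_interval_probability` is made up for illustration.

```python
from fractions import Fraction

def uniform_interval_probability(a, b, low, high):
    """P(a < X < b) when X ~ Uniform(low, high)."""
    overlap = max(Fraction(0), min(b, high) - max(a, low))
    return overlap / (high - low)

# State of information 1: the mean is somewhere in [0, 0.5], and that is all we know.
p_vague = uniform_interval_probability(
    Fraction(1, 10), Fraction(4, 10), Fraction(0), Fraction(1, 2)
)

# State of information 2: the mean is known to be 0.25, so the statement is simply true.
p_known = 1 if Fraction(1, 10) < Fraction(1, 4) < Fraction(4, 10) else 0

print(p_vague, p_known)
```

Same fixed true mean, same interval, different probabilities, because the probability describes the information and not the mean.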

In fact, Bayes can’t tell you how to construct the probability distributions that represent your initial state of information, it only tells you how to update them to condition on additional information afterwards.

I’ve never quite understood frequentist confidence intervals, which is one reason I aspire to be Bayesian. So I may be wrong on this, but my understanding of, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” is that we really mean (being more explicit): “Under repeated sampling, the true mean will be inside each sample’s confidence interval 95% of the time”. If that is correct, that adds an additional level of confusion when people look at _a_ CI and think of it as _the_ CI.

Assuming random sampling and that any distributional assumptions are met, for 95% of the samples the generated confidence interval will include the true value.

And using scary voice in my class I say … and you cannot know if you have one of the “good” samples or not.

You just know that overall 95% of all possible samples have confidence intervals containing the “true value” and 5% do not.

I like the _a_ versus _the_ distinction. We have _the_ CI from our specific sample, but it is just _a_ CI among the millions of possible samples.

If Data(b,i) is a function that constructs a sequence of samples of data indexed by i using the actual random process assumed by the CI function, and [a,c] = CI(D) is a function that constructs a 95% confidence interval for b from a data set D, and Contains([a,c],b) is a function whose value is 1 when a < b < c and 0 otherwise, then it is a true statement that

lim_N_goes_inf( sum_over_i(Contains(CI(Data(b,i)),b))/N ) = 0.95

this is the precise mathematical meaning of coverage. Note that for every Data(b,i) there is a DIFFERENT CI that is constructed.
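The definition above maps almost directly onto code. What follows is a rough sketch, not anyone's actual analysis: it assumes a normal model with known standard deviation, uses the sample index `i` as a random seed purely so that each `i` yields a different sample, and the lowercase names `data`, `ci`, and `contains` are stand-ins for the Data, CI, and Contains functions described above.

```python
import math
import random
import statistics

def data(b, i, n=30, sigma=1.0):
    """Stand-in for Data(b, i): the i-th sample of n normal draws with true mean b."""
    rng = random.Random(i)  # seeding by i gives a different sample for each index
    return [rng.gauss(b, sigma) for _ in range(n)]

def ci(d, sigma=1.0, z=1.959964):
    """95% interval for the mean, assuming a normal model with known sigma."""
    m = statistics.fmean(d)
    half = z * sigma / math.sqrt(len(d))
    return m - half, m + half

def contains(interval, b):
    a, c = interval
    return 1 if a < b < c else 0

def coverage(b, n_samples):
    """Finite-N version of the limit: the share of intervals that contain b."""
    return sum(contains(ci(data(b, i)), b) for i in range(n_samples)) / n_samples

b = 0.25
for n_samples in (100, 1_000, 10_000):
    print(n_samples, coverage(b, n_samples))

# Two different samples give two different intervals.
print(ci(data(b, 0)), ci(data(b, 1)))
```

The printed proportions should settle near 0.95 as the number of samples grows, while the last line shows that the interval itself changes from sample to sample.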

Also note, unless you really are sampling from a large fixed population by use of a random number generator, the assumption that “Data(b,i) constructs a sequence of samples of data… using the actual random process assumed by the CI function” is false, and so the whole conditional statement has the logical structure

if FALSE then SOME STATEMENT

which is conventionally taken to be a true statement regardless of whether SOME STATEMENT is true or false.

It therefore offers basically no information about SOME STATEMENT.

:-)

OK but the other implicit assumption here is no systematic error.

Good thing there is seldom a lack of awareness and/or agreement on the implicit assumptions ;-)

I don’t use a scary voice, but put things in a “good news, bad news” framework: The good news: we can do this procedure which (assuming model assumptions fit) does what we’d like for 95% of suitable (as specified by model assumptions) samples.

The bad news: We don’t know whether or not the sample we have is one of the 95% for which the procedure works (i.e., gives an interval containing the true parameter), or one of the 5% for which it doesn’t.

Are confidence intervals and “reasonable probabilities” the same in election models?

http://election.princeton.edu/2016/11/06/is-99-a-reasonable-probability/

I have to say I am getting tired of what I think is an unproductive discussion about the technical meaning of a confidence interval. Sure, statement C is not correct, but A and B are not particularly useful. Generally you have one sample, and while you do not know if you have a “good one” or not, you can say that the probability is 95% that you have one of the good ones. I am well aware of the technical definition of a confidence interval (repeated sampling) and I am also aware that the true parameter either is, or is not, in the confidence interval (probability = 0 or 1), but I don’t find that distinction particularly useful in practice.

I also know that particular cases can be constructed where the casual (and wrong) interpretation “I am 95% confident that the true mean lies in the interval” can go awry. But those cases are unusual and I think it does a disservice to let particular unusual cases undermine any possible practical use of a confidence interval.

It also seems to me that Bayesians (I can’t claim to be one since I was not trained that way, but I do believe that is the right approach) unduly emphasize the incorrect interpretations of a confidence interval to make the point that we should all be Bayesian, not frequentist.

Andrew, your response to Bruce above seems to contradict this post. You can’t have it both ways – tell me that a confidence interval from one particular sample should not be interpreted probabilistically, and at the same time, argue that such incorrect interpretations can be useful. I tend to agree with the latter statement. I think it is more productive to incorrectly interpret the confidence interval (version C), and then emphasize the shortcomings of failing to use any prior information. I think that provides a somewhat useful statement about what the sample does tell you, while conveying how much is lost when no prior information is used.

I hope I have not rambled too much – but I am not finding the technical objections to version C to be at all productive. In the absence of repeated samples, and in the absence of a properly specified prior, C describes a practical and useful interpretation of what you can say from the one sample you have. I believe that is better than nothing, which is what A gives you.

To me the relevant distinction is that we commonly have external information to judge whether we have a ‘good’ sample (ie what we already know about the quantity of interest). In Bayesian analysis that info goes into the prior, but even if one is being a frequentist that doesn’t mean it should be ignored. So even though this can be seen as a technical distinction it is tied fairly closely to some of the arguments for a Bayesian approach.

If one doesn’t understand under what conditions a confidence procedure might emit nonsense, how can one feel comfortable using *any* confidence procedure?

+1

“I am 95% confident that the true mean lies in the interval” … “practical use.”

I can’t practically use a confidence interval without some sort of semantics, even if informal. There’s a number (.95) for _something_. Frequentists would insist it’s not a probability (“that the true mean lies in the interval”) since that’s meaningless. Bayesians might say that this has meaning, but the confidence interval approach is in no way trying to estimate probabilities of truth (and the “particular cases” you malign just show that confidence intervals aren’t able to do this sensibly, which is fair since that’s not what they are even trying to do).

So I’m left with “confidence”. It’s not probability, and it’s not fair to impute some technical definition to the normal English word “confidence” just because that’s what some early statistician pulled up when naming the concept. No one thinks it makes sense as a probability. So how should I interpret it? I mean, practically, and granting maximum generosity to practicality over formality? Something that is useful … maybe how it influences a decision I make or a belief I form?

“Confidence” seems like a really poor choice of word to use to label the concept. But I don’t know if I can think of a better one. Even though I generally don’t like using someone’s name to label something, that would be better than “confidence” or some other ordinary word that is likely to promote misinterpretations.

Maybe “reliability interval” would be better? “95% reliability” meaning it does what is intended 95% of possible times it could be used? (Still requires explanation — but not as bad as “confidence interval”.)

“Realized CI” is terminology that is sometimes used to distinguish CIs calculated using sample data from unrealized CIs, i.e. the CI procedure. I think it’s pretty clear.

And (for “standard cases” at least), realized CIs do have a formal interpretation – see the paper by Mueller & Norets and the Casella paper on conditional inference, both cited below. It’s rather different from the interpretation of an unrealized CI (coverage, repeated samples, etc.) but it’s still legitimate.

In other words, the mistake isn’t giving realized CIs an interpretation when none is possible. It’s using the wrong interpretation when a legitimate one is possible.

What I would really like to have is something I can point to that gets across the intuition of how to interpret a realized CI using this formal literature but in a simple and accessible way.

Here’s a silly but clear procedure I ran across somewhere (can’t recall where offhand, sorry) that is guaranteed to give you a 95% CI with correct coverage. It also nicely illustrates the difference between CIs and realized CIs:

(a) 95% of the time, you say your CI is the entire real line. (b) 5% of the time, refuse to guess.

Voila! Coverage is 95% because 95% of the time your CI will include the true value.

Pretty silly but makes the point. The procedure gives you 95% coverage, as a standard frequentist CI should. But the realized CIs in this case tell you nothing. And if your realised CI happens to be (b), it’s guaranteed to be wrong even though the CI procedure is “right” (correct coverage etc).
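For anyone who wants to see it run, here is a minimal sketch of that silly procedure. It uses `None` to stand for "refuse to guess" (treated as an empty set), and the true value 0.25 is arbitrary, since the procedure never looks at any data.

```python
import math
import random

def silly_ci(rng):
    """Return the whole real line 95% of the time, otherwise no interval at all."""
    if rng.random() < 0.95:
        return (-math.inf, math.inf)
    return None  # "refuse to guess": an empty set, which cannot contain anything

def covers(interval, true_value):
    return interval is not None and interval[0] < true_value < interval[1]

rng = random.Random(0)
true_value = 0.25  # arbitrary; the procedure ignores it
intervals = [silly_ci(rng) for _ in range(100_000)]
coverage = sum(covers(iv, true_value) for iv in intervals) / len(intervals)
print(coverage)
```

Long-run coverage comes out near 0.95, yet each realized interval is either certain to be right or certain to be wrong, and you can tell which just by looking at it.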

I think the relevant concept used here is “bet-proof” CIs. Will look for the ref and post separately.

— J. Neyman

Corey, you go to the source of the confusion. Neyman and Fisher had a few exchanges on this topic and they never made any headway towards convincing each other of their respective cases. What if we put the notion of intervals aside for the moment and looked at where the underlying problem lies?

In general if I am confident in a statement it might be because I am confident that the statement is true (i.e. I have a high partial belief in the statement). While that kind of confidence is natural, it is not the type of confidence engendered in a Neyman confidence interval. Neyman’s confidence has a less direct source. Consider that I might say I’m confident in a statement because I trust the person or system that stated the statement. I could be entirely agnostic about the substance of the statement itself, but still truthfully say I have confidence in the statement. That’s Neyman’s confidence because the statistical method is the stater of the statement. Some might feel that Neyman’s confidence is unnatural and unconvincing (I occasionally feel that way) but it is the only type of confidence that fits within the frequentist framework where probabilities are long run frequencies rather than states of partial belief.

Fisher was accused of trying to make a Bayesian omelet without breaking Bayesian eggs when he was pushing his fiducial probability, and there seems to be truth in that. However, if we want confidence that relates to the normal high partial belief then we have to use probabilities on the partial belief scale rather than the long run frequency scale. Fisher’s abhorrence of priors might have led him to be unable to accept that he was using a belief scale of probability.

All of that means that it may be impossible to clarify the nature of the arguments about the meaning of confidence intervals without dealing with the two versions of probability. I will provocatively suggest that a contributory reason for the lack of clarity in discussions such as this is that frequentists are reluctant to expose the fact that their preferred probability scale is not the most natural one to use in this situation.

I think the issue is more abuse of terminology. The confidence interval properly defined is a random interval. The one calculated from the sample is more of a simulation of the confidence interval but for some reason we called them CIs anyway.

Peter, I agree. It is unfair on non-expert users of statistics who are expert users of English to regularly deal with the subtle distinction between ‘confidence’ followed by ‘interval’ and ‘confidence’ followed by any other word. However, I’m not sure that simply calling Neyman’s confidence intervals “random intervals” is a good enough fix. What is the meaning of “random” in that context?

Rather than assuming we can fix the confusion by changing names of statistical objects, we need to find where the underlying difficulties lie and deal with them.

Agreed, the confusion occurred despite concepts being well defined.

In my opinion much of this has to do with the way statistics was taught. Due to practical constraints, statistics is assessed in an exam environment where there is only one correct answer (or a limited number of talking points each worth partial credit). The best way to learn statistics is to defend statistical analyses and realise on your own that there is a weakness to every method. All this talk of programming and pretty visualisations is just a learning distraction (but important when you actually have to do statistics!).

+1

Andrew,

Thanks for this post. I think it provides probably the most accessible and straightforward (i.e., easiest to understand for non-statisticians) explanation of why the common interpretation of confidence intervals is incorrect. I will be using it with my students.

+1

I do believe there is a need for clearer communication about collectives as opposed to instances, as well as the implicit given that the model’s assumptions are in fact not too wrong.

To flesh this out, think of A & B versus C in terms of the classic example of “empty gasoline drums”

“Thus, around a storage of what are called “gasoline drums,” behavior will tend to a certain type, that is, great care will be exercised; while around a storage of what are called “empty gasoline drums” it will tend to be different-careless, with little repression of smoking or of tossing cigarette stubs about. Yet the “empty” drums are perhaps the more dangerous, since they contain explosive vapor.” http://web.stanford.edu/dept/SUL/library/extra4/sloan/mousesite/Secondary/Whorfframe2.html

Now chemists, fire fighters, etc. will well understand the extra risk of gas vapors and not make this mistake.

Similarly, those with adequate training and experience in statistics won’t actually misinterpret C.

But most users of statistics?

And many users of statistics will misinterpret B (as for instance naive users of Bayesian posteriors failing to realize the importance of prior and data generating assumptions to make the probability relevant in any way).

Found the reference. Realized CIs in “standard problems” have an interpretation in terms of “bet-proofedness” and Bayesian credible sets. The paper is this one (just published in Econometrics):

Credibility of Confidence Sets in Nonstandard Econometric Problems

Ulrich K. Mueller and Andriy Norets (2016)

https://www.princeton.edu/~umueller/cred.pdf

http://onlinelibrary.wiley.com/doi/10.3982/ECTA14023/abstract

And here are some extracts from p. 3. Interesting stuff.

“Following Buehler (1959) and Robinson (1977), we consider a formalization of “reasonableness” of a confidence set by a betting scheme: Suppose an inspector does not know the true value of θ either, but sees the data and the confidence set of level 1 − α. For any realization, the inspector can choose to object to the confidence set by claiming that she does not believe that the true value of θ is contained in the set. Suppose a correct objection yields her a payoff of unity, while she loses α/(1 − α) for a mistaken objection, so that the odds correspond to the level of the confidence interval. Is it possible for the inspector to be right on average with her objections no matter what the true parameter is, that is, can she generate positive expected payoffs uniformly over the parameter space? … The possibility of uniformly positive expected winnings may thus usefully serve as a formal indicator for the “reasonableness” of confidence sets.”

“The analysis of set estimators via betting schemes, and the closely related notion of a relevant or recognizable subset, goes back to Fisher (1956), Buehler (1959), Wallace (1959), Cornfield (1969), Pierce (1973), and Robinson (1977). The main result of this literature is that a set is “reasonable” or bet-proof (uniformly positive expected winnings are impossible) if and only if it is a superset of a Bayesian credible set with respect to some prior. In the standard problem of inference about an unrestricted mean of a normal variate with known variance, which arises as the limiting problem in well behaved parametric models, the usual interval can hence be shown to be bet-proof. In non-standard problems, however, whether a given set is bet-proof is usually far from clear and the literature referenced above provides little guidance beyond several specific examples.”

“Since much recent econometric research has been dedicated to the derivation of inference in non-standard problems, it is important to develop a practical framework to analyze the bet-proofness of set estimators in these settings. We develop a set of theoretical results and numerical algorithms to address this problem.”
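To get a feel for the betting scheme in the first extract, here is a rough simulation. The example is not taken from the paper; it is an assumed toy problem: two draws from a uniform distribution on [θ − 0.5, θ + 0.5], with the interval from the smaller to the larger draw, which has 50% coverage, so α = 0.5 and a mistaken objection costs α/(1 − α) = 1. The inspector's rule (object whenever the interval is narrower than a threshold of 1/3) is likewise a hypothetical choice, and it uses only the data, never θ.

```python
import random

def inspector_mean_payoff(theta, alpha=0.5, threshold=1 / 3, n_reps=100_000, seed=2):
    """Average winnings per round for an inspector who objects to narrow intervals."""
    rng = random.Random(seed)
    loss = alpha / (1 - alpha)  # amount lost on a mistaken objection
    total = 0.0
    for _ in range(n_reps):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        lo, hi = min(x1, x2), max(x1, x2)
        if hi - lo < threshold:  # the decision to object depends only on the data
            total += -loss if lo < theta < hi else 1.0
    return total / n_reps

# Try several true values; the inspector never gets to see them.
payoffs = {theta: inspector_mean_payoff(theta) for theta in (-3.0, 0.0, 7.5)}
print(payoffs)
```

In this toy problem the average payoff is positive for every θ tried, which is what it means for an interval procedure to fail the bet-proofness check despite having correct coverage.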

Apologies: it was just published in Econometrica. Grrr… blinkin’ autocorrect.

I found this one very helpful:

Conditional Inference from Confidence Sets

George Casella, Cornell University

https://projecteuclid.org/download/pdf_1/euclid.lnms/1215458835

“Although it might be argued that searching for relevant sets is an occupation only for the theoretical statistician, we must remember that practitioners are going to make conditional (post-data) inferences. Thus, we must be able to assure the user that any inference made, either pre-data or post-data, possesses some definite measure of validity.”

This is a great post.

However, if we’re going to allow parts of statements to be “implicit” (as in B), then C can be okay:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4,”

where it’s understood that these are the realized values of the confidence interval for the particular data we have, and that the 95% confidence refers to coverage over repeated application of the confidence interval to similarly generated data. Whether or not one agrees with Russ Lyons depends on what we allow to be implicit, or not. This is not a great state of affairs, but other than giving up on frequentism, or somehow imposing a “correct” definition, I don’t think there’s a good way round it.

(One could similarly complain about Bayesian statements where the prior is not appropriately acknowledged.)

I am jumping in here without being sure that (a) I’m anywhere near correct or (b) I’m commenting in the right place. I’m neither a Bayesian nor a frequentist; I’m at best an infrequentist.

That said, I think the whole business of “repeated sampling” needs clarification (maybe only for me).

“We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” What does “95% confident” mean, though, in relation to the repeated samplings?

As I understand it, with the repeated samplings, the percentage of the times that the true value lies within the confidence interval *tends* toward 95%. So you would expect the true value to lie outside the confidence interval for only 5 out of every hundred samplings–but this estimate becomes more accurate as the number of samplings approaches infinity.

So in a sense you’re never 95% confident of anything specific. You’re only confident that as your number of samplings increases, your percentage of outliers will get closer to 5%.

Diana,

Bear in mind that a big part of the problem is, as pointed out by Peter Duong above, that the phrase “95% confident” is an abuse of terminology, or at least a poor choice of terminology. The phrase “95% confident” is technically (in the context of confidence intervals) *defined* to mean, “we have used a procedure with the property that 95% of all possible samples will result in an interval that contains the true value of the parameter.”

Note that I haven’t said “tends to”; that’s deliberate, but it begs the question of what “percentage” means in this context. I am OK talking about a percentage for an infinite set, e.g., “50% of all real numbers are < 0” (and also “50% of all real numbers are < 1”). What I mean by such a sentence is “the probability that a randomly chosen real number will be less than 0 is 1/2”. (Still some technicalities left out there, but it gets closer to being a precise definition.)

See also Tom Dietterich’s comment below.

Hi Diana

“As I understand it, with the repeated samplings, the percentage of the times that the true value lies within the confidence interval *tends* toward 95%. So you would expect the true value to lie outside the confidence interval for only 5 out of every hundred samplings–but this estimate becomes more accurate as the number of samplings approaches infinity.”

Not quite. The definition is with regard to an infinite number of replicate samples. If we replicated the study an unlimited number of times, calculating a confidence interval in the same way as we did for the actual data, 95% of the confidence intervals would cover the true value of whatever it is we’re estimating. It’s in this sense that we are “95% confident” that the intervals produced cover the truth.

The terminology is very confusing, possibly hopelessly so, but that’s what it means.

Thank you, Martha and George. It looks like my error here was in the use of the word “tends.” I was assuming that as the number of samplings increases, the percentage whose confidence interval contains the true value will come closer to 95%. That is, with 100 samplings you might be way off of 95%, but with 1000 you’d expect to be closer, and with 10,000 closer still. Or so I thought.

But I see now that this isn’t about “tending” but rather about the correctness of the algorithm over infinite samples. You cannot necessarily expect to get asymptotically closer. As Tom Dietterich says, there is no guarantee for any particular execution (or, as I take it, number of executions) of the algorithm.

It is good to be able to sort this out; thank you!

I think less an error than bringing in an additional uncertainty that was not necessary in the conversation.

The conversation is about what would happen in an infinite number of replicate samples and you have raised the issue of what will happen in a given finite sample (which is subject to sampling variability which decreases with increasing n).
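To put rough numbers on that sampling variability, here is a sketch under two assumptions that are not stated in the comments above: the replications are independent, and the procedure's coverage is exactly 95%. Writing K for the number of intervals, out of n replications, that contain the true value:

```latex
K \sim \mathrm{Binomial}(n,\ 0.95), \qquad
\mathbb{E}\!\left[\frac{K}{n}\right] = 0.95, \qquad
\mathrm{SD}\!\left(\frac{K}{n}\right) = \sqrt{\frac{0.95 \times 0.05}{n}}
```

That standard deviation is about 0.022 for n = 100, 0.0069 for n = 1,000, and 0.0022 for n = 10,000. So the observed fraction does settle down as n grows, but the 95% itself is a property of the procedure, not of any finite run.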

Agreed, great post. I think this is really important. Why? Because medical researchers are told by reporting guidelines and by statisticians (since the 1980s) that they should report confidence intervals instead of significance tests, so confidence intervals are everywhere. Despite that, almost nobody understands their correct meaning because it’s really hard to get your head around. In my experience, the usual interpretation adds another layer of wrongness to sentence C, which is that values close to the point estimate are more likely or more plausible (as if the CI represents a probability distribution). It may be OK most of the time that your interpretation is a bit off and not actually correct, but sometimes (and you probably won’t know when) imposing a wrong interpretation may fail badly and lead to important errors. So I think it does matter that people should understand what they are doing and interpret things according to their actual meaning.

Fundamentally, confidence intervals are not the right thing for what most people want to use them for.

What do you think most people want to use them for?

To say what are the most likely values of a parameter, given the data.

And you are saying in part that most people interpret them as point + error rather than as an interval estimate, right? I think that’s probably right especially since they are often graphically displayed that way. Even the way students learn to calculate them by hand probably adds to that perception.

Yes, exactly.

Simon,

The “plausibility” of values within a CI is in fact associated with a probability distribution. That is, not all values within a CI are equally “plausible”.

For instance, if your data follows a normal distribution, then under repeated sampling, if you divide the 95% CI into 4 equal-sized parts, the true mean lies within the center two parts ~68% of the time and within the outer two parts ~27% of the time. Here, values within the CI are more “plausible” the closer they are to the point estimate (i.e., center) of the CI.
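The split quoted above can be checked directly. The following is a minimal sketch, assuming a z-interval with known standard deviation (so the interval is the estimate plus or minus about 1.96 standard errors); the function name `split_coverage` is made up for illustration. Under that assumption the exact figures come out at roughly 67% and 28%, close to the rounded values in the comment.

```python
from statistics import NormalDist


def split_coverage(level=0.95):
    """Return (inner, outer): how often, under repeated sampling, the true
    mean lands in the centre half and in the outer two quarters of a
    known-sigma z-interval with the given confidence level."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    # The interval is estimate +/- z standard errors, so its centre half
    # is estimate +/- z/2 standard errors.
    inner = 2 * std_normal.cdf(z / 2) - 1
    outer = level - inner
    return inner, outer


if __name__ == "__main__":
    inner, outer = split_coverage()
    print(f"centre two quarters: {inner:.3f}")
    print(f"outer two quarters:  {outer:.3f}")
```

Note that these are still frequencies over repeated samples, not probabilities for the parameter given one realized interval.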

Confidence Intervals provide a great example of the value of “computational thinking”. We should regard the process of drawing a sample and computing the upper and lower bounds of the confidence interval as an algorithm. The frequentist statement is that this algorithm will output the right answer (i.e., the bounds will cover the true value of the parameter) with probability 1-alpha where the probability is taken over executions of the algorithm. The confidence statement is a statement about the probabilistic correctness of the algorithm in general, not about any particular execution of the algorithm (for which no guarantee is provided).
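The "interval as algorithm" view lends itself to a small simulation. This is only a sketch under assumed conditions: normally distributed data with a known standard deviation, and illustrative values for the true mean, sample size, and seed. The names `z_interval` and `coverage` are hypothetical, not from any library.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, level=0.95):
    """The 'algorithm': a known-sigma z-interval for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def coverage(n_reps, n=10, mu=0.25, sigma=1.0, seed=0):
    """Fraction of n_reps executions whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma)
        hits += lower <= mu <= upper
    return hits / n_reps


if __name__ == "__main__":
    # The guarantee is about the procedure; any finite run only approximates it.
    for n_reps in (100, 1_000, 10_000):
        print(n_reps, coverage(n_reps))
```

Each call to `z_interval` either covers the true mean or it does not; the 95% only shows up as a property of the procedure across many executions.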

Computer scientists like confidence intervals precisely because they correspond to a notion of the probabilistic correctness of an algorithm.

[In contrast, a (non-stochastic) Bayesian algorithm does not require a probabilistic guarantee. It either correctly computes the posterior or it does not. Of course if the method employs Monte Carlo sampling, then it too requires a probabilistic correctness guarantee. In such cases it is natural (and not inconsistent) to require a frequentist guarantee of correctness of a Bayesian computation.]

The bottom line is that confidence intervals are statements about algorithms and not statements about the world.

Yes, exactly, which is why I explicitly put it in terms of “functions that compute…” in my comment above:

https://andrewgelman.com/2016/11/23/abraham-lincoln-confidence-intervals/#comment-351782

But, the usual thing is to get one data set, create a simplistic model of repeated sampling that is a poor model of the science, produce one confidence interval, and then immediately make the following logically fallacious inferences:

A = (my algorithm corresponds to the way the world works), however this is KNOWN to be FALSE very often

B = (95% of intervals generated under repeated sampling contain the true value of the parameter)

C = (My particular interval contains the true parameter with 95% probability)

if A then B (a true statement about an algorithm.)

A is true (a most often FALLACIOUS statement about a scientific process)

therefore B is true (B actually has totally unknown truth value due to above)

if B then C (This is a FALSE statement regardless of whether B is true because frequency and probability don’t mean the same thing to most people, but it appears true as soon as you confuse frequency and probability together formally)

B is true therefore C is true (FALLACY: remember, B has no known truth value because A is FALSE. Even if B were true the statement “if B then C” is basically false when looked at carefully due to “probability” not meaning long run frequency to real world people who haven’t been brainwashed by stats classes)

So, after a fallacious assumption about a non-existent sampling process, and a pun between “probability as frequency” and “probability as degree of credibility” which makes “if B then C” seem like a true statement… people fallaciously deduce “my particular interval contains the true parameter with 95% probability” sometimes even when they can know with 100% credibility before looking at the data that the parameter can NOT possibly be in the interval (for example, when the parameter logically has to be greater than zero but the confidence procedure includes only negative values… the kind of thing that *can* happen with confidence procedures).
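To make the last parenthetical concrete, here is a hedged toy case, assuming a single observation x from a normal distribution with known standard deviation 1 and a true mean that is known in advance to be positive. The usual 95% interval is x plus or minus about 1.96, and it lies entirely below zero whenever x is below about -1.96. The function name and the value 0.1 for the true mean are illustrative only.

```python
from statistics import NormalDist


def prob_interval_all_negative(mu, sigma=1.0, level=0.95):
    """Probability that the z-interval x +/- z*sigma, built from one draw
    x ~ Normal(mu, sigma), contains only negative values."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # The upper endpoint x + z*sigma is negative exactly when x < -z*sigma.
    return NormalDist(mu, sigma).cdf(-z * sigma)


if __name__ == "__main__":
    # A small positive true mean (assumed value for illustration).
    p = prob_interval_all_negative(mu=0.1)
    print(f"P(interval excludes every positive value) = {p:.4f}")
```

For a true mean of 0.1 this happens in about 2% of samples: the procedure still has 95% coverage overall, yet in those samples one knows for certain that the realized interval misses.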

I think most people would agree that inference based on a simplistic model that is a poor model of the underlying science isn’t going to work well.

Not always easy to notice a model is poor for the underlying science https://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

There, a poor but fully Bayesian model produces the usual confidence intervals, which are taken to have good frequentist coverage properties.

Makes one wonder – good, good for what?

But “Random sampling” is used as a default model for LOTS of stuff in science EVERY DAY, for example the effectiveness of drugs, and yet the only situation where it’s a good model is basically surveys of finite populations using random number generators.

So, if you want to find out the average age of drivers in California, if you randomly sample from the DMV registration it would be a good model to say that your data comes from a random number generator process. But if you want to find out how well crops grow given a certain fertilizer it’s a TERRIBLE model.

Alien report: We have found a planet, it is called Earth. There is at least one intelligent species that goes by the name of “Humans” or Homo Sapiens. They have discovered nuclear power and computation. They have visited their satellite (the “Moon”) and are planning to visit a neighboring planet (“Mars”, further away from the central star, the “Sun”). They are starting to understand and control the genetic information that they and other species contain and that gets replicated upon reproduction. They have figured out confidence intervals long ago, but they like to keep arguing about them.

I sure hope the aliens won’t be gene-centrists. One species of those is enough.

More seriously, what about the fact (?) that frequentist CIs correspond to Bayesian credible intervals assuming flat priors? Wouldn’t that provide a practical interpretive benefit to users despite mixing up philosophies?

I think it would be more correct to say that SOME frequentist confidence intervals correspond to Bayesian credible intervals assuming a flat prior. There isn’t a uniquely correct confidence interval generating procedure, so the correspondence isn’t guaranteed. They aren’t the same though, because the credible interval is a summary of a probability distribution and the confidence interval isn’t.

I find this an interesting subject. Non-statistician here, but trying to wrap my head around this. This may be stats 101, but am I understanding this correctly?

A) If confidence intervals are always constructed under the assumption that sampling was random (i.e. not systematically biased), do we really need to stress the repeatability of the experiment itself… Or could we say CIs are really about the amount of error that I’m willing to accept as a researcher? I mean, suppose a researcher conducts 100 unrelated experiments, could (s)he look back and say (assuming random sampling) that 95 of his/her CIs contain the ‘true value’ of the parameter?

B) A phrase like ‘true value’ doesn’t communicate that if the measurement procedure is biased (e.g. systematically underestimates), fewer than 95 of 100 CIs will contain the true value, no? It seems to me CIs are about precision, not accuracy then.

There is a lot packed into (A) but the answer at the end is that the expected value of the number of “true value containing” CIs is 95 but there is no way to know for certain what it actually is. Just like if you roll a die 60 times under the hypothesis that the die is fair the expected number of 3s is 10 but you don’t actually know that it will be 10 and it’s actually not much more likely to be 10 than to be 9 or 11. If the die is fair it is unlikely that you would get 0 3s, but not impossible. But if you get 0 3s, you might want to reconsider the hypothesis that the die is fair.
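The die analogy can be worked out with the binomial formula. This is a small sketch assuming independent trials, a fair die, and intervals whose coverage is exactly 95%; `binom_pmf` is a helper written for this illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


if __name__ == "__main__":
    # Number of 3s in 60 rolls of a fair die.
    for k in (0, 9, 10, 11):
        print(f"P({k} threes in 60 rolls) = {binom_pmf(k, 60, 1 / 6):.6f}")

    # Number of covering intervals among 100 independent 95% CIs.
    at_least_95 = sum(binom_pmf(k, 100, 0.95) for k in range(95, 101))
    print(f"P(exactly 95 of 100 CIs cover)  = {binom_pmf(95, 100, 0.95):.3f}")
    print(f"P(at least 95 of 100 CIs cover) = {at_least_95:.3f}")
```

Ten 3s is the single most likely count, but it is only 2% more likely than nine, and zero 3s is possible though extremely rare. The same logic applies to the 100 experiments: 95 covering intervals is the expected count, not a guaranteed one.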

(B) I’m not sure what you mean by the first sentence, but yes part of the discussion is that assuming simple random sampling and perfectly accurate measurement is assuming a lot. CIs are based on a theoretical model about sampling error not about measurement error.

(B) is related to the following question: Even if a researcher has a good understanding of what “confidence interval” means, they often need to discuss results of a study with someone (e.g, a boss, a physician, a member of the school board) who doesn’t understand and isn’t willing or able to go to all the work involved in understanding. So there is a practical question of what is a *good enough* explanation of confidence interval for such a “layperson”.

I don’t have a good answer to this, but have come up with a possible candidate, something like, “Drawing inferences from data based on a sample always involves some uncertainty. The confidence interval helps describe some of this uncertainty, namely the uncertainty (“sampling uncertainty”) arising from the fact that we do not have complete data, but have to estimate from a sample. There are also other types of uncertainty involved, for example measurement uncertainty, which we might or might not be able to estimate, depending on the particular circumstances. Unfortunately, the confidence interval itself involves some uncertainty: If all conditions are in place, the confidence interval only does what we intend for most samples we might encounter, but will always miss the mark for some small percentage of them.”

This is probably too unwieldy for the intended purpose. Does anyone have any better suggestions?

Obvious suggestion – give them bayesian results, which they can understand.

(someone had to say it)

The trouble with this is that the people who are facing this problem are often constrained by protocols that prescribe frequentist methods.

Also, I’m not convinced that “they” can all understand Bayesian results; many laypeople (e.g., many physicians, I have read) believe that “uncertainty between two possible outcomes” must mean that each possibility has probability 1/2.

My experience is that people (meaning mainly health professionals and researchers) interpret frequentist results in a Bayesian way, because those are the results that they really want. But I’m not sure if there is any real evidence that bayesian results lead to better understanding and better decisions than frequentist results. I’m not even sure how you would approach that question, but I’d really like to know if anyone has tried.

My experience is that many people do interpret frequentist confidence intervals in a Bayesian way. However, I wouldn’t go so far as to say that this is because it is what they want; I’d say that it is more because the correct interpretation of the frequentist confidence interval is complicated, whereas the Bayesian interpretation is easier.

By the way, I’ve found that one way to help people understand what frequentist confidence intervals are (and are not) is to teach them enough Bayesian analysis that they can see the difference between Bayesian intervals and frequentist intervals.

Unfortunately, there is not time in a typical intro stats class to do this. However, I was fortunate to be able to teach (for four summers) a prob/stat course for a master’s program for math teachers, who had already had a frequentist introductory statistics course. They really liked the Bayesian approach, and it helped them understand the frequentist approach better. But they had a better math background than many people taking an intro stats course, which made the course I taught feasible for them to grasp.

This problem of the intro course is one I always have in mind, and in a way I wish I could just not cover some things, but if students only ever take one class in statistics, I think they need to be able to walk away with the idea of sampling variation.

“We can be 95% confident that the true mean lies between 0.1 and 0.4.”

I haven’t read all comments and perhaps somebody else has already noted this, but…

“we can be x% confident that…” has no meaning whatsoever outside the context of confidence intervals.

I’m actually with Russ Lyons: “Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4,” was just fine. In his view, “this is the very meaning of “confidence.’””

But if this is so, saying that “We can be 95% confident that the true mean lies between 0.1 and 0.4.” doesn’t *add* anything to the calculation of the CI. It is certainly not an interpretation, it’s just the very same thing.

Christian, as we argue here, any assignment of specific “label” – probability, confidence, call it X-ness – will fall victim to the reference class problem, because there are any number of procedures that could have generated the specific interval that have different X-ness. The obvious examples are mixture procedures (flip a coin, produce one or another interval) but also other functions of the test statistics might produce the interval in question in the specific case but have different confidence coefficients. It never makes sense to take the confidence coefficient and apply it to the interval, regardless of what you call it.

Richard:

I would be interested to know what you think of the Mueller-Norets (2016) paper and the “bet-proofness” concept for interpreting realized CIs that I’ve been rabbiting on about (sorry) in the comments here and in the other blog post. The previous literature M-N cite overlaps with what you cite in some of your papers (Buehler 1959, Robinson 1979, maybe some others?). “Bet-proofness” seems like a legitimate interpretation for realized CIs that is not unintuitive, and is applicable to a lot of everyday cases (what M-N call “standard problems”). Their paper is about extending it to some non-standard problems.

–Mark

OK, fair enough. I still think that one could define “we’re 95% confident” in this way, but then this means that this way of speaking doesn’t rule out that we get a different confidence value from a different procedure for the same data, so it’s also fine to argue that this definition, although not “wrong”, is somewhat misleading.

My main point was not about the correctness of the statement but rather about the fact that the statement, if used along these lines, is not an interpretation because it doesn’t interpret or explain anything in terms that would be comprehensible outside the CI context. So I’m with you in objecting against its use, if for different reasons.

My estimate of the population mean is between 1.0 and 4.0 (95% CI).

You then proceed to write your article, book, mission statement, treatise, what have you, as if you were correct this particular time that the parameter lies in the interval. You only get a chance to be correct 95% of the time if you make the claim that it’s in the interval every time.

I’m amazed that the same people who balk at that advice have no problem saying, “there was an effect with a mean of 2.0, p < .05,” and then discussing it as if there really is an effect without any qualification, statement of confidence, etc. Even if you remind them they’re probably only correct about that statement about 50% of the time at best, they still run with it like it’s truth granted from God and is irreversible. Any contrary evidence only shows you moderators.