We had a vigorous discussion the other day on confusions involving the term “confidence interval,” what does it mean to have “95% confidence,” etc. This is as good a time as any for me to remind you that I prefer the term “uncertainty interval”. The uncertainty interval tells you how much uncertainty you have. That works pretty well, I think. Also, I prefer 50% intervals. More generally, I think confidence intervals are overrated for reasons discussed here and here.

But a standard frequentist “confidence interval” doesn’t necessarily represent uncertainty. For instance, if I use some standard procedure to compute a frequentist 95% confidence interval for some parameter that turns out to be (-0.9,+0.1), and I happen to have strong theoretical reasons to think that the parameter cannot be negative, then this interval certainly does not represent my uncertainty.

Now, I wouldn’t use “confidence interval” for a Bayesian posterior interval of some sort – I see it as a fundamentally frequentist term. “Uncertainty interval” might be a good term in that context.

Radford:

Mathematically speaking, a standard frequentist interval need not represent “uncertainty” in its usual English sense. But, for the same reason, it also need not represent “confidence” in its usual English sense.

One option would be to simply call it “a frequentist coverage interval” and eliminate all senses of confidence, uncertainty, probability, etc. But if we are going to use an evocative term, I’d prefer “uncertainty” to “confidence” because in practice these intervals are used to convey uncertainty.

As I see it, confidence intervals are indeed *intended* to convey some idea of the uncertainty involved in the estimate. And, in my experience, conveying the idea that there is some uncertainty in the estimate is important. The word “confidence,” on the other hand, may convey a sense of, well, of confidence in the estimate, which to many conveys a sense of certainty — which is not a good thing.

Andrew:

Following on from the other discussion, what would be your preferred terminology for distinguishing between the procedure for calculating the interval vs an actual interval using some dataset? “Uncertainty interval” vs “realized uncertainty interval”? “Uncertainty interval estimator” vs “uncertainty interval estimate”? Something else again?

Maybe I’m in the minority but I think I’ve convinced myself that failing to distinguish between the two concepts is a bigger terminological sin than the use of the word “confidence”.

(PS: I rather like your suggested “frequentist coverage interval”. Or even just “coverage interval”.)

I also like “coverage interval” (but dislike “uncertainty interval”). And for the sake of being thorough: I like the cat picture.

But I think “coverage interval” is still not quite accurate enough, since the realized interval doesn’t actually have the coverage, but the procedure generating the intervals guarantees (conditional on the assumptions) their overall coverage. So, how about “coverage procedure interval” or, in full, “frequentist coverage procedure interval”? And add “estimator”/“estimate”/“realized” etc. as needed.

I’d suggest something like “uncertainty interval procedure” vs “uncertainty interval output” (or perhaps “uncertainty interval estimate”?). I think “realized” is likely to have connotations of “real” — i.e., of certainty, which goes against the purpose of conveying uncertainty.

The translation problem is also in the word “interval”. Ears hear that with a variety of meanings. It is a measure of uncertainty but that idea has sign and magnitude, as in you approach certainty or move away from it and you have to worry about the meaning of that uncertainty and thus its effect, which can be approached again with sign from no effect up or toward no effect. That last can be really hidden – and that is a source of so much of the problem with social sciences, that the effect is getting smaller. This of course means they’re examining the vanishing points of effects too often, which happens in a type M situation, meaning the universe of contexts in which an effect matters is a set c and you test c1 not realizing c1 is largely disjoint from c. That stuff is highly resistant to statistical analysis, which is the same as saying it fails if you believe in statistics! For example, if you define the c context in which “power pose” matters, then it would be complexly determined, meaning c approaches uniqueness in the ways we can measure and it becomes a rare result. The difference then between this rarity or unique manifestation in context is possibility: as in a mentalist can infer from close observation and control of the communicative context (like in a show or as a pickpocket) how you will act and thus predicts what will happen with sufficient accuracy that some people can do this quite well and we all try to do it. But no one can read minds or read cards without cues because that is impossible given our nature and the limits of our perceptions.

In the last the uncertainty is nil. We just say, “This is bull crap and that’s not in doubt.” But when you approach a balance point between certainty and uncertainty, then we start to worry about how certain we are or how uncertain we are. I’m trying to say that we rarely need to worry about this issue when things are obviously true or false, on or off, 1 or 0 so that reduces the hard cases to those which hit near enough the balance point: is this true or not? Or is this false or not? (And the converses, which unfortunately statistics decided to use as the general approach: is this getting true from being false or is this getting to be false from being true?) To say, this is the confidence interval or uncertainty interval balls all this stuff up and that means people tend to read it as they want or can. To focus on the specific interval, label it at both ends with both sets of labels and you see it’s where uncertainties balance with certainties. That extends to effects. I think of it as the “if.interval”, as in “if true, how far from false and if false, how far from true?” In my peculiar notation, the dot signifies process so you have if processed over interval, which automatically should imply bidirectionality with appropriate endpoint labeling. I subdivide into direction: if-1.interval and if-2.interval so – means direction, which labels the endpoints appropriately, and thus the disappearance of – has meaning.

Wanted to add you can view that last bit as a null in an array which is either stated as the null of the attribute or by a higher level treatment in which non-appearance carries whatever information is contained in the null. Computer languages don’t do a good job of this.

I can’t help thinking of the scene in A Serious Man:

Larry: I can interpret, Clive. I know what you meant me to understand.

Clive: Mere sir, my sir.

Larry: Mere sir, my sir?

Clive: Mere surmise, sir. Very uncertain.

Diana:

Hey, we just saw that movie a couple days ago!

I think the fact that lots of really clever people are having long discussions about how to interpret confidence intervals shows pretty clearly that there is a major problem here. Part of the issue is surely (see the Abraham Lincoln post) the conditioning on a specific data set: people need to say something based on the data that they have, but the definition of confidence intervals is about the population of all possible intervals. Or about the procedure for generating the CI rather than the CI itself. So there is a real disconnect between how it is defined and how it is used.

The common use as an indication of uncertainty in an estimate seems to depend on it being kind-of OK most of the time. Is that really good enough?

But there is a formal interpretation available for a CI after conditioning on a specific dataset, at least for “standard problems” – see the Mueller & Norets and Casella papers cited in the Abe Lincoln discussion. Yet hardly anyone seems aware of it or uses it (mea culpa – I learned about it only fairly recently myself):

Links here:

http://andrewgelman.com/2016/11/23/abraham-lincoln-confidence-intervals/#comment-353238

http://andrewgelman.com/2016/11/23/abraham-lincoln-confidence-intervals/#comment-351787

I’ve had a quick look at those papers but I’m not sure what they add in terms of an everyday easy-to-understand interpretation of confidence intervals. Probably I am missing something that is in there but can you summarise?

The concept is “bet-proofness”. Like you, I’d like something I can point to that gets the idea across in an easy-to-understand way, but I haven’t been able to find anything other than these rather technical papers and the previous literature they cite. The best I can do is cite the Mueller-Norets (2016) paper and then try paraphrasing.

Mueller-Norets (2016, published version, p. 2185):

“Suppose an inspector does not know the true value of θ either, but sees the data and the confidence set of level 1−α. For any realization, the inspector can choose to object to the confidence set by claiming that she does not believe that the true value of θ is contained in the set. Suppose a correct objection yields her a payoff of unity, while she loses α/(1−α) for a mistaken objection, so that the odds correspond to the level of the confidence interval. Is it possible for the inspector to be right on average with her objections no matter what the true parameter is, that is, can she generate positive expected payoffs uniformly over the parameter space?”

If the answer is “no”, the confidence set is bet-proof. For “standard problems” (see Mueller-Norets), a standard frequentist realized CI is bet-proof (simplifying a bit here).

Paraphrasing: say α=0.5, θ is a scalar, and I’m a casino customer. The casino picks a θ, then uses θ and a computer to generate a random dataset, and then uses this dataset to calculate a realized frequentist 50% CI. They show me the calculated 50% CI but not the true θ. I can bet that the CI doesn’t contain the true θ. If I’m right, I get $1. If I’m wrong, I lose $1. Can I make money on average? No – realized frequentist CIs are bet-proof, for “standard problems”. (Simplifying a bit again, but I *think* I got that right.)
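The casino paraphrase above can be checked, for the simplest betting strategy only, with a small simulation. This is a minimal sketch under assumptions that are not taken from the Mueller-Norets paper: the data are normal with known standard deviation, the interval is the usual sample mean plus or minus z standard errors, and the function name `average_payoff_always_object` and its parameters are made up for illustration. It tries just one strategy (object every time), so it illustrates the payoff arithmetic rather than demonstrating bet-proofness, which is a statement about every possible strategy.

```python
import random
from statistics import NormalDist


def average_payoff_always_object(theta, level=0.5, n=10, sigma=1.0,
                                 rounds=100_000, seed=0):
    """Average payoff per round for a bettor who objects to every realized CI.

    Assumed setup (hypothetical, for illustration): n draws from
    Normal(theta, sigma) with sigma known, and the usual interval
    xbar +/- z * sigma / sqrt(n).
    """
    rng = random.Random(seed)
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = sigma / n ** 0.5
    loss = alpha / (1.0 - alpha)  # amount lost on a mistaken objection
    total = 0.0
    for _ in range(rounds):
        # The mean of n normal draws is Normal(theta, se), so draw it directly.
        xbar = rng.gauss(theta, se)
        covered = (xbar - z * se) <= theta <= (xbar + z * se)
        # Win 1 if the interval missed theta, lose `loss` if it covered theta.
        total += -loss if covered else 1.0
    return total / rounds


payoff_50 = average_payoff_always_object(theta=3.0, level=0.5)
payoff_95 = average_payoff_always_object(theta=3.0, level=0.95)
print(payoff_50, payoff_95)
```

With `level=0.5` the stake is $1 either way; with `level=0.95` the loss on a mistaken objection is 0.05/0.95. In both cases the average payoff should hover around zero, up to simulation noise, whatever value of `theta` is plugged in.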

The “standard problems” caveat rules out settings that can generate e.g. empty CIs, because I can make money on average if I bet “no, the CI doesn’t contain the true θ” every time I see a realized empty CI (easy money – the realized CI is guaranteed to be wrong).
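To see why empty realized sets break bet-proofness, here is a deliberately artificial toy (my own construction, not an example from the Mueller-Norets paper): a "50% procedure" that ignores the data and returns the whole real line half the time and the empty set the other half. Its coverage is 50% by construction, yet a bettor who objects only to the empty sets never loses. The names `toy_interval` and `payoff_objecting_to_empty` are hypothetical.

```python
import random


def toy_interval(rng, level=0.5):
    """Return the whole real line with probability `level`, else None (empty set)."""
    if rng.random() < level:
        return (float("-inf"), float("inf"))
    return None


def payoff_objecting_to_empty(level=0.5, rounds=100_000, seed=0):
    """Return (average payoff per round, coverage rate) for the toy procedure.

    The bettor objects only when the realized set is empty and stays out
    of the bet otherwise.
    """
    rng = random.Random(seed)
    total = 0.0
    covered_count = 0
    for _ in range(rounds):
        interval = toy_interval(rng, level)
        if interval is None:
            # An empty set cannot contain theta, so the objection is always right.
            total += 1.0
        else:
            # The whole real line always contains theta; the bettor does not bet.
            covered_count += 1
    return total / rounds, covered_count / rounds


avg_payoff, coverage = payoff_objecting_to_empty()
print(avg_payoff, coverage)
```

The coverage comes out near 50% as advertised, while the average payoff per round is close to +0.5 regardless of the true θ: easy money, as described above.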

This is good, but in a teaching setting you now have to mention that “bet-proofness” is not the same as the more commonly encountered “no-arbitrage” (or if you insist, “no Dutch book”), as it refers to expectation with respect to the ensemble probability. I.e., convergence in probability, not almost-sure.

Russian dolls!

Ouch! That’s going in the opposite direction. Do you have “an everyday easy-to-understand interpretation” for distinguishing between them?

This seems a bit backwards to me; we have these things (Intervals Formerly Known As Confidence Intervals) that people produce all the time, but misinterpret. Isn’t the right response to find out what intervals people want (i.e. what people want to know) and produce those, rather than try to find a valid interpretation of the things that they already have, which may not answer their scientific questions any better?

I’m not sure that “bet-proofness” is likely to be very helpful to people but I suspect I’m still not understanding something there.

[I see Daniel said a similar thing further down but by inserting my comment here it looks as though I said it first!]

Why not do both?

And also on a pragmatic note: as you say, people produce these things all the time, and it’s hard to stop them from putting an interpretation on a realized CI. The temptation is often just too great. But I’m a little more optimistic about getting people to use an interpretation that actually has a legitimate foundation. It’s easier to tell someone “don’t say that – say this instead” than it is to be completely negative and just tell them “don’t say that”. Even if the legit interpretation isn’t so easy to understand, it’s still an improvement over using one that’s wrong.

Can’t say I’m a fan of either ‘confidence’ or ‘uncertainty’ in the Freq case. Still too much of a positive statement to me. How about an ‘adequacy’ or ‘consistency’ interval? Laurie Davies has advocated the former, though divorced it further from Freq origins. No immediate issue with overprecision under misspecification either – here a small range of consistent or adequate parameter values makes sense.

Apologies to most, if not all, of the commentators. Many of you are far better trained than me, but I still don’t see this discussion as advancing anything at all. I think most of the concerns are about how people will mis-interpret statistical information. We know that people have a bias towards certainty when there is none. I think confidence intervals are an advance – compared with using the point estimate as if it were 100% certain (which is still what many people do). Treating a 95% confidence interval – mis-interpreted or not – as a 100% confidence interval is also a very human (and flawed) reaction. Changing the name from confidence to uncertainty or adequacy, or consistency or “cat interval” may indeed help people understand just how uncertain things really are, but I fear that it may have the reverse impact. If it is so hard to interpret the interval, then people may end up just using the point estimate.

The other strand of thought that keeps recurring is the Bayesian-frequentist debate. I really don’t see confidence intervals as the avenue with which to convince the world they should be Bayesians. I’ve seen plenty of coherent arguments on this blog that make that point very well – and, yet, most of the world is trained (if at all) in the frequentist methodology. Do you think that attacking the confidence interval is really an effective argument for convincing people to become Bayesian?

I do think we can teach people what a confidence interval actually means and that the common use of that interval, while flawed, does convey some information. I only want a sense of humility about what that information is. After all, the problems with incorrect interpretations of the confidence interval are not as serious as the issues with forked paths, lack of replicability, low power, failure to include prior information, etc. etc.

“I do think we can teach people what a confidence interval actually means and that the common use of that interval, while flawed, does convey some information.”

Could you expand on what information it conveys? (Serious and sincere question, not rhetorical or sarcastic.)

My feeling is that people commonly misinterpret frequentist 95% CI’s as Bayesian credible intervals in at least two ways: 1) that they mean “there is a 95% chance that…”, and 2) that the CI and its guarantees apply to the particular sample we have in hand rather than to a process repeated an infinite number of times. Once you correct these misconceptions, though, what’s left for the frequentist-because-that’s-how-I-was-trained practitioner?

I have to confess that I can’t see the bridge from the seemingly abstract frequentist definition of CI’s — as I understand them — to practical use on a particular sample that I happen to have. It seems like once the misunderstandings are corrected, we’re left with something of a vague, relative comparison between CI’s: “This one’s bigger, so it’s more uncertain in some sense, at least as long as we weren’t unlucky enough to have a sample from the 5% of all possible samples where the CI may be nonsense.”

I agree that forked paths, low power, etc, are more important in some sense. At the same time, forked paths, power, etc, seem like “more advanced” topics that involve things like experimental design, while CI’s are “basic” topics that are built in to every piece of statistical software we might use and are de rigueur in most fields, so perhaps we have more chance of making an impact in the basics. (And in my wild-guess estimating I wonder if the more advanced topics more severely affect some studies, while the basic topic mildly affects all studies and the sum over all of science is about the same. Just my fantasizing, really.)

At the risk of repeating the same discussion we just experienced (twice), here goes:

To be concrete, let’s say we have a sample with a confidence interval for the mean that ranges from 1.1 to 1.4. It would be correct to state that, with repeated sampling, 95% of the intervals computed from random samples (of this size, etc.) will contain the true mean. It would not be correct to state that we are 95% confident that the true mean lies in the range of 1.1 to 1.4, i.e. this particular confidence interval. The true mean either is in that interval or not. And we only have this one sample. So, what is the probability that the true mean is in the interval 1.1 to 1.4? Now we are on a slippery slope. Rather than debate the meaning of “probability” (which is a key issue, I agree, and one worth exploring), our particular interval is either one of the 95% that contain the true mean or one of the 5% that do not. We don’t know which, but if you ask me what the probability is that we have one of the “good” ones, I’d say 95%. And I’d be wrong – but how wrong? What would you rather I say about the one interval I have? Personally, I’d prefer to say 95% (with all of its faults and potential for misunderstanding) than to say nothing at all.
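The repeated-sampling statement can be made concrete with a quick simulation. This is only a sketch under assumptions that are mine rather than part of the example above: normal data, a known standard deviation (so the z-interval is exact), and made-up values for the true mean, the standard deviation, and the sample size, chosen so that a typical interval is about as wide as the 1.1 to 1.4 one. The function name `coverage_rate` is hypothetical.

```python
import random
from statistics import NormalDist, fmean


def coverage_rate(true_mean=1.25, sigma=0.5, n=40, level=0.95,
                  reps=20_000, seed=1):
    """Fraction of repeated samples whose interval contains the true mean.

    Assumes normal data with known sigma, so the interval is
    xbar +/- z * sigma / sqrt(n).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = fmean(sample)
        # Each repetition produces a different interval; the true mean never moves.
        if (xbar - half_width) <= true_mean <= (xbar + half_width):
            hits += 1
    return hits / reps


print(coverage_rate())
```

The printed fraction should land close to 0.95. Note what is random here: the intervals vary from sample to sample while the true mean stays fixed, which is exactly why the 95% attaches to the procedure and not to any single realized interval such as 1.1 to 1.4.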

The solution is simple. Just DON’T MAKE confidence intervals. Build Bayesian models, and then spend your time arguing about whether the Bayesian model is an adequate representation of the scientific knowledge instead of whether the computation you just finished doing is an adequate proxy for something you actually cared about.

This is not a million miles away from the concept of “bet-proofness” of realized CIs – see the earlier comment. And bet-proofness seems to be a perfectly legitimate way to talk about a CI calculated using a sample. No slippery slopes if this is used to frame it. I should repeat the “standard problem” caveat, though.

Mark:

I am not sure, and it takes a fair amount of effort to understand their claimed solution and what trade-offs it makes in order to be bet-proof.

I do think it’s the wrong strategy – searching for procedures that meet properties that are taken as good (good for what?).

From their conclusions paragraph “… , we derive confidence sets that are reasonable by construction. Specifically, we suggest enlarging a credible set relative to a prespecified prior by some minimal amount to induce frequentist coverage.”

I doubt if that prespecified prior is trying to represent an underlying reality to any meaningful extent.

Additionally “One might also question the appeal of the frequentist coverage requirement [uniform for all possible parameter values]. We find Robinson’s (1977) argument fairly compelling: In a many-person setting, frequentist coverage guarantees that the description of uncertainty cannot be highly objectionable a priori to any individual, as the prior weighted expected coverage is no smaller than 1 − α under all priors.” I don’t agree. Someone might have a prior on effect sizes in psychology that put considerable probability on huge effect sizes – I think I should be able to ignore them.

Keith,

I think you’re conflating the “standard problem” result vs. what their paper is mostly about.

The quote from their conclusion paragraph about “enlarging a credible set” that you cite refers to what they call “non-standard” problems. An example is a procedure that can generate an empty CI (see the earlier comment). To be fair to the authors, the contribution of the paper is about these non-standard problems so that’s what they spend most of their time discussing.

But for standard problems (p. 2186) no enlargement of a credible set or anything else is necessary. Off-the-shelf realized CIs for standard problems have the bet-proof interpretation. (At least, that’s how I read it.)

> I think you’re conflating the “standard problem” result vs. what their paper is mostly about.

No, in fact I am much more interested in “non-standard” problems.

> Off-the-shelf realized CIs for standard problems have the bet-proof interpretation.

OK but that does not mean they perform well on what matters – e.g. for type S and M errors see

http://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

Fair enough. I’m also interested in the nonstandard problems (esp the Anderson-Rubin intervals that are discussed in the Mueller-Norets paper). But I read Andrew’s original posts as being made mostly with what M-N would call “standard problems” in mind.

Maybe worth quoting here how they define “standard problems”:

“In the standard problem of inference about an unrestricted mean of a normal variate with known variance, which arises as the limiting problem in well behaved parametric models, the usual interval can hence be shown to be bet-proof.”

After you finish changing minds about the use of this term, Andrew, try your hand at changing the name of the joint ASA/RSS magazine from “Significance” to something more appropriate.

Call it “Uncertainty”! Or, per other comments above, “Credibility”, “Coverage”, “Realized confidence”, “Adequacy” or “Consistency”. It will fly off the shelves.

I agree with Mike — but I’m not in love with any of the suggestions from George. Perhaps “Statistics and Uncertainty”?

My proposal: Call it the if-you-were-to-run-the-study-a-hundred-times-which-of-course-you-never-will-but-just-stay-with-me-here-then-95-of-those-studies-would-produce-intervals-that-contain-the-true-value-and-by-the-way-the-one-interval-you-have-is-one-of-those-intervals interval.

+1, but this would make it hard to tweet anything about those-things-formerly-known-as-confidence-intervals (hmm, maybe I should start calling them TTFKACI)

Nice try, but no cigar — you’d need to say (insert your own hyphens):

“… all possible samples of the same sample size … 95% of them would produce intervals that contain the true value … and this also assumes that all the model assumptions are true … and you have no knowledge of whether or not the one interval you have contains the true value.”

At least in my field, assumptions are for the birds.

Ah, Twitter-feed!

I’d prefer something that isn’t formula-dependent, something like “Listening to Data.” A bit off-the-beaten-path, but something with that flavor.

95% Coverage Rate Under Disingenuous Expectations (CRUDE) interval

+1

From some of the comments, I was thinking “repeated use coverage interval” as that points to the repeated use sense and it allows it to be frequentist if the repeated use is specified under the same parameter value _and_ that coverage is uniform for all possible parameter values or Bayesian if it is specified under parameter values repeatedly drawn from a prior.

Now that prior should likely not be the one assumed but rather prior(s) worried about and here it might make sense for that to be a point parameter. Note from this Bayesian perspective, the coverage would not be uniform but why should uniform coverage be seen as of overriding importance?

This would be one way to do what Daniel suggests http://andrewgelman.com/2016/11/26/reminder-instead-confidence-interval-lets-say-uncertainty-interval/#comment-353634

This won’t be easy – getting at the pragmatic meaning of concepts (what to make of instances of the concept for future thinking and actions) is always difficult and somewhat elusive.
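The two senses of “repeated use” mentioned above can be compared in a small simulation. This is a hedged sketch and not anything proposed in the thread: it assumes a single normal estimate with known standard error, a Normal(0, tau) prior, and the resulting 95% posterior interval. The function names and the specific numbers (a standard error of 1, tau of 1, a fixed parameter value of 3) are made up for illustration. The `draw_theta` argument is where a “prior worried about”, including a point value, would be plugged in.

```python
import random
from statistics import NormalDist


def posterior_interval(xbar, se, tau, level=0.95):
    """Central posterior interval for theta under a Normal(0, tau) prior.

    Assumes xbar ~ Normal(theta, se) with se known.
    """
    w = tau ** 2 / (tau ** 2 + se ** 2)   # shrinkage weight on the data
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    center = w * xbar                     # posterior mean
    half = z * (w ** 0.5) * se            # z times the posterior sd
    return center - half, center + half


def repeated_use_coverage(draw_theta, se=1.0, tau=1.0, reps=100_000, seed=2):
    """Coverage of the interval when theta is produced by draw_theta(rng)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        theta = draw_theta(rng)
        xbar = rng.gauss(theta, se)
        lo, hi = posterior_interval(xbar, se, tau)
        if lo <= theta <= hi:
            hits += 1
    return hits / reps


# Repeated use with the parameter held at one point value.
fixed_at_3 = repeated_use_coverage(lambda rng: 3.0)
# Repeated use with the parameter redrawn from the assumed prior each time.
from_prior = repeated_use_coverage(lambda rng: rng.gauss(0.0, 1.0))
print(fixed_at_3, from_prior)
```

Averaged over draws from the assumed prior, coverage should come out near 95%, while at the fixed value of 3 it falls well below that. So the same interval has quite different “repeated use coverage” depending on which repetition is specified, and coverage in the Bayesian sense is indeed not uniform over parameter values.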