## Problematic interpretations of confidence intervals

Rink Hoekstra writes:

A couple of months ago, you were visiting the University of Groningen, and after the talk you gave there I spoke briefly with you about a study that I conducted with Richard Morey, Jeff Rouder and Eric-Jan Wagenmakers. In the study, we found that researchers’ knowledge of how to interpret a confidence interval (CI) was almost as limited as the knowledge of students who had had no inferential statistics course yet. Our manuscript was recently accepted for publication in Psychonomic Bulletin & Review, and it’s now available online (see e.g., here). Maybe it’s interesting to discuss on your blog, especially since CIs are often promoted (for example in the new guidelines of Psychological Science), but apparently researchers seem to have little idea how to interpret them. Given that the confidence percentage of a CI tells something about the procedure rather than about the data at hand, this might be understandable, but, according to us, it’s problematic nevertheless.

I replied that I agree that conf intervals are overrated, a point I think I discussed briefly here. We used to all go around saying that all would be ok if people just ditched their p-values and replaced them with intervals. But, from a Bayesian perspective, the problem is not with the inferential summary (central intervals vs. tail-area probabilities) but with those default flat priors, which are particularly problematic in “Psychological Science”-style research where effect sizes are small and estimates are noisy.

1. ConvexPhil says:

The link to the paper is behind a security access check of some kind!

“Forbidden

You don’t have permission to access /inpress/HoekstraEtAlPBR.pdf on this server.”

• ConvexPhil says:

Judging by the other comments, I guess I’m the only one who had this problem? Looked up the paper by other avenues, anyway. A title would’ve been helpful, though!

2. Mark says:

Confidence intervals clearly are difficult. The authors of this paper reported confidence intervals for a convenience sample (p. 4). No super-population was posited, no data generating process, no IRT/measurement model. To what are they inferring?

The authors are quick to trumpet the failures of their respondents to properly answer the questions, but I wonder if they asked the right questions. As Gelman says in the abstract of the linked paper, “The formal view of the P-value as a probability conditional on the null is mathematically correct.” All of the questions are intended to be asking about the marginal probability of the hypotheses, but is it reasonable to interpret the questions as asking about conditional probabilities? I’m reminded of the Monty Hall problem in which two reasonable interpretations (one in which the probabilities are marginal, one conditional) lead to completely different conclusions.

3. This paper provides a very nice review of older work and interesting information on a new survey. Unfortunately, it promulgates some incorrect information itself. One aspect was already noted by Mark. To quote FPP (Freedman, Pisani and Purves, 4th ed.): “The formulas for simple random samples should not be applied mechanically to other kinds of samples.” “With samples of convenience, standard errors usually do not make sense.” (p. 437)

The other error is that the authors state that their option 5, “We can be 95% confident that the true mean lies between 0.1 and 0.4.” is incorrect, whereas, in fact, it is correct. Instead, the authors say, “The correct statement, which was absent from the list, is the following: ‘If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean.'” However, this is the very meaning of “confidence” and why it is different from “probability”. Again, to quote FPP:

“You can be about 95% confident that the population percentage is caught inside the interval from 75% to 83%.” (p. 381)

“The word ‘confidence’ is to remind you that the chances are in the sampling procedure; the average of the box is not moving around.” (p. 417)

Fortunately, these errors do not substantially change the important points the authors make.

Naturally, I do recommend FPP for everyone and, in particular, whenever statistics is taught. FPP has many exercises on proper interpretation of statistics. Without being tested on them, many students will not learn proper distinctions.

• Andrew says:

Russell:

I don’t think that “we can be 95% confident” works as a general way to state confidence intervals. To see the problem with this interpretation, consider one of those problems where the 95% confidence interval is sometimes the empty set and sometimes the entire real line. In this case, if the interval happens to be the empty set, should someone say, “We can be 95% confident that the parameter is in the empty interval”? Or, if the interval happens to be the whole line, should someone say, “We can be 95% confident that the parameter is a real number”? I don’t think so. I don’t think it makes sense to say you’re 95% confident about a statement that you know to be false, or you know to be true. The trouble is that the English language already exists so you have to be careful about taking a word such as “confident” and taking it too far from its meaning in the language.
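One simple construction with this behavior (an illustration chosen here, not necessarily the example Andrew had in mind) ignores the data entirely: it reports the whole real line with probability 0.95 and the empty set with probability 0.05. The minimal Python sketch below, with hypothetical function names, checks that such a procedure has 95% coverage under repeated use, even though each realized interval is known with certainty to contain the parameter or to miss it:

```python
import random

def trivial_interval(rng):
    """Return 'real_line' with probability 0.95, else 'empty'. Ignores the data."""
    return "real_line" if rng.random() < 0.95 else "empty"

def covers(interval, theta):
    # The whole real line contains every real theta; the empty set contains none.
    return interval == "real_line"

def coverage(theta, n_reps=100_000, seed=1):
    """Fraction of repeated uses of the procedure whose interval contains theta."""
    rng = random.Random(seed)
    hits = sum(covers(trivial_interval(rng), theta) for _ in range(n_reps))
    return hits / n_reps

rate = coverage(theta=0.25)
print(f"long-run coverage: {rate:.3f}")
```

The long-run rate is a property of the procedure. It says nothing useful about any single interval it produces, which is the gap between the technical term and the everyday word “confident.”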

Regarding the use of standard errors for samples of convenience: this is tricky. The way I will usually interpret these is as inferences for the population for which we can consider the data as a random sample. For example, in this blog we’ve been discussing some problems with “Psychological Science”-type studies that draw general conclusions based on nonrandom samples of college students and people on the internet. And I’ve been pretty firm that with such samples, you’re not necessarily learning about the general population (at least not without some strong assumptions about differences of interest not varying across population subgroups). But I wouldn’t say that standard errors from such nonrandom samples “do not make sense”; rather, I’d say that they can be interpreted as sampling variability relative to the population that the sample represents (which is not, unfortunately, the same as the population of interest).

Finally, I don’t want to get into a big discussion of this, but I would not take the FPP book as an authority on statistics. It’s an introductory textbook and takes liberties so as to be less confusing to students. Intro textbooks do this all the time, so I’m not saying it’s a bad book, but I wouldn’t take its pronouncements too seriously. If I had to take an intro textbook as an authority, I’d go with the book by De Veaux which is a bit more modern, but really you just have to be careful, as any intro book will in some places be simplistically permissive (to give students a sense of how the methods can be applied in practice) and in other places be simplistically restrictive (to give students a sense of how the methods can go awry).

• John says:

I agree Andrew. I never teach my students to associate the word confident with the interval and try to describe it as just a way to label the interval. It could be the “orange” interval but the label we’re using is descriptive of the method.

However, if you genuinely are not in a situation where you can have any further certainty about whether the interval does, or does not contain the true value, then you can know the method you used makes you correct about the interval containing the mean 95% of the time. Some might call that 95% confidence. From a Bayesian perspective you might argue that’s a rare occurrence, or that it never occurs. But that’s a separate philosophical debate. I think that your average undergrad doing a project where they estimate an interval on a fairly large effect probably has pretty good standing to claim 95% confidence, whereas a scientist who estimates an interval containing 0 where there are sound reasons it should not be in the interval can probably exclude some of the range and 95% doesn’t apply post hoc. Perhaps if one is clear that they have the confidence prior to the appearance of the interval and not afterwards things would be ok.

So while I agree with you that the statement is not strictly correct in the general case, and I further agree that it should not be taught as a way of stating the CI, I’m not sure I agree that it’s fairly tested in the given experiment.

• Andrew says:

John:

Yes, I agree that it depends on context. In lots of simple problems, the “we can be 95% confident” formulation is just fine. I was pushing back against Russ’s statement that classical confidence intervals are “the very meaning of ‘confidence’.” But in many cases, all definitions will approximately agree.

But to get back to my point in the above post, I’d argue that we use flat priors (or the equivalent classical procedures) in all sorts of problems where they’re not appropriate. My own thinking on this has indeed changed a lot in recent years, and my latest views have not even fully made their way into the latest version of Bayesian Data Analysis.

• Mark says:

In the case of the procedure that is either the real line or the empty set, I think your concern is about power. This procedure will only reject a false hypothesis 5% of the time. In the sense that tests with poor power should be suspect, you are certainly justified in being skeptical of someone claiming 95% confidence in either a real line or empty set, but I don’t think the language of “confidence” is inappropriate, at least as a term of art in statistical communication.

Mayo and Spanos argue for amending error statistical procedures (i.e., hypothesis tests, p-values, etc) with the concept of “severity.” A severe test combines the pre-data type I and type II properties of the test, along with a post-data assessment of how well other hypotheses could have generated the observed data. Perhaps this is a way forwards.

Mayo, D. G. & Spanos, A. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” The British Journal for the Philosophy of Science, June 2006, 57, 323-357

• Andrew says:

Mark:

I know all about that. My point in the above comment was only regarding the use of the phrase, “we can be 95% confident.” I used a simple example to make the point clear, but it could also be made regarding any confidence interval of varying width. I don’t mind the term “confidence interval” having a specific technical meaning having to do with repeated sampling (indeed, it’s fine with me for that sometimes-empty, sometimes-the-whole-line interval to be called a “confidence interval”), but I do mind the term “we can be 95% confident” used in such settings. To me, the phrase “we can be 95% confident” is already too deeply embedded in English for it to be given this awkward technical interpretation.

• Anonymous says:

I also agree that the word “confident” is encumbered with probabilistic connotations in English that make it difficult to see that students would universally be able to think of it as technical jargon that means something about repeated sampling.

And I am also bothered that textbooks often refer to both the interval estimator and the interval estimate as “confidence intervals”. Wouldn’t it cause less confusion to students to remind them that the .95 probability is for a statement that involves the interval estimator, and not the interval estimate?

• Hi, Andrew.

Let’s stick to what I wrote, at least, at first. Do you disagree with me concerning the case where this paper used the word “confident”? The authors say that the usage is incorrect. I asserted the usage was correct. What is your opinion?

Second, I gave two quotes from FPP concerning use of SEs. Surely you agree with the first (don’t use formulas for SRSs mechanically in other situations). You seem to disagree with the second. Does that mean you believe that “With samples of convenience, standard errors usually do make sense”?

After we settle these issues, we can go on to the additional things you wrote.

Thanks.

• Andrew says:

Russ:

1. I didn’t read the linked paper so I don’t know about that particular usage. What I was objecting to was what I saw as your general implication that “we can be 95% confident” is a correct statement whenever it is referring to a valid confidence interval. But in the particular example in that paper, maybe I’d agree that the usage is ok, I don’t know.

2. I disagree with both statements: “The formulas for simple random samples should not be applied mechanically to other kinds of samples” and “With samples of convenience, standard errors usually do not make sense.” Or perhaps I should say that these statements are not well defined: I don’t know what is meant by “mechanically” nor do I know what population is being referred to by “usually.” I think there are some settings where you can get bad answers by applying simple formulas to nonrandom samples and other settings where you can get good answers. And, as I discussed in my earlier comment, I do think that standard errors can make sense with samples of convenience, as long as they are interpreted appropriately.

Are the two sentences you quoted good advice? Maybe, it depends on how they are interpreted. Again, I’m not saying this makes it a bad book, it just means you have to be careful in taking these statements too seriously. Good advice for an intro book does not always translate into good advice for practice.

• Andrew,

The questionnaire asked this: “… conducts an experiment, analyzes the data, and reports: The 95% confidence interval for the mean ranges from 0.1 to 0.4!”. The respondent then must choose true or false for each of 6 options. I quoted #5. What do you think?

Surely you do not think that a formula for SRS should be used for multistage cluster sampling. This is the kind of thing that FPP mean. They go on to include convenience (and other) samples. This seems to be your concern. You say “But I wouldn’t say that standard errors from such nonrandom samples “do not make sense”; rather, I’d say that they can be interpreted as sampling variability relative to the population that the sample represents (which is not, unfortunately, the same as the population of interest).” Please explain what population a convenience sample represents and in what way an SE calculated as for an SRS makes sense for a convenience sample. You seem to say it represents “sampling variability”; what does that mean for a convenience sample and what does such an SE say about it?

Thanks.

• Andrew says:

Russell:

Actually, the simple random sampling formula is not always so bad, even for multistage cluster sampling. For example, the General Social Survey and the National Election Study both use multistage cluster sampling but it is standard practice to ignore that in analyses. A lot depends on context.
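To make “a lot depends on context” concrete, here is a rough Python sketch under stated assumptions: a toy two-stage design with equal-sized clusters, where each cluster shares a random shift. The function names and all settings are hypothetical, and the sketch says nothing about the actual design effects in the surveys named above. It compares the standard error from the simple random sampling formula with one that treats cluster means as the independent units:

```python
import random
from statistics import mean, stdev

def simulate_clustered_sample(n_clusters, cluster_size, between_sd, within_sd, rng):
    """Equal-sized clusters; every unit in a cluster shares one random shift."""
    clusters = []
    for _ in range(n_clusters):
        shift = rng.gauss(0.0, between_sd)
        clusters.append([shift + rng.gauss(0.0, within_sd) for _ in range(cluster_size)])
    return clusters

def srs_se(clusters):
    """Standard error from the simple-random-sampling formula, ignoring clusters."""
    values = [y for c in clusters for y in c]
    return stdev(values) / len(values) ** 0.5

def cluster_se(clusters):
    """Standard error treating cluster means as independent units (equal sizes)."""
    cluster_means = [mean(c) for c in clusters]
    return stdev(cluster_means) / len(cluster_means) ** 0.5

rng = random.Random(3)
for between_sd in (0.0, 0.2, 1.0):
    data = simulate_clustered_sample(50, 20, between_sd, 1.0, rng)
    ratio = cluster_se(data) / srs_se(data)
    print(f"between-cluster sd {between_sd}: cluster SE / SRS SE = {ratio:.2f}")
```

When units within a cluster are barely more alike than units across clusters, the two standard errors roughly agree and the simple formula is a reasonable approximation. When clusters differ a lot, the simple formula understates the uncertainty.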

Regarding your question about convenience samples: consider the recent example of a psychology paper that was based on a survey of 100 people from Amazon Mechanical Turk. I’d treat these as a sample from the population of people on Mechanical Turk who would respond to this sort of solicitation. This population is not clearly defined but I still find it helpful to think about the respondents as a sample. I’m much more interested in the population, however poorly defined it is, than in the 100 people in the sample. Also, sometimes it is possible to adjust for the mismatch between different populations; see this paper on forecasting elections with nonrepresentative polls.

Let me emphasize that I am not saying that the textbook you are citing is giving bad advice. In a book of that sort it is important to emphasize principles and there isn’t really the space to explain all the complexities of real statistics. Similarly, an intro physics textbook will talk about frictionless surfaces. This is fine. It’s not a criticism of a textbook to say that it oversimplifies. You just have to be careful to realize that the simplification is there.

• Andrew,

I will look at the paper you link to. Meanwhile, on the issues of the paper of the blog post, here is their sample:

“Our sample consisted of 442 bachelor students, 34 master students, and 120 researchers (i.e., PhD students and faculty). The bachelor students were first-year psychology students attending an introductory statistics class at the University of Amsterdam. These students had not yet taken any class on inferential statistics as part of their studies. The master students were completing a degree in psychology at the University of Amsterdam and, as such, had received a substantial amount of education on statistical inference in the previous 3 years. The researchers came from the universities of Groningen (n = 49), Amsterdam (n = 44), and Tilburg (n = 27).”

What population is relevant here? How are the SEs of SRSs relevant?

You did not yet give your opinion on their option 5.

You say, “the simple random sampling formula is not always so bad”. This does not contradict in any way my quote from FPP. You appear to disagree, but I am unclear why you want to give that appearance.

You speak of “simplifications”, but can you point to one in the quotes I gave? You are always arguing against a straw man, as far as I can tell.

Thanks.

• Andrew says:

Russ:

1. I would think of the 500 people in that study as being representative of some hypothetical population of Dutch psychology students and researchers. The standard errors are relevant because a different study could have reached a different sample of students.

2. Again, I disagree with the statement, “The formulas for simple random samples should not be applied mechanically to other kinds of samples,” because I and other practitioners use the formulas for simple random samples all the time and we just about never actually have simple random samples. And I disagree with the statement, “With samples of convenience, standard errors usually do not make sense.” Or maybe, on that one, I have no idea what the textbook writers meant by “usually.” If I define “usually” as “most of the examples I’ve seen,” I’d say that standard errors usually do make sense, but they should be interpreted with respect to the hypothetical population represented by the sample.

If someone reads those sentences from the textbook (or in this blog discussion) and takes away the message, “Hey, I should be careful about departures from simple random sampling! To the extent that my sampling design is complicated, or (more likely) that I have selection in response and selection in nonresponse, I should be careful about how to interpret estimates and standard errors,” then that’s fine with me.

If someone reads those sentences from the textbook (or in this blog discussion) and takes away the message, “If it’s not a simple random sample, the formulas are probably useless. If the sample is self-selected, standard errors are probably useless,” then that bothers me.

• Andrew,

I’ve now read that paper you link to. It is pretty nice, though I did not understand all the details. The graphics are excellent. I have 3 questions, however:

1. How do your forecasts compare to Nate Silver’s?

2. How is this related to the issue of using SEs from SRSs? Is that what you used? I think you used SEs from your model.

3. Do you propose attempting similar techniques in several situations where one can discern how well they do, before using them in new situations where they do not have competitors?

• Andrew says:

Russ:

1. I haven’t looked at Nate’s forecasts in detail but I think they’re comparable to those from pollster.com. The point in our paper is not that our Xbox estimates are better than what is obtained by combining a bunch of telephone polls (we happened to get a better result on election day but that’s N=1) but just that our results are comparable. Comparisons to Nate’s forecasts would look similar.

2. The estimates and standard errors from our model came through assuming simple random sampling within poststratification cells, an assumption which is obviously false. This (and every other random sampling assumption) is also false in every telephone poll and every face-to-face poll. All surveys have problems of undercoverage and nonresponse. The random sampling model is a mathematical convenience but is false.

3. I think new techniques should and will be developed in parallel with old techniques. All these methods have flaws. The newspapers keep reporting exit polls and they’re way way off, I recall hearing that they overestimated the Democrats’ share in recent elections by 10 or 15%.
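The assumption in point 2 can be illustrated with a bare-bones poststratification calculation. This is only a sketch: it uses raw cell means rather than model-based cell estimates, and the cell labels, population shares, sample sizes, means, and standard deviations are all made up for illustration.

```python
# Hypothetical cells: (population share, sample size, sample mean, sample sd).
# The sample over-represents the first cell relative to the population.
cells = {
    "under_30": (0.20, 600, 0.62, 0.49),
    "30_to_64": (0.55, 350, 0.50, 0.50),
    "65_plus": (0.25, 50, 0.40, 0.49),
}

def raw_estimate(cells):
    """Unadjusted sample mean, which weights cells by their share of the sample."""
    n_total = sum(n for _, n, _, _ in cells.values())
    return sum(n * ybar for _, n, ybar, _ in cells.values()) / n_total

def poststratified_estimate(cells):
    """Cell means reweighted by each cell's share of the population."""
    return sum(share * ybar for share, _, ybar, _ in cells.values())

def poststratified_se(cells):
    """Standard error assuming simple random sampling within each cell."""
    variance = sum(share**2 * sd**2 / n for share, n, _, sd in cells.values())
    return variance ** 0.5

print(f"raw sample mean:         {raw_estimate(cells):.3f}")
print(f"poststratified estimate: {poststratified_estimate(cells):.3f}")
print(f"poststratified SE:       {poststratified_se(cells):.3f}")
```

The standard error formula is valid only to the extent that respondents within a cell behave like a random sample from that cell, which is exactly the assumption described above as convenient but false.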

• K? O'Rourke says:

Andrew: > I would not take the FPP book as an authority on statistics

I would and have. I was stuck with it, when teaching undergrads at Duke (they had used it for many years before).
It is an introductory book but it is surprisingly subtle and deep. For instance, it has a chapter on the strict Fisher null and an appendix entry on the Neyman null. (The teacher’s manual helps clarify what they were up to which may not be that apparent on a first reading.)

It is badly out of date (pre computer era), they did overstate the “no inference without randomisation” and they appeared Bayes phobic (but maybe largely due to pre computer era.) The confidence word, I believe was introduced as a non-wrong way to refer to the usualness properties of the procedure for a particular interval. The first author was very aware of relevant subsets and other failures of confidence interval procedures http://en.wikipedia.org/wiki/David_A._Freedman_(statistician)

Senn points to what I believe is the more important problem – what do people working with statistics actually think they are doing and especially why. The wording of the 6 false statements, to me, suggests Morey et al were not really assessing that. What would be better would be anonymous short interviews of researchers and statisticians. I believe that would confirm Senn’s fears or even worse.

• K?,

Could you please elaborate on your comment, “It is badly out of date (pre computer era)”? In my opinion, adding computers would not make it a better textbook for the intended audience, but only make it worse. However, I don’t know what you are thinking, so I won’t detail why until I hear your reasons.

Thanks.

• Andrew,

There seems to be a limit on nesting, so I am replying to my original post.

You write:

1. I would think of the 500 people in that study as being representative of some hypothetical population of Dutch psychology students and researchers. The standard errors are relevant because a different study could have reached a different sample of students.

2. Again, I disagree with the statement, “The formulas for simple random samples should not be applied mechanically to other kinds of samples,” because I and other practitioners use the formulas for simple random samples all the time and we just about never actually have simple random samples.

As for #1, why would anyone be interested in a hypothetical population? They seem to be imaginary. In what way is the sample representative of that hypothetical pop.? How do the SEs help in assessing how different the results may be for a different sample?

As for #2, that is pretty interesting, but practice does not make perfect; more precisely, what is the justification?

Finally, these points are about Mark’s point. What about my point? You have still not given your opinion on option 5.

Thanks.

• Mark (different one) says:

I agree with Russell.

• Andrew says:

Russ:

Regarding textbook presentations, consider the analogy of intro physics, and consider two ways of writing things:

(a) We assume a frictionless surface. In real-world situations of objects on Earth, there will be friction. The equations in this book can still be used but you have to be careful because if there’s a lot of friction, you’ll need to account for it. It’s possible to account for friction using the same laws of physics but the models get more complicated.

or

(b) The expressions in this book should not be applied mechanically if there is friction. When there is friction, these formulas usually don’t make sense.

I prefer (a) to (b), but the sentences you gave me from that textbook look more like (b). Now, I see some pedagogical advantages to (b)—for one thing, it’s shorter, which counts for a lot right there—but as a practitioner, it bothers me because I almost never work with random samples, just as real-world engineers almost never work in the absence of friction.

You comment that “practice does not make perfect,” but that’s not the point. If you really required mechanical perfection, you’d have to discard essentially every survey of human populations that’s ever been done. Response rates for surveys can be 10% or lower—but even if response rates were 50% or even 90%, that’s still not random sampling.

Again, the simple random sampling formula is not always so bad, even for multistage cluster sampling. For example, the General Social Survey and the National Election Study both use multistage cluster sampling but it is standard practice to ignore that in analyses. A lot depends on context. But if you were to take textbook warnings too readily, you might have to give up on analyzing these two very important surveys (which use cluster sampling but give very little detail to users on the clustering in the design or the data). The real world has friction but simple Newtonian mechanics ignoring friction can still do a lot for us.

Regarding your quote from the linked paper, you write that it’s ok to say “We can be 95% confident that the true mean lies between 0.1 and 0.4.” You ask me if I think that statement is a mistake. My answer is complicated. In the context of a multiple-choice quiz, I can see how a student would give that answer. Strictly speaking, I don’t think the statement is in general correct. I think the correct statement is “Under repeated sampling, there is a 95% probability that the true mean lies between the lower and upper bounds of the interval.” Once you put in the actual values 0.1 and 0.4, it becomes a conditional statement and it is not in general true. If a student gave that answer, I wouldn’t be super-upset, but if you had to pin me down, I’d say it’s not quite right and indeed that it represents a misunderstanding. The paper of mine that I link to in the above post gives some sense why I’m bothered by the statement. In short: I and others have sometimes reacted to the well-known problems of hypothesis testing and p-values by suggesting that researchers report confidence intervals instead. But, for many problems, confidence intervals have problems too.
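The repeated-sampling statement can be checked by simulation. The sketch below is a minimal illustration under assumptions chosen here, not taken from the paper: normally distributed data, a known standard deviation, and a true mean of 0.25. It counts how often a 95% interval computed from a fresh sample contains the true mean.

```python
import random
from statistics import NormalDist, mean

def z_interval(sample, sigma, level=0.95):
    """Known-sigma confidence interval for a normal mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(sample) ** 0.5
    center = mean(sample)
    return center - half_width, center + half_width

def simulate_coverage(true_mean=0.25, sigma=1.0, n=30, n_reps=20_000, seed=2):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma)
        hits += lo <= true_mean <= hi
    return hits / n_reps

coverage_rate = simulate_coverage()
print(f"fraction of intervals containing the true mean: {coverage_rate:.3f}")
```

The 95% describes the loop over repeated samples. Once a single interval such as 0.1 to 0.4 is in hand, the true mean is either inside it or not, and the simulation offers no probability for that particular interval.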

• chuck says:

As an engineer and sometimes teacher, I prefer (a)”We assume a frictionless surface. In real-world situations of objects on Earth, there will be friction. The equations in this book can still be used but you have to be careful because if there’s a lot of friction, you’ll need to account for it. It’s possible to account for friction using the same laws of physics but the models get more complicated.”

It is longer but gives the student a hint that there is a solution and the solace that the models they are studying are relevant—but perhaps missing a few terms that complicate the exposition.

• Andrew,

The quotes you are discussing from FPP came from the summary page at the end of the chapter. There was much more discussion in the text of the issues.

My phrase “practice does not make perfect” was a joke. The meaning is that just because you “use the formulas for simple random samples all the time and we just about never actually have simple random samples” does not mean it is justified. I clarified to ask about the justification. So far, the only justification you have given is that it allows people to write papers. That’s not a very strong justification. In terms of physicists, who approximate for a living, they have experiments to justify what they do.

You have discussed the term “confidence interval” and you say that the word “confident” should not be used, disagreeing with me (and FPP). I gave a definition of this term, which made it correct usage. What is your definition?

Thanks.

• Andrew says:

Russ:

I appreciate your patience in continuing this discussion.

I think we’re talking here about the use of formulas derived under the assumption of random sampling to be used for nonrandom samples. My justification of this of course is not “that it allows people to write papers.” Let me repeat: almost every sample I’ve ever seen is not random. Every once in a while I’ve been involved in simple settings with exact random sampling, for example when there is a long list of legal records to investigate, and we take a random sample from the list and only the records in that sample are investigated, but that’s an unusual situation. Samples of human populations are essentially never random.

As with just about any application of mathematics to the real world, there are different justifications for making an approximation. Certain methods are validated, for example we can compare pre-election polls to election outcomes. Other times we can ask questions of different populations and in different ways and look for convergence. Such convergence does not always happen, for example there remains a debate about the so-called U-curve in happiness (we’ve discussed this on the blog from time to time as you can see by searching the blog on *happiness* or *Andrew Oswald*, for example), which appears in some analyses but not in others. Another example would be the surveys used by the government to estimate employment and unemployment. These surveys are not perfect; like all surveys of human populations they have problems of measurement error and missing data. And there are lots of crappy surveys out there too. Hence I see the value in a textbook of warning users not to mechanically apply a formula without understanding it. It’s a good thing to understand the limits of any approximation. But I disagree with the statement “The formulas for simple random samples should not be applied mechanically to other kinds of samples,” because to me it seems to imply that simple random samples really exist in the real world. Again, consider the friction analogy. Rather than saying, “The formulas for frictionless Newtonian mechanics should not be applied mechanically to problems with friction,” I’d rather say, “Just about all problems have friction, thus the formulas for frictionless Newtonian mechanics are necessarily an approximation. In more advanced courses you’ll learn how you can approximately adjust for friction. For now, be warned, and recognize that by learning the simplest case you are getting both a conceptual understanding of mechanics, along with some tools that should work well in those problems where friction is relatively minor.” The statistics version might be, “Just about all surveys are nonrandom samples, thus the formulas for simple random samples are necessarily an approximation. In more advanced courses you’ll learn how you can approximately adjust for nonrandom sampling. For now, be warned, and recognize that by learning the simplest case you are getting both a conceptual understanding of sampling, along with some tools that should work well in those problems where the nonrandomness of the sampling is relatively minor.”

You write that physicists “approximate for a living.” So do statisticians! So does the entire staff of the Gallup Poll, Pew Research, the Bureau of Labor Statistics, the Bureau of Justice Statistics, and the U.S. Census Bureau. Laplace was a statistician and approximated for the French government back around 1800.

Finally, you ask what my mathematical definition is of “We can be 95% confident.” I don’t think I can give a definition that will satisfy you and also satisfy my understanding of the English language. My sense is that, in English, saying that you’re 95% confident that a parameter is between 0.1 and 0.4 is a statement about that particular parameter. If I wanted to make a statement that is valid on average over repeated sampling, I’d either specifically say something like “With repeated sampling…” or else I’d use a term such as “confidence interval” or “statistical rejection region” that seems to me to clearly imply the use of a technical term. To me, the word “confident” aligns too much with common English usage and I’m not comfortable trying to overwrite that usage with a technical meaning. In contrast, the phrase “confidence interval” seems more clearly technical and I’m OK with it being used in a technical sense (although of course it does confuse people, but that’s a separate issue).

• Andrew,

There are two main points I/we are discussing, Mark’s and mine. You had additional points that are related.

Concerning Mark’s point, allow me to quote the paper under discussion: “Before proceeding, it is important to recall the correct definition of a CI. A CI is a numerical interval constructed around the estimate of a parameter. Such an interval does not, however, directly indicate a property of the parameter; instead, it indicates a property of the procedure, as is typical for a frequentist technique. Specifically, we may find that a particular procedure, when used repeatedly across a series of hypothetical data sets (i.e., the sample space), yields intervals that contain the true parameter value in 95 % of the cases. When such a procedure is applied to a particular data set, the resulting interval is said to be a 95 % CI.”
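The repeated-sampling definition quoted here can be checked by simulation. The following is a minimal sketch, not anything from the paper: it assumes normally distributed data with a known standard deviation, and the function names `z_interval` and `coverage` and all the numbers are illustrative choices.

```python
import random
from statistics import NormalDist, mean

def z_interval(sample, sigma, level=0.95):
    """Known-sigma interval for a normal mean: xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width

def coverage(true_mean, sigma, n, reps, seed=1):
    """Fraction of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma)
        hits += lo <= true_mean <= hi   # True counts as 1
    return hits / reps

if __name__ == "__main__":
    # Hypothetical settings: the model assumptions hold exactly here.
    print(coverage(true_mean=0.25, sigma=1.0, n=30, reps=10000))
```

The 95% figure describes the long-run behaviour of `z_interval` across samples. Any single interval it returns either contains the true mean or does not.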

Mark’s point is that the paper did not use “95% CI” in that way, i.e., they did not practice what they preach. What do you think? If you agree with Mark (and me), do you say that they did well in practicing differently than they preach, and, perhaps, that they ought to preach differently? This relates to my question asking you to justify practice. Do you say that the way you use it, 95% CIs do in fact cover the parameter 95% of the time? If so, how do you know? If not, what does “95% CI” mean, other than that in another situation, like SRS, it would cover it 95% of the time? I mentioned physicists because you did and because they have a clear way to test their approximations (as do chemists and engineers, among others). Do you also have a clear way? If not, would you advise inflating a CI to be safe?

My point, though, was that “confident” is an adjective that applies to people to mean the same thing that “confidence” means when used as an adjective to apply to “interval”. You feel “confident” should not be used; that’s fine. However, as others have said, I think it cannot be helped; it is too late. The important point is that it is not the same as probability, just as the paper under discussion says.

Thanks.

• Andrew says:

Russ:

No, my problem is that people are using the expression “95% confident” for confidence intervals but then the resulting statements are being taken as probability statements. It is a misunderstanding that I think could be reduced by appending “Under repeated sampling” to all such statements. Of course if misleading uses indeed cannot be helped, then we’re stuck. But in that case I think it can be useful for researchers and educators to point out the ways in which people can get confused, so that, for example, writers of reports can avoid the use of expressions that are likely to mislead.

The background to all this is that p-values are notorious for being easy to misinterpret (for example, being described as the probability that the null hypothesis is true) and, for many years, it’s been suggested that we replace hypothesis tests by confidence intervals as the latter more directly convey uncertainty. But confidence intervals have their own problems, as noted in my paper that I link to in the above post. I don’t necessarily agree with everything in the Hoekstra et al. paper but I thought it worth linking to because it related to this important discussion.

Finally, regarding physicists, chemists, and engineers: they can test their assumptions, and so can statisticians. There’s been lots of work in all these fields on making predictions and then testing them. It’s well known that 95% confidence intervals in physics do not cover the true parameter 95% of the time. And this is no surprise because these intervals are based on assumptions which are not quite correct. Even when researchers try to correct for measurement error, there are typically other sources of measurement error that get missed. Just as 95% confidence intervals don’t really contain the truth 95% of the time, similarly, Bayesian posterior intervals don’t contain the truth 95% of the time either, as Bayesian intervals too are based on assumptions. Probabilistic statements are model-based. And, yes, in some contexts such predictions can be calibrated and the probabilities adjusted. These are standard problems in applied statistics.
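A small simulation can illustrate how a missed source of error lowers coverage. This is a sketch under invented assumptions: each experiment carries a systematic offset (drawn with standard deviation `bias_sd`) that is shared by all of its readings and is not in the model used to build the interval. The function name and the numbers are hypothetical.

```python
import random
from statistics import NormalDist, mean

def coverage_with_bias(bias_sd, sigma=1.0, n=25, reps=10000, level=0.95, seed=2):
    """Coverage of the nominal interval when each experiment has an unmodeled offset."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5      # accounts for random error only
    true_value = 0.0
    hits = 0
    for _ in range(reps):
        bias = rng.gauss(0.0, bias_sd)     # systematic error shared by all n readings
        xbar = mean(rng.gauss(true_value + bias, sigma) for _ in range(n))
        hits += abs(xbar - true_value) <= half_width
    return hits / reps

if __name__ == "__main__":
    for bias_sd in (0.0, 0.1, 0.2):
        print(bias_sd, coverage_with_bias(bias_sd))
```

With `bias_sd=0.0` the assumptions hold and coverage sits near the nominal level. With a systematic error as large as the standard error (0.2 in this setup), the normal calculation gives coverage of roughly 83%.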

• Andrew,

Once again I must reply to one of my own posts.

Perhaps you will appreciate the use of “confident” by a book for lawyers:

In any case, naturally I agree with you that the term is subject to misinterpretation, and that such misinterpretation often occurs. Indeed, as you well know, this is the case for many other terms in statistics. However, we do not give up all such terms. And of course it is important to point out misuses; indeed, FPP is the best book for doing so. The question was not whether “confident” was the best term to use in a report, but whether it was a correct way to rephrase a certain statement.

I know that 95% CIs that physicists give don’t always work, but that’s an extremely small part of the ways that physicists approximate things. Most of what they do does not involve statistics at all, except in repeated experiments. The theory itself is an approximation; even when it is thought to be exact, it is generally used only by making approximations. That’s what I was referring to.

I am still interested in your answers to the questions I asked in my previous post. I was trying to ask very precise questions in order to focus the discussion.

Thanks.

• Oops; I tried to use html for a link but goofed. Here it is (I hope): Statistics for Lawyers
By Michael O. Finkelstein, Bruce Levin

• Andrew says:

Russ:

You ask: “Do you say that the way you use it, 95% CIs do in fact cover the parameter 95% of the time?”

My response: To the extent I used 95% confidence intervals, no, I do not think they cover the parameter 95% of the time. I don’t think other people’s confidence intervals cover the parameter 95% of the time, and for the same reason, that the coverage is conditional on a model, and the models are violated.

You ask: “If not, what does ‘95% CI’ mean, other than that in another situation, like SRS, it would cover it 95% of the time?”

My response: “95% CI” refers to a procedure that is intended to have that coverage conditional on a set of assumptions. For example, the assumptions of random sampling essentially never occur in real surveys of human populations, but these assumptions can still be useful in helping us develop and understand statistical procedures.

You ask: “Do you also have a clear way [to test approximations]”?

My response: Yes, we can and do test approximations in various ways. This is something that statisticians have been doing for a long time. One way to test approximations is internally by checking the model’s fit to data; another way is to test externally by looking at calibration of predictions. I’ve published papers using both these methods.

In short, statistics is applied mathematics. Statistical methods rely on assumptions which are generally false but sometimes are reasonable approximations.

• Andrew,

Thanks for your direct replies. You did not answer my first pair of questions, but I infer that you feel that the authors used CIs in an acceptable fashion. For example, they say, “The mean numbers of items endorsed for … researchers were … 3.45 (99 % CI = [3.08, 3.82])”. We discussed earlier their sample. You say, “‘95% CI’ refers to a procedure that is intended to have that coverage conditional on a set of assumptions.” We know that those assumptions were not met even approximately. So could you please tell me of what use this interpretation is in this case? Of what value is [3.08, 3.82] here (which, btw, is a “99% CI”, not 95%)? I simply don’t see how speaking of 95% (or 99%) of similar samples would be useful, unless you believe that if other researchers used their own convenience samples, they would obey these statistics approximately.

You also say that 95% CIs do not cover the parameter 95% of the time. How often would you say they do and how do you know?

Thanks.

• Andrew says:

Russ:

The less-than-nominal coverage of confidence intervals in physics is well known; see, for example, the classic papers by Youden (1962) and Henrion and Fischhoff (1986) referred to in section 2 of this paper. As I wrote earlier, it’s not a surprise; of course there’s going to be measurement error that’s not included in the model.

• Andrew,

I know it is well known. It is known even by me, as I wrote. Perhaps you misread what I wrote while catching up on my posts.

Looking forward to your response to my last post (where the topic is the coverage of CIs for samples of convenience).

Thanks.

• Andrew says:

Russ:

You wrote, “You also say that 95% CIs do not cover the parameter 95% of the time. How often would you say they do and how do you know?”

I wrote, “The less-than-nominal coverage of confidence intervals in physics is well known; see, for example, the classic papers by Youden (1962) and Henrion and Fischhoff (1986) referred to in section 2 of this paper.” That’s about the best I can do for you. I don’t know of any global study by which I could say, “We estimate that 72% of all published 95% intervals contain the true value.”

Regarding convenience samples, there is a population of interest and there is a population that is being sampled from. Differences between these populations can be important. A study of 100 self-selected internet participants can be representative of the sort of people who would select into that sort of survey, without being representative of the general U.S. population.

Remember that essentially all samples of human populations are convenience samples. The statistical theory of sampling is developed based on random samples, balls from urns and all that, but real samples are not random in that sense. Just like real wheels have friction. I have the feeling that nothing I will say here will satisfy you, but in the meantime we’ll still be using data from the Bureau of Labor Statistics, General Social Survey, etc. Yes, these surveys have nonsampling error, but I’m glad that users also report standard errors of sampling as well!

• Anonymous says:

Interesting… I would have thought that the statement on the questionnaire that would have caused the most discussion among readers of this blog would have been:

“3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect.”

I would have said that statement was TRUE without even looking at the interval estimate! That isn’t a probabilistic statement, either, so I am not sure why the authors of the paper characterize it as one.

Another statement,

“2. The probability that the true mean equals 0 is smaller than 5%.”

I might also declare that one TRUE, even before referring to the interval estimate. Even when I put on my “frequentist” hat, I don’t have trouble assigning probabilities to statements involving the mean parameter, but my choices for those probabilities would be limited to either zero or one. In this case, zero.

4. John says:

The paper makes some good points and it does demonstrate there are issues with the understanding of CIs. Unfortunately, it takes it too far and overstates its case. The first thing to note is that, compared to the comparison p-value paper, the problems are fewer. Some of those test statements are pretty much directly out of textbooks and people are generally less educated about CIs or haven’t even seen them before. So the gap may be able to be made wider with education. But most importantly, one of their answers can be interpreted as correct, or at least not definitely incorrect.

The “null hypothesis” that the true mean equals 0 is likely to be incorrect.

No, that’s not the literal definition of CI (which is the only interpretation they want to accept). But it’s not wrong based on the information given. It’s just not definitely correct.

One context in which CIs have come up is research showing that the literature has more findings right at or just below p<.05 than would be expected. The authors of an article on this phenomenon recommend that CIs and effect sizes be given greater attention as a possible way to remediate the situation.

The life of p: “Just significant” results are on the rise
Nathan C. Leggett, Nicole A. Thomas, Tobias Loetscher, Michael E. R. Nicholls
The Quarterly Journal of Experimental Psychology
Vol. 66, Iss. 12, 2013

However, in consideration of the findings mentioned here, that does not seem likely to work out terribly well.

I came across a paper that appears to have this property. The results of interest are significant conventionally, but appear to be just barely so. This paper does have several admirable features (e.g. longitudinal data) and does not look like it's p-hacking or anything.

Sex differences in the implications of partner physical attractiveness for the trajectory of marital satisfaction.
Meltzer, Andrea L.; McNulty, James K.; Jackson, Grace L.; Karney, Benjamin R.
Journal of Personality and Social Psychology, Vol 106(3), Mar 2014, 418-428. doi: 10.1037/a0034424

The authors presented effect sizes, but did not really discuss them. On the one hand, it would seem better to have presented CIs along with p values, and I requested them. However, reporting numbers without much understanding and forgoing discussion does not seem like it would be all that informative for readers. One other important consideration in terms of CIs and what researchers do and do not know about them is the editorial process. If journal editors and referees cannot make sense of CIs, then the promise that many scientists see in reporting CIs to deal with the p value problems is unlikely to materialize.

6. Andrew,

I don’t think confidence intervals that are empty or the entire real line pose such big problems of interpretation. Let me try rewording your versions above.

You say: ‘…if the interval happens to be the empty set, should someone say, “We can be 95% confident that the parameter is in the empty interval”?’

How about: “We can be 95% confident that there is no value for the parameter that is consistent with the assumptions of our model.”

You say: ‘…if the interval happens to be the whole line, should someone say, “We can be 95% confident that the parameter is a real number”?’

How about: “We can be 95% confident that we can say nothing useful about the parameter based on our data and model.”

These versions seem to address your concern that it doesn’t make sense to say “you’re 95% confident about a statement that you know to be false, or you know to be true.”

I confess I’m not entirely happy with the second reformulation. I am looking for something that corresponds to “unidentified”, and “say nothing useful about” is the best I could think of just now. But the first one seems very reasonable to me; it’s basically how one would interpret a specification test that rejects the model with a p-value of 5%. An empty confidence interval implies that the model itself is questionable.

Confidence intervals that are empty or the entire real line come up in the literature on “weak identification”. The paper that I have at hand is Stock and Wright, “GMM with Weak Identification”, Econometrica, Vol. 68, No. 5, Sept 2000, DOI: 10.1111/1468-0262.00151. There is a brief discussion on pp. 1064-5 about this with respect to the Anderson-Rubin (1949) statistic and its GMM extension.

–Mark

• Andrew says:

Mark:

I agree with you. The intervals are what they are. I just don’t like the “we can be 95% confident” formulation but I agree that they can be described accurately.

Intervals that can be empty do run into problems of interpretation, though; see here.

• Andrew,

I don’t think the problems you point to in that blog entry are huge, either, at least in principle. You say

“…when you can reject the model, the confidence interval is empty. That’s ok since the model doesn’t fit the data anyway. The bad news is that when you’re close to being able to reject the model, the confidence interval is very small, hence implying precise inferences in the very situation where you’d really rather have less confidence!”

which is quite right. But this is a limitation that follows from looking at just one confidence level/interval.

The way to deal with the problem is to look at more than one level, and (I expect you’ll like this) the easy way to do that is graphically: plot the rejection probability R (=1-p) on the vertical axis and the hypothesized parameter value theta on the horizontal axis. The (say) 95% confidence interval is the region of theta where R<0.95.

You can get a very small CI if R just dips below 0.95 before coming back up, and this is the worrisome case you mention, where the model itself is questionable. If you chose 90% instead of 95% for your CI, it would be empty. On the other hand, you might get a very small CI if R plunges down almost to 0 before zooming back up again. This is the case where you do indeed have a precise estimate and the validity of the model isn't such a worry.

The difference between the two cases is easy to see graphically. If your particular estimation is computationally intensive, you could just report CIs for a range of levels. But the graphs are nicer on the eye as well as more informative. And if you have not-so-nice cases such as where R doesn't have a simple inverse-U shape and things like disjoint CIs are possible, graphing R is the best way of seeing what's going on.

–Mark
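The graphical idea in this comment can be sketched numerically. This is a toy setup and not the Stock and Wright model: it assumes two independent measurements `x1` and `x2` of the same parameter, each with known standard deviation `sigma`, and a joint test statistic that is chi-square with 2 degrees of freedom under the null. The data values and function names are made up for illustration.

```python
import math

def rejection_prob(theta, x1, x2, sigma=1.0):
    """R(theta) = 1 - p for the joint test that both measurements have mean theta."""
    s = ((x1 - theta) ** 2 + (x2 - theta) ** 2) / sigma ** 2
    return 1.0 - math.exp(-s / 2.0)   # chi-square(2) CDF evaluated at s

def confidence_set(x1, x2, level, grid):
    """Grid values of theta with R(theta) < level; None if every value is rejected."""
    kept = [t for t in grid if rejection_prob(t, x1, x2) < level]
    return (kept[0], kept[-1]) if kept else None

grid = [i / 100 for i in range(-400, 401)]
datasets = {"agreeing": (0.0, 0.2), "conflicting": (-1.7, 1.7)}

if __name__ == "__main__":
    for label, (x1, x2) in datasets.items():
        r_min = min(rejection_prob(t, x1, x2) for t in grid)
        print(label, "minimum of R = %.3f" % r_min)
        for level in (0.90, 0.95, 0.99):
            print("   level", level, "->", confidence_set(x1, x2, level, grid))
```

For the conflicting pair, R only just dips below 0.95, so the 95% set is short and the 90% set is empty. For the agreeing pair, R falls nearly to 0 and the sets at all three levels are wide. Reporting the sets at several levels, or plotting R against theta, separates the two situations.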

• Richard D. Morey says:

Mark, the problem with the formulation of “confidence” in terms of “usefulness” or “consistency with assumptions” is that this is not what a confidence interval tells you. There are examples of confidence intervals, for instance, that are perfectly reasonable frequentist confidence intervals – that is, they are “short” intervals, they have proper coverage, etc. – but that include mostly values that are known to be *impossible* on the basis of the model and the data. The same kinds of confidence intervals might sometimes exclude almost all of the values you know to be plausible (and the values outside the interval are just as plausible as the ones inside!).

A confidence interval is an algorithm for computing two numbers such that, if one used the algorithm in repeated samples, the true value would be included in the interval X% of the time, and nothing more.

• I don’t think that’s the only way to look at what a confidence interval is. As Larry Wasserman put it in another discussion on Andrew’s blog (http://andrewgelman.com/2013/06/24/why-it-doesnt-make-sense-in-general-to-form-confidence-intervals-by-inverting-hypothesis-tests/#comment-147455):

“Every test defines a confidence interval and
every confidence interval defines a test.
Every confidence interval can be viewed as inverting a test.”

Can’t you interpret an empty (1-alpha) confidence interval as the result of: (1) the inversion of a specification test at the alpha significance level, where the hypothesis tested is that all the restrictions of the model (including the hypothesized value of theta) are satisfied, and (2) the test is rejected for any value of theta?

That’s how the Stock & Wright (2000) Econometrica paper I cited above interprets empty confidence intervals. In their multiple-parameter setting, an “S-set” is a confidence set for the multiple thetas. At the bottom of p. 1064: “If the model is misspecified so that the overidentifying restrictions [included in the model] are invalid, S-sets can be null.” Seems a perfectly reasonable, and conventional, interpretation.

–Mark

PS: Andrew, apologies – when I looked at the thread from the 2013 blog discussion, I found I was making the points about graphical interpretations of empty confidence sets. Should have linked to/referenced that the first time around, I think.

• More apologies – I didn’t spell out the Stock & Wright (2000) example properly. In their setup, the null hypothesis is a joint one: (a) theta=theta0, and (b) the overidentifying restrictions are satisfied, where theta0 is some hypothesized value for the parameter vector theta. The S-set for theta can be empty if there are no possible values for theta0 such that the test doesn’t reject.

–Mark

• Richard D. Morey says:

> Can’t you interpret an empty (1-alpha) confidence interval as the result of: (1) the inversion of a specification test at the alpha significance level, where the hypothesis tested is that all the restrictions of the model (including the hypothesized value of theta) are satisfied, and (2) the test is rejected for any value of theta?

Sure, but in order to get the interpretation of CI in terms of “usefulness” or “consistency with assumptions”, you’ve got to interpret the *significance test* in terms of “usefulness” or “consistency with assumptions”. That’s not the way significance tests should be interpreted, either.

I agree that CIs can be interpreted in terms of inverted tests, but that’s really just a restatement of the idea of confidence since the test itself is defined in terms of how often the true value will be rejected.

• Seems to me you certainly [sic – sorry, couldn’t resist] can interpret a significance test in terms of “consistency with assumptions”. For example, that’s how I interpret what Stock and Wright do in their 2000 Econometrica paper. Their test/confidence sets aren’t new – their paper generalizes the test and confidence sets introduced by Anderson and Rubin in 1949 (!) (full text open access: https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730090).

In a simplified version of the Stock-Wright example:

H0: (a) theta=theta0 and (b) E(Ze)=0 [the latter is the assumption that the Zs are orthogonal to the error e in a regression model]

If you reject H0 at some chosen significance level, the interpretation is that you are rejecting (a), (b), or both.

Hence S&W write on p. 1064: “The S-sets consist of parameter values at which one fails to reject the joint hypothesis that theta=theta0 _and_ that the overidentifying restrictions are valid. This has some appealing consequences, but also requires care in interpretation. If the model is misspecified so that the overidentifying restrictions [the E(Ze) assumption above] are invalid, S-sets can be null.”

So here’s a legit example of interpreting a “significance test” in terms of “consistency with assumptions”. Seems perfectly reasonable and fairly conventional in frequentist terms.

–Mark
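A toy version of a joint test can show both properties under discussion: the inverted-test set keeps its nominal coverage when the assumptions hold, and it comes back empty more often when the extra assumption fails. This sketch is a stand-in for the S-set idea and not the Stock and Wright procedure. It assumes two independent measurements with known `sigma`, where assumption (b) is that both share the same mean. The `offset` parameter, which breaks (b), and all names are hypothetical.

```python
import math
import random

CRIT_95 = -2.0 * math.log(0.05)   # 95% point of chi-square(2), about 5.99

def s_set(x1, x2, sigma=1.0, crit=CRIT_95):
    """Closed-form set {theta : S(theta) <= crit}; None when the set is empty."""
    centre = (x1 + x2) / 2.0
    # S(theta) * sigma^2 = 2 * (theta - centre)^2 + (x1 - x2)^2 / 2
    slack = crit * sigma ** 2 - (x1 - x2) ** 2 / 2.0
    if slack < 0:
        return None
    half = math.sqrt(slack / 2.0)
    return centre - half, centre + half

def simulate(offset, theta=0.0, sigma=1.0, reps=20000, seed=3):
    """Return (fraction of sets covering theta, fraction of empty sets)."""
    rng = random.Random(seed)
    covered = empty = 0
    for _ in range(reps):
        x1 = rng.gauss(theta, sigma)
        x2 = rng.gauss(theta + offset, sigma)   # offset != 0 violates assumption (b)
        result = s_set(x1, x2, sigma)
        if result is None:
            empty += 1
        elif result[0] <= theta <= result[1]:
            covered += 1
    return covered / reps, empty / reps

if __name__ == "__main__":
    print("assumptions hold:", simulate(offset=0.0))
    print("assumption (b) violated:", simulate(offset=3.0))
```

With `offset=0.0` the set covers theta in about 95% of repetitions and is only occasionally empty. With a large offset, empty sets become common, which is the behaviour the quoted passage describes for a misspecified model.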

• Richard D. Morey says:

> Seems perfectly reasonable and fairly conventional in frequentist terms.

Yes, and there’s the problem. In frequentist terms, “rejection” of the hypothesis only means that you’ve taken some action that you know would you would reject X% of the time, if the hypothesis were true. It does not mean you have grounds to disbelieve the hypothesis (or the assumptions), or that your data were “inconsistent” with the assumptions. Frequentist terms like “reject” and “confidence” have special technical definitions; they don’t mean what they normally mean in everyday speech (in fact, one could argue that the continued popularity of such methods is due to conflating the technical and lay definitions of such words…).

That’s why examples of confidence intervals that don’t include most of the plausible values are so telling. You’d “reject” the values outside the CI as being “inconsistent with the hypothesis”, but they might be as consistent with the assumptions as any value inside the confidence interval.

• Richard D. Morey says:

eh, “know would you would reject” -> “know you would take” in the second sentence…

• I don’t see any problem here. In the Stock-Wright-Anderson-Rubin example I cited, H0 has two parts, (a) and (b). (a) relates to the value of theta. (b) is another assumption. Together they imply a test statistic with a particular distribution, which allows both testing and construction of CIs. If for some dataset the test “rejects” every possible hypothetical theta, this is indeed telling, and focuses attention on (b). Which is what S&W are referring to when they talk about what can happen “if the model is misspecified”. All quite conventional in frequentist terms.

My point isn’t that there aren’t any general issues with the frequentist approach (I’ve just about convinced myself that in an ideal world, we’d be teaching undergraduates Bayesian statistics with the frequentist approach as an optional add-on, and not the other way around as we mostly now do). Rather, I was objecting to a particular point made by Andrew. It seems to me that empty CIs, or CIs that are the entire real line, don’t necessarily pose particular problems for frequentist interpretation.

–Mark

Not to be too picky because I love the approach and the title. I mean, c’mon, everyone has to agree that “Robust Misinterpretation” is an awesome start to the title. But I’ve run into issues with students where I’ll do a short T/F quiz (not just in stats but in my other courses such as research methods) where I’ll make all the answers true or all the answers false and it’s rare that folks will get 100% on these assessments. Students just seem to believe that there’s no way everything could be either all true or all false. So I would characterize the folks who got only one answer wrong (and maybe two) as the same as the few folks who got none wrong simply from the test format rather than seeing them as different.

So if you did this then the rates for having one or fewer wrong answers are 8% for undergrads, 24% for masters, and 12% for researchers, which for the students probably matches the grade they got in stats (i.e., they’re better students) but is still very troubling for researchers, in that 88% got two or more wrong and 74% got three or more wrong. Yes, I know generalizability is an issue but sigh…

• Nick Menzies says:

I agree. I have been conditioned by enough standardized testing that given this question format I would expect at least one option to be true (or at least one to be false..), so if a respondent is sure about the correct response to all but one statement, it would be very easy to assume that statement breaks the other way.

8. Stephen Senn says:

In my experience there is an even worse and common misinterpretation of confidence intervals by medical researchers that Hoekstra et al did not even consider. I often come across the interpretation that if, from a clinical trial, the confidence interval for active treatment minus placebo has limits (say) 0.1 to 0.4, this means that the true treatment effect (active - placebo) varies at the individual level between these limits with 95% probability. Of course this is ludicrous since a) individual effects are not identifiable in parallel group trials and b) the confidence interval gets narrower and narrower as the sample size increases but variation at the individual level would not change.
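Point (b) can be illustrated with a simulated parallel group trial. This is a sketch under invented assumptions: individual treatment effects are drawn from a normal distribution with mean 0.25 and standard deviation 0.5, which could never be observed directly in a real parallel group trial. The interval uses a large-sample normal approximation, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

Z95 = NormalDist().inv_cdf(0.975)

def trial_interval(n_per_arm, effect_mean=0.25, effect_sd=0.5, seed=4):
    """95% interval for the mean difference (active - placebo), normal approximation."""
    rng = random.Random(seed)
    placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    # Each treated patient gets their own effect on top of the baseline noise.
    active = [rng.gauss(0.0, 1.0) + rng.gauss(effect_mean, effect_sd)
              for _ in range(n_per_arm)]
    diff = mean(active) - mean(placebo)
    se = (stdev(active) ** 2 / n_per_arm + stdev(placebo) ** 2 / n_per_arm) ** 0.5
    return diff - Z95 * se, diff + Z95 * se

def individual_effect_range(effect_mean=0.25, effect_sd=0.5):
    """Central 95% range of the individual effects themselves; does not depend on n."""
    return effect_mean - Z95 * effect_sd, effect_mean + Z95 * effect_sd

if __name__ == "__main__":
    print("individual effects:", individual_effect_range())
    for n in (100, 1000, 10000):
        print(n, trial_interval(n))
```

The interval for the mean difference shrinks roughly with the square root of the sample size, while the range of individual effects stays fixed at the value assumed in the simulation.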

From memory, I think that this misunderstanding is covered in Andy Grieve’s little book FAQs on Statistics in Clinical Trials. The common misunderstanding about identifiability of individual effects, which goes right up to the top of the pharmaceutical industry (whose CEOs have been betting the house on personalised medicine due to their misunderstanding), is treated in Senn, S. (2009) Three things that every medical writer should know about statistics. The Write Stuff, 18 (3). pp. 159-162. ISSN 1854-8466 http://eprints.gla.ac.uk/8107/1/id8107.pdf.

See also Senn S. Being Efficient About Efficacy Estimation. Statistics in Biopharmaceutical Research 2013; 5: 204-210 for evidence that even the FDA may not be totally immune.

For a sceptical view on pharmacogenomics see Guernsey McPearson ‘Braking the code’ http://www.senns.demon.co.uk/wprose.html#Braking and ‘Phairy story’ http://www.senns.demon.co.uk/Phairy%20Story.htm

To return to confidence intervals, I always used to teach them in terms of hypothesis testing as the set of null hypotheses that would not be rejected etc.

• Corey says:

if… the confidence interval… has limits (say) 0.1 to 0.4 that this means that the true treatment effect (active -placebo) varies at the individual level between these limits with 95% probability.

You– you’re kidding, right? Please tell me you’re kidding.

• question says:

Corey,

I have been to talks where researchers say they increased the sample size in order to “reduce variance” and it did not seem to bother anyone. There is widespread confusion over even the meaning of standard deviation vs standard error of the mean.

• Juho Kokkala says:

But increasing the sample size does reduce variance, namely, the sampling variance of the effect size estimate. Thus, if they say, e.g., “we increased the sample size to N=1000 in order to reduce variance,” I don’t see why anyone should be bothered.

• question says:

It was just everyday variance. But you got me to simulate some data of different sample sizes and while the average variance was not affected there was a wider range for small samples (n<10). So I suppose sample variance is decreased with sample size and I was wrong.
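The kind of simulation described in this comment might look like the sketch below. It assumes standard normal data; the function name, sample sizes, and number of repetitions are illustrative and are not the commenter's actual code.

```python
import random
from statistics import mean, pstdev, variance

def variance_summary(n, reps=1000, sigma=1.0, seed=5):
    """Summarise sample variances and sample means over repeated samples of size n."""
    rng = random.Random(seed)
    sample_vars, sample_means = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        sample_vars.append(variance(sample))   # unbiased estimate, n - 1 denominator
        sample_means.append(mean(sample))
    return {
        "mean_of_sample_variances": mean(sample_vars),
        "sd_of_sample_variances": pstdev(sample_vars),
        "sd_of_sample_means": pstdev(sample_means),   # empirical standard error
    }

if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, variance_summary(n))
```

Three quantities are easy to conflate here. The population variance is fixed, and the average sample variance stays close to it at every n. What shrinks as n grows is the spread of the sample variances and the spread of the sample means, the latter being the standard error of the mean.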

• Stephen Senn says:

No, Corey, I am not kidding. Another error commonly made is that because two empirical distributions overlap (to some degree) so that some in the treatment group are worse than the control average, this proves that they did not benefit. The fact that some statisticians are now championing overlap measures of the treatment effect contributes to the confusion. I am not saying that the statisticians make the mistake but many of their medical colleagues will.

See my comment on Thas et al http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2011.01020.x/full

• Anonymous says:

I’m certainly skeptical of pharmacogenomics, but how is retrospectively reporting outcome/baseline/treatment correlation any worse than all the non-randomized correlations which are endlessly reported and discussed all the time in epidemiology or social sciences? As long as it’s interpreted for the observational correlation that it is and not as a well-identified causal effect, is this really the fraud that you seem to be implying it is?

Also, regarding the statement “pharmacogenomics is based on the largely untested hypothesis that patients respond very differently to treatment from each other”. I’d agree that it’s questionable whether genomics is the right basis for identifying differences in treatment response, but I would take as a default that it’s almost inconceivable that true treatment effect differences will be exactly 0 across individuals. Whether we pick the right characteristics for differentiating patients or have the power to detect such differences is the problem.

9. Eric Loken says:

I agree with the earlier post that survey question #3 is interesting.

When the 95% CI is [.1, .4] then if you say:

“we can reject the null hypothesis at the p = .05 level.” you pass.

but “the null hypothesis is likely to be incorrect” shame on you

Yes, we want to make sure students don’t interchange p(data|null) and p(null|data). However, if they are told that it is correct to reject the null hypothesis, then why not be allowed to think that the evidence is inconsistent with the null and, barring some strong prior reason to suspect that the null is true, it is “likely to be incorrect”? Would we prefer they read the evidence and assert “The null hypothesis is likely to be correct”? So if faced with a TRUE/FALSE test, why not respond regarding the apparent truth value of the statement rather than infer that the TRUE/FALSE designation refers to the permissibility of the statement itself?

Just saying #3 would be a poor intro stat exam question, especially if the results were used to say that the students were fundamentally confused about the material in the course. I can live with the others.

10. Nicolas says:

I had the same feeling as some of the other posters, namely that statement 3:

“The “null hypothesis” that the true mean equals 0 is likely to be incorrect”

is not false. The confidence interval for this particular sample does not contain 0, and we know that across hypothetical repeated samples 95% of confidence intervals constructed in this way include the true mean. Hence, unless we’re unlucky in our draw of this particular sample so that it leads us to construct a confidence interval that excludes the true mean, why is the idea that “the true mean equals 0 is likely to be incorrect” false? I can’t figure out what I’m missing here.

• question says:

“We are inclined to think that as far as a particular hypothesis is concerned,
no test based upon the theory of probability can by itself provide any valuable
evidence of the truth or falsehood of that hypothesis.”

Neyman, Jerzy; Pearson, Egon S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses”. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 231 (694–706): 289–337. doi:10.1098/rsta.1933.0009. JSTOR 91247

I would say that in many cases the null hypothesis is known to be incorrect anyway (two groups are exactly equal…), but we do not have enough information from that survey question to say.

11. EJ Wagenmakers says:

Nicolas, Eric, John, and others,

Thanks for your comments. Note that the instruction said that “false” means that the statement does not follow logically from the result. The statement that the null hypothesis is likely to be incorrect does not follow logically from the confidence interval result. As you mention, such a statement about the null hypothesis depends on its prior plausibility. For instance, if the confidence interval on effect size for ESP is from .1 to .4, you would not be compelled to conclude that the null hypothesis is likely to be incorrect. On this blog, Andrew describes many examples of recent research for which the conclusion “the null hypothesis is incorrect” is premature, even though the confidence interval may not overlap with zero.

Cheers,
E.J.
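The dependence on prior plausibility can be made concrete with a small sketch. Everything in it beyond the interval [.1, .4] is an assumption made for illustration: the estimate is treated as normal with standard error 0.3/(2*1.96), the alternative places a N(0, tau²) prior on the effect with tau = 0.25, and the two prior probabilities for the null (0.5 for an open-minded reader, 0.999 for an ESP skeptic) are arbitrary. The function name `posterior_prob_null` is hypothetical.

```python
from statistics import NormalDist

def posterior_prob_null(estimate, se, prior_null, tau):
    """P(effect == 0 | estimate) for a point null against a N(0, tau^2) alternative."""
    # Likelihood of the observed estimate if the true effect is exactly zero.
    like_null = NormalDist(0.0, se).pdf(estimate)
    # Marginal likelihood under the alternative: effect ~ N(0, tau^2),
    # estimate | effect ~ N(effect, se^2), so estimate ~ N(0, se^2 + tau^2).
    like_alt = NormalDist(0.0, (se**2 + tau**2) ** 0.5).pdf(estimate)
    numerator = prior_null * like_null
    return numerator / (numerator + (1.0 - prior_null) * like_alt)

estimate = 0.25                 # midpoint of the reported 95% CI [.1, .4]
se = 0.3 / (2 * 1.96)           # standard error implied by the CI width

open_minded = posterior_prob_null(estimate, se, prior_null=0.5, tau=0.25)
skeptic = posterior_prob_null(estimate, se, prior_null=0.999, tau=0.25)
print(f"P(null | data), prior 0.5:   {open_minded:.3f}")
print(f"P(null | data), prior 0.999: {skeptic:.3f}")
```

Under these assumptions the same interval leaves the null with a posterior probability of a few percent for the open-minded reader and well above 90% for the skeptic, which is the sense in which "likely to be incorrect" does not follow from the interval alone.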

• Anonymous says:

Sure E.J. I see your point, and as I said in general the demo is good. But what information is the crazy gloved scientist trying to convey by reporting 95% CI [.1, .4]? By your logic, the following inference would be false:

The effect size is likely not -55 gazillion.

What you are saying is that any person who dares assert anything about the “likely” nature of any point estimate outside the 95% interval is fundamentally confused about statistical inference.

This includes 86% of Dutch research faculty – a group you’ll be pleased to know I put a very high prior ability rating on!

• Anonymous says:

Um, that was me, Eric. Not sure where my name went.

• EJ Wagenmakers says:

Hi Eric,

Yes, this is what we are saying! Assertions about the plausibility of a parameter value require that the prior be taken into account, lest one confuse p(data|theta) with p(theta|data). In your example, if general relativity theory were to specify that a particular effect size should be exactly -55 gazillion, then we had better assign that value substantial prior probability. Usually we know, of course, that effect sizes of .1 are much more plausible than those of -55 gazillion. And if we know this, we should use that information in the prior, and our Bayesian credible interval (which does quantify the plausibility of the parameter values) will differ from the frequentist confidence interval.

Cheers,
E.J.

• EJ Wagenmakers says:

…and I’m not even talking about situations where the unconditional frequentist confidence interval provides radically different (nonsensical) results compared to the Bayesian credible interval (which conditions on all of the data). The Berger & Wolpert 1988 book has some great examples of what happens if you don’t condition on the data you’ve observed.
E.J.

• Eric Loken says:

EJ, so basically you are denying that there is any information value at all to a confidence interval. Even though the language of NHST is to say “reject”, the words “unlikely” and “implausible” are off limits?

I think it’s one thing to tell people that their classical p-value is not the p-value of the hypothesis they are testing, it’s another to tell them that their classical CIs say nothing whatsoever about what effects are “less likely to be true” or “implausible”. When I read empirical literature with confidence intervals, I feel like I am learning more than nothing.

• EJ Wagenmakers says:

Hi Eric,

In principle, frequentist statistics does not allow you to attach probability/plausibility statements to parameters. Researchers desperately want to do this, of course, and this is where the confusion starts. So yes, “unlikely” and “implausible” are off limits. One could argue that “reject” when p < .05 is not technically wrong because it refers to an action. The distinction is subtle and I guess it will be lost on most practitioners.

But I am not saying there is *no* information to a confidence interval. Often there is, in particular when frequentists try to condition on relevant aspects of the data at hand (relevant subsets). Regardless, the main point is that in order to attach probability statements to parameters, the information from the data needs to be combined with a prior.

Cheers,
E.J.

• Anonymous says:

EJ,

Can you rule out that some of the respondents in your survey combined *their* prior with the information in the interval estimate to arrive at their response of TRUE to one of the probabilistic statements about the parameter?

That said, I was also wondering if you ran a pilot of this survey with a fair number of statisticians whom you trust to understand confidence intervals. I think that, given the discussion here, there may be a good chance (it’s OK, I’m Bayesian) that respondents did not interpret the question the way you thought they would, even if they completely share your views on the lack of probabilistic interpretation that can be ascribed to a frequentist interval estimate.

• question says:

EJ,

I did not fully read the paper so maybe this information is included somewhere, but either way it is not as prominent. I think one obstacle to getting people to understand why their interpretations of p-values and CIs are wrong is that they do seem to correspond to the falsity of the hypothesis. If the effect size is far away from zero the p-value *is* lower. Perhaps future attempts could go into more detail on why the interpretations seem to make sense.

I found this paper by Michael Lew to be helpful. He has shown that (at least for the t-test) a p-value and sample size together index a likelihood. He suggests retaining p-values but getting rid of the concepts of “significance” and accepting/rejecting. This also makes some sense out of the nil null hypothesis:

“An interesting point often raised in arguments against the use of hypothesis tests and
significance tests for scientific investigations is that null hypotheses are usually known to
be false before an experiment is conducted. When defined as the probability under the
null hypothesis of obtaining data at least as extreme as those observed, P-values would
seem to be susceptible to the criticism in that they measure the discordance between the
data and something that is known to be false. The argument may have some relevance
to hypothesis tests, but it is irrelevant to any use of P-values in estimation and in the
assessment of evidence because the null hypothesis serves as little more than an anchor
for the calculation - a landmark in parameter space, as was discussed in section 4.1.1.”

To P or not to P: on the evidential nature of P-values and their place in scientific inference
Michael J. Lew (Submitted on 1 Nov 2013)
http://arxiv.org/abs/1311.0081

This was the first time I have come across something that effectively communicated to me (a non-statistician) why the incorrect interpretations of p-values appear to make sense.

• question says:

I apologize to Michael for the typos when quoting him. It did not paste properly.

12. bxg says:

In many places and even some areas of study, the “false” answers are the ones you need to know (or know how to go along with, from time to time) in order to stay employed and relevant.

Sure, you need to know that a CI of [0.1,.4] doesn’t mathematically imply that a negative mean is improbable, and in some contexts (having a beer with other statisticians) it would be embarrassing to get this wrong.

But in industry you will often be trained to go along with the “wrong” answer. Because what else are you going to
do? Tell them to do their statistics differently? (I.e. ignore the “text book” techniques that everyone
knows, and are still considered the mandatory gold standard in most academic fields). Or shorten your career
as you quibble with _every single good-faith attempt_ your manager/client makes to paraphrase the results
in terms somewhat useful or interpretable to them? Even if you work at it to train your _direct_ audience
about what “confidence” means in this context, you inevitably see that at one level removed your result
is rephrased incorrectly (in the marketing literature, lack of “confidence” becomes “improbable” or “unlikely” – and that’s if you are lucky).

Shorter version. “I got a CI of [0.1, 0.4]” might not logically imply -ve mean is improbable. But
as a speech act, in some contexts making this claim absolutely DOES communicate the assertion that “-ve mean is improbable”. You present a CI in such contexts, you ARE telling people this consequent. You might not like this,
nor do I, but IMHO it’s often so nevertheless.

• question says:

“But in industry you will often be trained to go along with the “wrong” answer. Because what else are you going to
do? Tell them to do their statistics differently? (I.e. ignore the “text book” techniques that everyone
knows, and are still considered the mandatory gold standard in most academic fields). Or shorten your career
as you quibble with _every single good-faith attempt_ your manager/client makes to paraphrase the results
in terms somewhat useful or interpretable to them? “

This is a serious problem. Even if the author/presenter understands what the terms mean, over 90% of the audience will be confused anyway.

There is simply no way to effectively communicate with people whose thinking is polluted by decades of statistical myths they have developed in trying to make sense of the NHST hybrid and frequentist logic. This is driving intelligent people from research and leaving behind ever higher percentages of those who are confused. Consider the frustration of a student who will not be listened to by their “superiors”.

Fisher’s prediction came true:
“We are quite in danger of sending highly-trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.”

Fisher, R A (1958). “The Nature of Probability”. Centennial Review 2: 261–274.

13. Matt Williams says:

What strikes me about the questionnaire is the fact that if we assume a reasonably flat prior, then all of the statements (bar #6?) are likely to be correct, or very close to correct. And how much prior information is given in the scenario the participants are exposed to? None… So what kind of prior is reasonable? Surely a flat one… The article mentions the idea that non-informative priors aren’t “valid” [proper?] – but isn’t it the case that even a weakly informative prior will generally lead to a credible interval that is very similar to the corresponding confidence interval?

Obviously I’m not trying to claim that all the participants in this scenario are actually Bayesians who sat down, chose priors, and converted the information given into credible intervals! But whether or not these kinds of statements about confidence intervals are *importantly* wrong really does depend on the prior.

I wonder whether, instead of telling researchers that their interpretations are wrong, a better line of attack could be to argue that the interpretations that folks generally make of confidence intervals are (approximately) valid only given assumptions about the prior information that are unlikely to be true. Usually we know that effects are most likely within yelling distance of zero. Very often we have some prior idea of their direction.

I think that the types of claims in the six incorrect statements are perfectly reasonable things for researchers to *want* to be able to make claims about. Instead of telling them they’re making bad interpretations, maybe we should be telling them that they’re aiming for the right kind of conclusions, but using the wrong tests. I just worry that people read this kind of stuff and take away the message that they need to be more careful to make the “correct” mealy-mouthed interpretations of confidence intervals and p values, while still doing the same old analyses.
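How much the prior matters here can be sketched with a conjugate normal model. This is only an illustration under stated assumptions: the estimate behind the survey's [.1, .4] interval is treated as normal with known standard error, the prior is N(0, prior_sd²), and the two prior widths (10 for "nearly flat", 0.1 for "effects are within yelling distance of zero") are arbitrary choices. The function name `credible_interval` is hypothetical.

```python
from statistics import NormalDist

def credible_interval(estimate, se, prior_mean, prior_sd, level=0.95):
    """Central posterior interval for a normal mean with a normal prior."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / se**2
    post_var = 1.0 / (prior_precision + data_precision)
    # Posterior mean is the precision-weighted average of prior mean and estimate.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * post_var**0.5
    return post_mean - half_width, post_mean + half_width

z95 = NormalDist().inv_cdf(0.975)
estimate = 0.25                  # midpoint of the 95% CI [.1, .4]
se = 0.3 / (2 * z95)             # standard error implied by the CI width

near_flat = credible_interval(estimate, se, prior_mean=0.0, prior_sd=10.0)
skeptical = credible_interval(estimate, se, prior_mean=0.0, prior_sd=0.1)
print("near-flat prior:", [round(b, 3) for b in near_flat])
print("skeptical prior:", [round(b, 3) for b in skeptical])
```

With the nearly flat prior the credible interval essentially reproduces the confidence interval, while the prior concentrated near zero pulls both bounds toward zero and narrows the interval, so the usual readings of a CI are approximately right only under the first kind of prior.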

• Andrew says:

Matt:

I agree. The problem is with the flat prior, and ultimately the solution has to be to add information into the problem, not to just reformulate a p-value as a confidence interval or a posterior distribution or whatever. As you write, the problem with these confusions is not so much that people are saying the wrong thing, but rather that there’s something they want to say that doesn’t really work in many settings.

• question says:

Andrew,

Many of the people making these errors will not know what a prior distribution is and will not want to include subjective information or previous data into their calculations. History has shown that they prefer their myths to spending time looking into this.

I think it will be easier instead to make the distinction between theories capable of precise prediction and those capable of only predicting a direction. Then get rid of all “testing” in the latter case. Instead the role of that research is to describe the data and look for patterns until someone can come up with a theory capable of prediction. Paul Meehl (who does not receive nearly enough attention in my opinion) describes this issue very well. It should be required reading for all scientists:

“Because physical theories typically predict numerical values, an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching 1/2 of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by “success” is very weak, and becomes weaker with increased precision. “Statistical significance” plays a logical role in psychology precisely the reverse of its role in physics. This problem is worsened by certain unhealthy tendencies prevalent among psychologists, such as a premium placed on experimental “cuteness” and a free reliance upon ad hoc explanations to avoid refutation.”

Theory-Testing in Psychology and Physics: A Methodological Paradox
Paul E. Meehl. Philosophy of Science
Vol. 34, No. 2 (Jun., 1967), pp. 103-115
http://www.jstor.org/stable/186099

• question says:

Andrew,

Actually, I would appreciate it if you made a blog post about this paper so I could see what others here think about it. I have not been able to find a critique anywhere despite the large influence it’s had on my own thinking. I think it would be productive; your blog posts and the comments contain the most intelligent conversation of science/statistics I have seen anywhere on the internet.

• Matt Williams says:

I think this is an interesting point. The problem is that coming up with a *plausible* theory that makes point predictions about human behaviour is a Herculean task.

The standby option of exploration, description and pattern-finding sounds great, but I suspect to do this effectively would require a change to how people go about the task of social science research that’d be even greater than switching from frequentist to Bayesian inference.

14. Jan says:

Thanks for the link and your comments. I’m an autodidact in statistics and rely on discussions like these to fine-tune my understanding of such concepts. Apparently, my intuition gets it wrong, but I still don’t understand why.

All right. There is some population of which only I know the real mean (mu) and other properties. I draw 1,000 samples of 20 datapoints from this population and compute 90% (symmetric) confidence intervals for each of the sample means. About 900 of these 1,000 intervals contain the secret mu.

Then, I pick one of these 1,000 intervals *at random*. What’s the probability that I picked one that contains the secret mu? Surely 90%? Surely you should be happy to lay, say, 5-to-1 odds when betting that one of these 1,000 intervals, picked at random, contains the secret mu?

I always thought the problem with the word ‘probability’ in this context was that statisticians reserve that word for talking about random variables. Since the secret mu is fixed rather than random, it can’t have its own distribution. But why can’t I say that the interval that I pick at random is a 9-to-1 favourite of containing mu? Is it merely a matter of terminological convention perhaps?

I appreciate that the situation can change considerably once you have some prior knowledge and you know what the random interval actually is. But in the absence of prior knowledge, why can’t *you* say that the randomly picked interval of [24.08; 26.39] is a 9-to-1 favourite of containing mu? (I happen to know it either does or doesn’t.)
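The pre-data part of this thought experiment is easy to simulate. The sketch below assumes a normal population with mu = 25 and sigma = 3 (both invented for illustration), uses the standard t-based interval for a mean, and hard-codes 1.729 as the 95th percentile of the t distribution with 19 degrees of freedom, since the Python standard library has no t quantile function.

```python
import random
import statistics

random.seed(1)                  # fixed seed so the run is reproducible

MU, SIGMA = 25.0, 3.0           # hypothetical "secret" population values
N_SAMPLES, N = 1000, 20
T_CRIT = 1.729                  # t quantile, 19 df, for a two-sided 90% interval

def interval(sample):
    """Symmetric 90% t-interval for the mean of one sample."""
    centre = statistics.mean(sample)
    half_width = T_CRIT * statistics.stdev(sample) / len(sample) ** 0.5
    return centre - half_width, centre + half_width

intervals = [
    interval([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(N_SAMPLES)
]
covered = sum(lo <= MU <= hi for lo, hi in intervals)

# Pick intervals blindly, many times, and record how often the pick contains mu.
picks = [random.choice(intervals) for _ in range(10_000)]
hit_rate = sum(lo <= MU <= hi for lo, hi in picks) / len(picks)

print(f"{covered} of {N_SAMPLES} intervals contain mu")
print(f"hit rate of a blind pick: {hit_rate:.3f}")
```

The blind pick does hit at roughly the coverage rate, which is the bet described above. What the simulation cannot show is whether the same number still applies once a specific interval such as [24.08; 26.39] has been seen, which is the point in dispute.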

• EJ Wagenmakers says:

Hi Jan,

There are many responses to the issues you raise. My immediate response is this. The performance guarantee applies to the procedure when repeatedly applied, so *on average*, across the sample space of hypothetical replications. It does not apply to a specific interval. Suppose you observe data, and the procedure that gives you 90% coverage yields the interval [24.08; 26.39]. For these specific data, it could be evident that the mean mu cannot lie in this particular interval, even though the procedure you used to compute that interval has 90% coverage. Here is an extreme example: Suppose I weigh people on a scale that gives an accurate reading in 90% of the cases; in 10% of the cases the scale malfunctions and returns “1”. Now on average, my statements about weight will be correct 90% of the time. Conditioning on the data, however, shows that I have learned nothing when x=1, whereas for any other x the correct value is known with certainty. Reporting 90% confidence when x=1 seems silly.

This is the price you pay for not conditioning on the data that were actually observed. In fact, in the context of betting you can demonstrate that the only way to avoid losing money (from the existence of recognizable subsets, subsets for which the average is not representative) is to condition on *all* of the data. However, doing so makes one a Bayesian. A discussion of this is in Chapter 2 of Berger and Wolpert (1988).

Perhaps it would have been better if the “confidence interval” had been named “coverage interval”, or “average performance interval” or really anything else but “confidence interval”.

Cheers,
E.J.
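The broken-scale example can be simulated directly. The sketch assumes true weights drawn uniformly between 50 and 100 (an arbitrary choice, made so that nobody truly weighs 1) and a scale that reports the true weight with probability 0.9 and the value 1 otherwise.

```python
import random

random.seed(2)                       # fixed seed so the run is reproducible
N_WEIGHINGS = 100_000

readings = []
for _ in range(N_WEIGHINGS):
    true_weight = random.uniform(50.0, 100.0)    # hypothetical population
    # The scale works 90% of the time; otherwise it returns 1.
    reading = true_weight if random.random() < 0.9 else 1.0
    readings.append((true_weight, reading))

def accuracy(pairs):
    """Fraction of weighings in which the reading equals the true weight."""
    return sum(true == shown for true, shown in pairs) / len(pairs)

overall = accuracy(readings)
when_one = accuracy([p for p in readings if p[1] == 1.0])
otherwise = accuracy([p for p in readings if p[1] != 1.0])

print(f"overall:          {overall:.3f}")
print(f"given reading 1:  {when_one:.3f}")
print(f"given other x:    {otherwise:.3f}")
```

Averaged over all weighings the procedure is right about 90% of the time, yet conditional on the observed reading it is either always wrong (x = 1) or always right (any other x). The two subsets are recognizable from the data, which is why the average figure is a poor description of any single case.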

• Jan says:

Hello E.J.,

Thanks for taking the time to reply.

***

“Suppose you observe data, and the procedure that gives you 90% coverage yields the interval [24.08; 26.39]. For these specific data, it could be evident that the mean mu cannot lie in this particular interval, even though the procedure you used to compute that interval has 90% coverage.”

Sure, but then you rely on prior knowledge. I fully appreciate that prior knowledge ought to be taken into account when applicable, but in this example, there wasn’t any to go on. (From what I gather, this is the uniform prior distribution that others have been talking about.) In that case, isn’t the interval [24.08; 26.39] just one out of many and one that has a 90% chance of being correct?

Up till now that has been my understanding of confidence intervals: baseline probability in the absence of prior knowledge. Of course, that doesn’t mean I would blindly ‘trust’ any given CI when prior knowledge *is* available. For instance, I recently got a 95% confidence interval for a regression coefficient that spanned from -0.17 to -0.01. But a negative coefficient would make little sense in that context, so I qualified that CI in order to ward off overinterpretations of this counterintuitive result (of small size).

***

I don’t understand what you mean by “the only way to avoid losing money … is to condition on *all* the data”. I’ll take a look at the chapter you recommended (thanks!), but in the meantime: Are you saying you wouldn’t be willing to lay 8.5-to-1 odds that my secret mu is in the [24.08; 26.39] interval? (The computer picked it at random.) If you aren’t, I’m still not getting it. (Or perhaps you have reasons to believe that I’m rather fond of the number 27?)

***

Re: the CI’s name. I agree, and it’s a difficult term to translate, too. The Dutch term is ‘betrouwbaarheidsinterval’, but ‘confident’ translates as ‘zeker’, not as ‘betrouwbaar’.

Thanks again!

15. John says:

So it seems, EJ, that you’re stating that because there are cases for which one could be quite certain that the true mean is 0, indicating that answer 3 is a correct interpretation is an error. But I’m not sure that, in the way things are put in this paper, one can’t say that the null is unlikely to be true. Even if I accept all of the Bayesian arguments I’m pretty sure that given the whole universe of experiments it’s still not likely. If you had attached some specific probability to #3, like 95% confident it doesn’t include 0, I’d be in total agreement that it’s false. But that loosely worded item in the questionnaire still seems unfair.

The “95% confidence” label is my probability of being correct in estimating an interval that captures the mean. It’s certainly not always 95% because rarely are all of the assumptions met exactly. But it’s usually pretty high as far as being right about anything statistically goes. I do an experiment for which there are no priors and there is no known mean. As stated in the questions, we know nothing but [0.1, 0.4]. If that’s all I know I’m pretty willing to stick my neck out and say that 0 is unlikely to be the mean. I’d even bet $1 on it. Without some prior I can’t attach a specific value to my “unlikely” and without further information I can’t be really certain about much. Arguing after the questionnaire that we were supposed to believe a professor named “Bumbledorf” with demonstrably poor hygiene had more information; or that we should have incorporated Bayesian thinking into a frequentist story seems unfair in the context of the test.

All of the other items on the questionnaire fall down entirely within frequentist thinking.

If we’re going to let Bumbledorf be really smart then how about we let him be smart enough to have a good idea in his field when CI’s have good coverage and when they don’t. Further, we allow him to be very flexible on this matter and not be calculating CI’s as a matter of course when he’s clearly got a situation for which it’s not going to help much anyway. In that case I’m trusting Bumbledorf’s CI and saying 3 is correct. If you can add in a Bayesian argument post hoc to say 3 doesn’t follow can’t we just add in any old information? I mean, he’s a professor! He should know what he’s doing. The Bayesian argument being added here doesn’t necessarily follow from the situation set up in the questionnaire any more than this one does.

• question says:

“If that’s all I know I’m pretty willing to stick my neck out and say that 0 is unlikely to be the mean.”

Where there is no prior information the mean is unlikely to be exactly any value (including zero) whether or not it is within the confidence interval. The logic behind CIs does not allow a claim to be made about the falseness of a hypothesis on its own:

“We are inclined to think that as far as a particular hypothesis is concerned,
no test based upon the theory of probability can by itself provide any valuable
evidence of the truth or falsehood of that hypothesis.”

Neyman, Jerzy; Pearson, Egon S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses”. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 231 (694–706): 289–337. doi:10.1098/rsta.1933.0009. JSTOR 91247

You need a theory that predicts a precise value if you want to make informed bets as to what this value will be. If the theory predicts zero and the CI does not overlap then either the theory is incorrect, it is due to sampling error, or the experiment got messed up. The probability your null hypothesis is false will depend on how you assess those possibilities.

If the theory is that the mean difference between two groups of people is exactly zero (they differ in no way other than the treatment), I would find it difficult to accept this is due solely to technical problems with the experiment. If the theory predicted the intensity of light measured some distance from a source, and many previous experiments yielded results consistent with the theory, I would think it is likely that something went wrong with the experiment.

Really, focusing on means and treating individual differences as noise is misguided in the first place, so I find much of this discussion to be begging the question. It makes sense if the individual differences are due to measurement error (the location of Saturn when using different telescopes), but not when the individuals are people/animals and the final goal is to guide medical practitioners or inform social policies.

• John says:

“But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.”

From which I’m going to use the CI95 procedure to devise a rule for betting on a range of values where I’m not often wrong.

• question says:

John,

The question says “False means the statement does not follow logically from Bumbledorf’s result”.

I agree (knowing only the confidence interval) that the true mean=0 is likely to be incorrect. Let’s answer some alternative questions, ignoring the qualification “follows logically from the result”:

The “null hypothesis” that the true mean=
A) 0.25 is likely to be incorrect
B) 0.1 is likely to be incorrect
C) 0.099 is likely to be incorrect
D) 1000 is likely to be incorrect

I would answer true to all, including those within the interval. The null hypothesis that the mean=0 refers to an exact value. We are truncating the decimal out of convenience. It is actually that the mean=0.000000… Given the confidence interval in the example we could guess the most likely value is 0.250000… Those would be comparable bets and one is not much better than the other unless you have a theory that predicts the value. I think you are comparing betting on an exact value vs betting on a range of values.

The 1933 N-P paper also makes this argument:
“Indeed, if x is a continuous variable-as for example is the angular distance between two stars-then any value of x is a singularity of relative probability equal to zero.”

Also, if everything went right with the experiment and there is only sampling error, then 95% of the 95CIs will contain the “true” value but the CI algorithm does not distinguish between e.g. (0.1 and 0.25) or (0 and 0.099). The boundaries of the interval have been determined using the arbitrary 0.05 alpha cutoff level. We can actually calculate the alpha we would need to use that would include zero from the width of the interval.

The 95CI is mean +/- z*sd/sqrt(n); we know z = 1.96, the width = 0.3, and the mean is the midpoint, 0.25.
Dividing the width (upper minus lower bound) by 2*z gives the ratio r = sd/sqrt(n) = 0.3/(2*1.96) = 0.07653.
For the lower bound to equal 0 we need
mean - z*r = 0, so z = mean/r.
The z-score is related to alpha through the cdf of the standard normal (pnorm in R):
1 - alpha/2 = pnorm(z),
alpha = 2 - 2*pnorm(mean/r) = 2 - 2*pnorm(3.266667) = 0.00109

A 99.9CI would include zero, which is around “3.3 sigma”, not good enough to “reject” zero by the standards of the physics community.

So, it appears that a value being outside the 95CI is not necessary to make the claim it is likely incorrect, nor sufficient to make the claim it is likely incorrect. If A is neither necessary nor sufficient for B, it is difficult to see how B could be said to follow logically from A.
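The arithmetic above can be wrapped in a small function that asks, for any reported interval, what alpha level would stretch it just far enough to touch a given value. This is a sketch under the same normal-approximation assumption as the calculation it follows; the function name `alpha_to_reach` is hypothetical, and `NormalDist` from the Python standard library plays the role of `pnorm` and `qnorm`.

```python
from statistics import NormalDist

def alpha_to_reach(value, lower, upper, level=0.95):
    """Alpha at which a normal-theory CI with this centre would just reach `value`."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)      # 1.96 for a 95% interval
    centre = (lower + upper) / 2
    se = (upper - lower) / (2 * z)               # sd/sqrt(n) implied by the width
    z_needed = abs(centre - value) / se          # distance to `value` in standard errors
    return 2 * (1 - std_normal.cdf(z_needed))

alpha = alpha_to_reach(0.0, 0.1, 0.4)
print(f"alpha needed to reach zero: {alpha:.5f}")
print(f"corresponding confidence level: {100 * (1 - alpha):.2f}%")
```

The result is an alpha of about 0.001, so an interval at a confidence level only slightly above 99.89% built from the same data would already contain zero.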

• John says:

This seems like a substantial tangent.

The fact that there’s another way to get to the conclusion that 0 is unlikely to be the true value has nothing to do with the question. The question does not say, “follows logically from the result and only from the result.”

And, while you’re ignoring the “follow logically” clause you’re also adding things to the given information. We do not know that we have continuous values. All we know is that we have a CI.

• question says:

Let’s say instead n is small; this is possible from the CI we have been provided. In this case we use a t distribution for our “z-value” and don’t need to appeal to the cautious physicists. If n=2, the sd must be ~0.017 and a 97% CI would include zero:

n = 2
z = qt(0.975, df=n-1)            # 12.7062
r = .3/(2*z)                     # sd/sqrt(n) implied by the CI width
sd = r*sqrt(n)                   # 0.016695
alpha = 2-2*pt(.25/r, df=n-1)    # 0.0300

Lower97CI = 0.25-qt(1-.03/2, df=n-1)*sd/sqrt(n)   # -0.0003
Upper97CI = 0.25+qt(1-.03/2, df=n-1)*sd/sqrt(n)   # 0.5003

From the same data (if n=2, which we don’t know for sure), the interval (97CI instead of 95CI) is now [-.0003, 0.5003] and includes zero just because we slightly changed the alpha-level. Would you still bet that zero was not the true mean? For what set of CIs would you not bet?
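The n = 2 case can be checked without a statistics package, because the t distribution with 1 degree of freedom is the standard Cauchy distribution, whose cdf and quantile function have closed forms. The sketch below relies on that special case only (it does not generalise to other n), and the helper names `t1_cdf` and `t1_quantile` are hypothetical stand-ins for R's `pt` and `qt` at df = 1.

```python
import math

def t1_cdf(t):
    """cdf of the t distribution with 1 df (standard Cauchy)."""
    return 0.5 + math.atan(t) / math.pi

def t1_quantile(p):
    """Quantile function of the t distribution with 1 df."""
    return math.tan(math.pi * (p - 0.5))

n = 2
centre, width = 0.25, 0.3                 # from the reported 95% CI [.1, .4]

t_crit = t1_quantile(0.975)               # critical value for a 95% interval
se = width / (2 * t_crit)                 # sd/sqrt(n) implied by the width
sd = se * math.sqrt(n)
alpha = 2 - 2 * t1_cdf(centre / se)       # alpha at which the interval reaches zero

half_width_97 = t1_quantile(1 - 0.03 / 2) * se
lower97, upper97 = centre - half_width_97, centre + half_width_97

print(f"t_crit = {t_crit:.4f}, sd = {sd:.6f}, alpha = {alpha:.4f}")
print(f"97% CI = [{lower97:.4f}, {upper97:.4f}]")
```

Moving from a 95% to a 97% interval is enough to bring zero inside it when n = 2, because the heavy tails of the t distribution at 1 degree of freedom make the critical value grow very quickly with the confidence level.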

16. Christian Hennig says:

I think that the problem with the statements 3 and 5 is that most people wouldn’t use a proper technical definition to assess the terms “likely” and “x% confident”.
The technical meaning of the term “probability” has been debated for centuries and people still don’t agree what that means, although knowledge of both Bayesian and frequentist interpretations of it, which are discussed in plenty of courses and textbooks, will enable people to say that statements 1, 2 and 4 are indeed wrong.
It is not so clear though that “likely” should technically refer to a probability or likelihood, and therefore I wouldn’t agree with Wagenmakers, who wrote on this blog: “In principle, frequentist statistics does not allow you to attach probability/plausibility statement to parameters. Researchers desperately want to do this, of course, and this is where the confusion starts. So yes, “unlikely” and “implausible” are off limits.” True, these terms are not a proper frequentist interpretation of the result, but can well be seen as a (somewhat underdefined) translation of such a thing into layman’s terms. I’m not happy with no. 3 because of its unclarity, but branding it “false” to me looks like claiming too much authority over how people use the term “likely” outside Statistics departments.
The story with no. 5 is slightly different. Russell Lyons, I think, wrote that “we’re 95% confident that…” could be technically defined to have the very meaning that makes statement no. 5 true. I personally hate the “we’re 95% confident”-wording, but not because it would be false, but rather because I think that it was introduced as an attempt to give people something simpler to say than what is the proper meaning of a 95% CI, which many people find too difficult and confusing. This means that it pretends to be a simpler and clearer interpretation of a CI, but in fact it is actively obfuscating, because it doesn’t add any explanation/interpretation to the proper meaning but rather hides how difficult that one is. As far as I’m aware, there is no use of “confidence” quantified by percentages anywhere except when it comes to interpreting CIs, and so nobody can expect that anybody knows properly what that means, let alone whether the given statement is true or false. Unless, as suggested earlier, it is defined to mean exactly what makes statement 5 true, in which case it is a red herring.
To repeat, I really really don’t like this wording for interpreting CIs, but if I were to decide whether it’s true or false, if it is any of these (which I think it’s rather not), it is true.

• Anonymous says:

These sloppy “somewhat underdefined translations into layman’s terms” are the heart of so many statistical misconceptions.

I think statisticians are partially to blame here, and are not without a conflict of interest. If it were taught that, strictly speaking, these analyses do not offer evidence with regard to “likely”/“unlikely” and “plausible”/“implausible”, they would be widely dismissed as useless. So instead, “translations” are presented, leading to widespread misinterpretations.

• Anon.,

1. I think that a much larger reason for the misconceptions is that very few users understand much probability, or even much math.

2. My feeling is that COIs are much larger than the one you mention. They extend to research papers as well. Don’t researchers want their techniques to be used? And wouldn’t emphasizing the hypotheses behind the theorems reduce their use? I’d be curious to hear what others think on this topic.

17. Wayne says:

Christian,

I don’t think that “we’re 95% confident” is an attempt to simplify the frequentist definition of a CI so much as it is the statement of how people think and reason. Most people, if they attempted to make it a bit more rigorous would say that being 95% confident of X means that you believe X to be approximately 19 times more likely than ~X. This is not what a frequentist CI is, and as you say the actual definition of a CI is hard for most of us to get our heads around. (Which is due to the fact that the question that a CI is answering is irrelevant to most of us.)

As I understand it, “we’re 95% confident” is a reasonable approximation of a Bayesian CI, and under certain circumstances a frequentist CI and a Bayesian CI happen to coincide. If that’s correct, then sometimes, perhaps often in the layman’s world, it turns out to be correct, but not by definition.

I’ve got to read the thread another four or five times to try to absorb everything…

• Christian Hennig says:

Wayne: “Most people, if they attempted to make it a bit more rigorous would say that being 95% confident of X means that you believe X to be approximately 19 times more likely than ~X.” Maybe; some more empirical research would be required to check this, but I know some people who teach this wording giving a very explicit definition in such a way that it makes statement 5 true, so as a student of those people one should really tick “true” there.
In any case I insist that it is obfuscating terminology, and better qualified as “unclear” than as either true or false.

18. Dean Eckles says:

Seems like it would be nice to see if this replicates when there is a correct answer in the list.

19. Dean Eckles says:

I also wish the instructions were a bit clearer, especially the part about what logically follows and what can be assumed about how the confidence interval can be produced. Just because a scientist says that their 95% confidence interval is [a, b] that doesn’t mean that they used a method that actually has the advertised coverage! Perhaps Professor Bumbledorf has a very small sample and used an interval based on the t distribution, but his data is not close to normal. This slippage between the advertised coverage and the actual coverage happens a good deal — consider the small sample bias of sandwich standard errors, the bias of non-sandwich standard errors with heteroskedastic residuals, etc.
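The gap between advertised and actual coverage can be seen in a small simulation. This is only a sketch under assumptions that are not in the comment above: samples of size 5, lognormal data standing in for "not close to normal", and a hard-coded t critical value of 2.776 (4 degrees of freedom). The function names are made up for illustration.

```python
import math
import random

T_CRIT_DF4 = 2.776  # 0.975 quantile of Student's t with 4 degrees of freedom (n = 5)

def t_interval(sample, t_crit=T_CRIT_DF4):
    """Nominal 95% t interval for the mean (critical value assumes n = 5)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / (n - 1)
    half = t_crit * math.sqrt(var / n)
    return mean - half, mean + half

def actual_coverage(draw, true_mean, n=5, nsim=20000, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        lo, hi = t_interval([draw(rng) for _ in range(n)])
        hits += lo <= true_mean <= hi
    return hits / nsim

# Normal data: the t interval's assumptions hold.
normal_cov = actual_coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
# Lognormal(0, 1) data: strongly skewed, true mean is exp(0.5).
skewed_cov = actual_coverage(
    lambda rng: rng.lognormvariate(0.0, 1.0), true_mean=math.exp(0.5)
)
print(f"normal data:    {normal_cov:.3f}")
print(f"lognormal data: {skewed_cov:.3f}")
```

Under these assumptions the normal case should land close to the advertised 95%, while the skewed case falls noticeably short, even though both intervals would be reported as "95% confidence intervals".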

20. Nicolas says:

Great discussion in the comments. I wonder: if statement 3 (The “null hypothesis” that the true mean equals 0 is likely to be incorrect) is false, then how can I interpret the results that the authors present about their own experiment? On the basis of finding a low number of people that correctly identify all statements as false, they conclude that “Our data, however, suggest that the opposite is true: Both researchers and students in psychology have no reliable knowledge about the correct interpretation of CIs.” Essentially, they conclude this on the basis of finding a CI for the number of people that give a completely correct response that is in the low range and excludes large numbers (say, 100 people and upwards). Are the authors then not themselves committing the error that they say statement 3 contains? Or am I being unfair here and, if so, could someone explain why? Thanks.

21. Perhaps I’m displaying my ignorance, but I’m a bit puzzled by item 4 on the survey:

‘There is a 95% probability that the true mean lies between 0.1 and 0.4.’

My understanding was that a 95% CI means that for a large number of similar samples, 95% of the calculated CIs will contain the true value for the estimated parameter. If so, then from the Bernoulli urn rule, the probability that the measurement actually done is one of those 95% is, um, 95%. Right?

Can somebody help me?

• It depends entirely on what you mean by probability. I believe the frequentist answer is that if there is a true parameter, it is either in [0.1,0.4] or it isn’t, since there’s no notion of repeated intervals all exactly equal to [0.1,0.4] but some of them having the correct parameter value and some not… there’s no sense in which a Frequentist can assign a probability *to the interval*.

The only thing that Frequentists can assign a probability to, is the *interval construction process*. So in your analogy, the interval construction process is like a bag from which you pull intervals, and the interval [0.1,0.4] is like a particular ball.
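The bag-of-intervals picture can be sketched as a simulation. Everything specific here is an assumption for illustration and not from the survey: normal data with known sigma = 1, a hypothetical true mean of 0.25, samples of size 30, and invented function names.

```python
import random

def z_interval(sample, sigma=1.0, z=1.96):
    """95% interval for the mean when sigma is known."""
    n = len(sample)
    mean = sum(sample) / n
    half = z * sigma / n ** 0.5
    return mean - half, mean + half

def draw_intervals(mu=0.25, n=30, nsim=10000, seed=1):
    """Run the interval construction process nsim times on fresh data."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(nsim):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        intervals.append(z_interval(sample))
    return intervals

mu = 0.25
intervals = draw_intervals(mu=mu)
# Each realized interval either covers mu or it does not (True / False).
covered = [lo <= mu <= hi for lo, hi in intervals]
# The 95% is a property of the process: the long-run fraction of covering intervals.
coverage = sum(covered) / len(covered)
print(f"fraction of intervals covering mu: {coverage:.3f}")
print("first interval:", intervals[0], "covers mu:", covered[0])
```

The long-run fraction should come out near 0.95, but each entry of `covered` is just True or False. That is the frequentist sense in which the probability belongs to the bag and not to the particular ball.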

The Bayesian says nonsense, the only thing I need probability for is to tell me the degree to which something is or is not likely to be true. Then you start with some estimate of what the range of true values is likely to be, you have some other information about what the size of the differences between actual values and predicted values is likely to be, and you calculate from this the probability that the true value is in any particular interval, the probability for [0.1,0.4] is then say 99%… but there is no sense of repeated sampling here… you can’t verify that the probability is in fact 99% by repeatedly observing anything… because the parameter ISN’T OBSERVABLE, if it were, you wouldn’t need probability you could just measure the dang thing.

• The only sense in which repeated sampling holds for the Bayesian, is that they can construct a random number generator which generates random numbers according to the posterior distribution, and 99% of those random numbers from that generator will be in the interval. But this is purely a computational device.

• So is the only reason they claimed that item 4 is false that they don’t understand what probability is?

• They’re restricting probability to mean “frequency under repeated sampling” because they’re talking about confidence intervals constructed using that philosophy. The Frequentist doesn’t ever assign a probability to a parameter… so in that framework, 1,2,3,4 are all false.

5 is weird, I’m not sure the statement has any meaning, and in any case it depends on the definition of “95% confident”. In any case Frequentist confidence is only relative to the *interval construction process* not to the particular value of any given interval.

6 is wrong because it fixes the interval: we only have hypothetical guarantees about the interval-generating process, which will spit out different intervals each time, not the same [0.1,0.4].

If all that seems bizarre to you, congratulations, you’re a natural Bayesian.

• Thanks, I thought perhaps my understanding of the CI was off, but it looks like I’m still sane.

>congratulations, you’re a natural Bayesian.

Actually, I’m a self-taught Bayesian, though I was definitely (> 95% confident) naturally attracted to it.

• Dan Wright says:

Another way to think about it is if Study 1 measures X and finds the 95% CI to be .1 to .4, and Study 2 is a replication and finds it to be .4 to .6, both plausible and suppose each well done and believable. If one interprets both as being the 95% probability that X is within those particular intervals, then you start getting problems if someone is betting with you on X being in each one (and you are being forced to give them odds reflected in whatever you think 95% probability means). I tried this as a classroom demonstration once. Not sure if it worked as I usually tried lots of things because it is a difficult concept. I think it was fun for them anyway.

22. Laurie Davies says:

Statisticians of all hues operate in the ‘behaving as if true’
mode. That is, given a parametric family of models all procedures,
whether Bayesian or frequentist, are based on the assumption that the
data were generated by some theta in the parameter space. The whole of
Bayesian statistics is based on this: two different values of theta
cannot be simultaneously true. From this follows, using a Dutch book
argument, that priors are additive, that is, they are probability
distributions over the parameter space. This in turn gives rise to the
concept of coherence and the claim that non-Bayesians are not
coherent, perhaps even incoherent. And who will willingly admit to
being incoherent? Frequentists also operate in this mode. What is the
correct interpretation of a 90% confidence interval? Its construction
is a procedure which, if followed repeatedly on data sets, results in
intervals which, in the long run, cover the ‘true parameter’ in 90% of
the cases. Arguments on the likelihood principle are conducted on both
sides within the model, that is in the ‘behaving as if true’ mode. The
same applies to the relevance of stopping rules. All the arguments in
this blog on the correct interpretation of confidence intervals take
place in the ‘behaving as if true’ mode. This in spite of the fact
that no statistician believes that the ‘behaving as if true’ mode is a
true reflection of reality, that is, that the data really were
generated by some model in the family. The truth is of many orders
more complicated than this and is probably not known in any case. For
intellectual reasons if nothing else it is an interesting problem to
give an account of statistics which overcomes the contradiction
between ‘behaving as if true’ on the one hand and knowing perfectly well
that it is not on the other. Here is an attempt.

The idea is to treat models as approximations to the data and not to
some underlying truth and to do this in a consistent manner. More
precisely, a model P is an adequate approximation to the data x (real
data always lower case) if ‘typical’ data X(P) generated under P (data
under the model always upper case to make it clear that it is not
assumed that x is an X(P)) ‘look like’ the data x. Even more precisely,
the word ‘typical’ is operationalized by a number alpha, 0 < alpha <= 1,
such that at least a proportion alpha of the data sets X(P)
generated under the model are typical. And yet more precisely, the
words ‘look like’ are defined by some numerical property or properties
of data sets such that typical data sets X(P) exhibit these
properties. What exactly these properties are depends on the data and
the reasons for analysing it. Given data x and a parametric family of
models the approximation region is the set of those parameter values
for which the corresponding models are an adequate approximation to
the data. Here is a simple example. The data x consist of n real
numbers. The model is i.i.d. N(mu,1), alpha=0.95 and the quantity of
interest is the mean. Typically the mean of a sample generated under
N(mu,1) will lie in the interval
[mu-1.96/sqrt(n),mu+1.96/sqrt(n)]. The model N(mu,1) is then an
adequate approximation to the data if the mean of the data lies in
this interval for then typical data sets generated under this model
look like the real data with ‘look like’ based on the behaviour of the
mean. Turning this around the approximation region becomes
[{\bar x}-1.96/sqrt(n),{\bar x}+1.96/sqrt(n)] which is the 95% confidence
region. The interpretation is however completely different. It is the
following. Take any mu in this interval and generate a large number of
data sets X(mu) and calculate their means. Construct the interval
defined by the 0.025*nsim and 0.975*nsim order statistics of the means
where nsim is the number of simulations. Then this interval will
include the mean of the data. In this sense the means of typical
samples X(mu) look like the mean of the data. The following points are
worthy of note: (1) the data x are given, the interpretation does not
involve frequentist repetitions of the data, (2) there are no unknown
parameters, (3) the approximation interval is conditional on the data
x (to misappropriate the Bayesian sense of conditional), (4) there
is no assumption about the ‘true data generating mechanism’, (5) the
data x can be deterministic, (6) the statistician is not in the
‘behaving as if true’ mode.
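The simulation check described above can be sketched as follows. The data x, the values of mu that are tried, the number of simulations and the function names are all made up for illustration. Only the N(mu,1) model, alpha=0.95 and the use of the mean come from the text.

```python
import random

def approximation_region(x, z=1.96):
    """Interval of mu values for which N(mu,1) is adequate, judged by the mean."""
    n = len(x)
    xbar = sum(x) / n
    half = z / n ** 0.5
    return xbar - half, xbar + half

def looks_like_data(mu, x, nsim=4000, seed=0):
    """Simulate means of X(mu) and check that the mean of x is a typical one."""
    rng = random.Random(seed)
    n = len(x)
    means = sorted(
        sum(rng.gauss(mu, 1.0) for _ in range(n)) / n for _ in range(nsim)
    )
    lo = means[round(0.025 * nsim) - 1]  # 0.025*nsim order statistic
    hi = means[round(0.975 * nsim) - 1]  # 0.975*nsim order statistic
    xbar = sum(x) / n
    return lo <= xbar <= hi

# Given data; nothing is assumed about how they were generated.
x = [0.3, -1.1, 0.8, 1.9, 0.2, -0.4, 1.2, 0.6, -0.7, 1.4]
region = approximation_region(x)
xbar = sum(x) / len(x)
half = (region[1] - region[0]) / 2

mu_inside = xbar + 0.5 * half   # well inside the approximation region
mu_outside = xbar + 2.0 * half  # well outside it
inside_ok = looks_like_data(mu_inside, x)
outside_ok = looks_like_data(mu_outside, x)
print("approximation region:", region)
print("mu inside adequate: ", inside_ok)
print("mu outside adequate:", outside_ok)
```

The two values of mu are deliberately chosen away from the boundary of the region, where simulation noise in the order statistics could tip the answer either way. Note that the data x stay fixed throughout; only data under the model are simulated.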

A thorough treatment of the concept of approximation requires a
discussion of among other topics the topology of approximation (a weak
one as characterized by the Kolmogorov metric), the importance of
regularization, the role of functionals, the stability of analysis,
the use of asymptotics and strategies of model choice.

Why the Kolmogorov metric? The concept of approximation is at the
level of samples: P is an adequate approximation to the data x if
typical samples X(P) generated under P look like x. How does one
generate samples X(P)? To ease notation replace P by its distribution
function F. Then a random variable X(F) with distribution function F can be
generated as X(F)=F^{-1}(U) where U is uniform on [0,1]. Given a
second distribution function G generate a second random variable
X(G)=G^{-1}(U) with the same U. Then in general if F and G are close
in the Kolmogorov metric X(F) and X(G) will be close. Depending on the
degree of truncation X(F) and X(G) may well be equal. Thus if F is an
adequate model for x then so will be G. On a meta-level the Kolmogorov
topology is the topology of EDA.
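A small numerical sketch of this coupling argument, under assumptions chosen only for illustration: F is N(0,1), G is a hypothetical nearby model N(0.01,1), and truncation is taken to mean rounding to one decimal place.

```python
import random
from statistics import NormalDist

F = NormalDist(mu=0.0, sigma=1.0)
G = NormalDist(mu=0.01, sigma=1.0)  # hypothetical nearby model

# Kolmogorov distance sup_t |F(t) - G(t)|, approximated on a grid.
grid = [i / 1000 for i in range(-6000, 6001)]
kolmogorov = max(abs(F.cdf(t) - G.cdf(t)) for t in grid)

rng = random.Random(0)
pairs = []
for _ in range(10000):
    # Keep u strictly inside (0, 1) so the inverse distribution function is defined.
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)
    # X(F) = F^{-1}(U) and X(G) = G^{-1}(U) with the same U.
    pairs.append((F.inv_cdf(u), G.inv_cdf(u)))

max_gap = max(abs(a - b) for a, b in pairs)
same_after_rounding = sum(round(a, 1) == round(b, 1) for a, b in pairs) / len(pairs)
print(f"Kolmogorov distance (grid approximation): {kolmogorov:.4f}")
print(f"largest |X(F) - X(G)|: {max_gap:.4f}")
print(f"share of pairs equal after rounding: {same_after_rounding:.3f}")
```

A pure location shift is the easiest case, since the two samples then differ by the same small amount everywhere. It says nothing about models that are close in the Kolmogorov metric in less regular ways, so this only illustrates the idea.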

Likelihood is an indispensable part of Bayesian statistics and is an
important element of frequentist statistics. However there can be no
likelihood based concept of approximation. To see this consider all
absolutely continuous distribution functions F with density f. The
differential operator D is defined by D(F)=f. Equip the space of
distribution functions with the Kolmogorov metric and the space of
density functions with its ‘natural’ L_1 metric. Then D is
a pathologically discontinuous linear functional which maps the first
space into the second. Likelihood has a role to play from the point
of view of approximation. When combined with regularization and
maximum likelihood it is a border post delimiting the possible.

Differences to Bayesian statistics: given data x, a model P and a
definition of approximation bets can be placed on P being an
adequate approximation to the data x. This is decided by a computer
programme with inputs x and P. The bets are realizable in contrast to
Bayesian bets. However the odds quoted even by a Bayesian will not be
describable by a probability measure. The reason is simple. It is
perfectly possible for two different parameter values to be adequate
approximations. There is no exclusion. Coherence has no role to play
and neither does the likelihood principle. From the point of view of
approximation there is not one likelihood for the data but arbitrarily
many. Further the pathological discontinuity of the differential
operator in the case of continuous models means that the different
likelihoods can lead to arbitrarily different conclusions. There can
be no likelihood based concept of approximation.

Differences to frequentist statistics: the standard interpretation of
a confidence interval has already been treated. One further aspect is
the question of conditioning on the data. Consider data which is
obtained from two different methods of measuring a quantity. The first
may be modelled as N(mu,sigma_1^2) and the second as N(mu,sigma_2^2)
with sigma_1 < sigma_2. Which method is used is decided by an
independent toss of a coin. Apparently a frequentist confidence
interval for mu based on data from method 1 will have to take into
account that the data could have come from method 2. I never follow
this sort of discussion so the last sentence is the impression I
have. From the point of view of approximation the situation is
clear. If the data come from method 1 you approximate them as they
are. The possibility that they could have come from method 2 is
completely irrelevant. In this sense the concept of approximation is
conditional on the data but I try to avoid formulating it in this
manner as the phrase ‘condition on the data’ seems to have
ideological undertones.

23. Christian Hennig says:

This seems too good but also too long to be buried somewhere as comment #110 to a blog post.
I hope that a few more people read and understand it.

• K? O'Rourke says:

But I think it is a misapprehension to take the _data_ as something real that can be directly apprehended.

It came about somehow (randomly or selectively sampled) being recorded (misclassified or biased) and taken as being relevant to something outside this particular data set (inference for what population in what sense).

There is a literature on avoiding probability models – but not my area.

24. Christian Hennig says:

“It came about somehow (randomly or selectively sampled) being recorded (misclassified or biased) and taken as being relevant to something outside this particular data set (inference for what population in what sense).”
OK. But if you start from the philosophical assumption that this “something outside” is much more complex than our simple models anyway, but you still want to use such models to find some simple summarizing “stories” to tell about the data, it is of some use to find sets of models “compatible” with the data in the above sense.

• K? O'Rourke says:

Agreed: only tentatively take a model as true, and only after it passes fit tests of some sort.

As Peirce put it “You don’t see a rose, you hypothesised you are seeing a rose and no doubts about that arose.”

• Christian Hennig says:

Even “tentatively take a model as true” seems to be inappropriate wording for “the model is compatible with the data”, particularly because there are always sets of models for which this holds.

• K? O'Rourke says:

If you are not already aware, John Copas (Warwick) has done some interesting work with sets of models compatible with the data.
