The Fallacy of Placing Confidence in Confidence Intervals

Richard Morey writes:

On the tail of our previous paper about confidence intervals, showing that researchers tend to misunderstand the inferences one can draw from CIs, we [Morey, Rink Hoekstra, Jeffrey Rouder, Michael Lee, and EJ Wagenmakers] have another paper that we have just submitted which talks about the theory underlying inference by CIs. Our main goal is to elucidate for researchers why many of the things commonly believed about CIs are false, and to show that the theory of CIs does not offer a very compelling theory for inference.

One thing that I [Morey] have noted going back to the classic literature is how clear Neyman seemed about all this. Neyman was under no illusions about what the theory could or could not support. It was later authors who tacked on all kinds of extra interpretations to CIs. I think he would be appalled at how CIs are used.

From their abstract:

The width of confidence intervals is thought to index the precision of an estimate; the parameter values contained within a CI are thought to be more plausible than those outside the interval; and the confidence coefficient of the interval (typically 95%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that CIs do not necessarily have any of these properties, and generally lead to incoherent inferences. For this reason, we recommend against the use of the method of CIs for inference.

I agree, and I too have been pushing against the idea that confidence intervals resolve the well-known problems with null hypothesis significance testing. I also had some specific thoughts:

For another take on the precision fallacy (the idea that the width of a confidence interval is a measure of the precision of an estimate), see my post, “Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests.” See in particular the graph which illustrates the problem very clearly, I think:

Regarding the general issue that confidence intervals are no inferential panacea, see my recent article, “P values and statistical practice,” in which I discuss the problem of taking a confidence interval from a flat prior and using it to make inferences and decisions.

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% conf interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic. And taking a confidence interval and treating it as a posterior interval gives problems too. Hence the generic advice to look at confidence intervals rather than p-values does not solve the problem.
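
To make the arithmetic concrete, here is a quick R sketch of that example (the weakly informative prior on the log odds ratio is my choice for illustration, not something estimated from data):

    # Treat the reported 95% CI [1.1, 8.2] as a normal likelihood on the log
    # odds ratio scale, then combine it with a weakly informative N(0, 0.5^2)
    # prior. The prior sd of 0.5 is an assumption: it puts about 95% of the
    # prior mass on odds ratios between roughly 0.4 and 2.7.
    ci       <- c(1.1, 8.2)
    lik_mean <- mean(log(ci))               # about log(3.0)
    lik_sd   <- diff(log(ci)) / (2 * 1.96)  # about 0.51

    prior_mean <- 0    # centered at an odds ratio of 1
    prior_sd   <- 0.5

    # conjugate normal-normal update (precision weighting)
    post_prec <- 1 / prior_sd^2 + 1 / lik_sd^2
    post_sd   <- sqrt(1 / post_prec)
    post_mean <- (prior_mean / prior_sd^2 + lik_mean / lik_sd^2) / post_prec

    exp(post_mean + c(-1.96, 1.96) * post_sd)  # roughly [0.85, 3.4], pulled toward 1

The flat-prior interval treats 8.2 as no less plausible than 1.1; even a mildly skeptical prior pulls the interval sharply back toward small effects.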

I think the Morey et al. paper is important in putting all these various ideas together and making it clear what are the unstated assumptions of interval estimation.

140 thoughts on “The Fallacy of Placing Confidence in Confidence Intervals”

    • Richard: I enjoyed your paper, and agree with much of what you say. But the “no defence” part (on pg 8) seems to make an unjustified leap.

      Having shown that standard Bayes and non-Bayes intervals line up for the student t situation, you write that “If researchers were only expected to study phenomena that were normally distributed […] then inference by confidence procedures might seem indistinguishable from inference by Bayesian procedures.”

      Your general “if” here doesn’t follow from your one specific example. I believe the relevant condition, basically, is that researchers use efficient *estimators* that are Normal, or close to it – in which case the two approaches line up closely. (A precise statement would require a lot more math than your target audience would understand.) To be sure, this isn’t everything, but it encompasses much more than the Normality of data you discuss.

  1. Confidence intervals are numbers. Numbers make a paper scientific; the more numbers you squeeze out of your data, the more scientific you are. The current issue (Dec 11, 2014) of the New England Journal of Medicine has an article on breast cancer treatment where the experimental arm had no overall advantage, but the discussion prominently focuses on one of the subgroups where an advantage is claimed. I had to get my strongest glasses to be able to see the difference in the curves. I love mathematics and numbers, but I do wonder if there is a technical term among statisticians for numerical claims of significance when the separation of the curves requires eagle-like vision.
    I hope I am not missing the point of your article, Dr. Gelman, but I have sat in many lectures listening to p values and CI’s where the curves displayed behind the lecturer were indistinguishable from the twelfth row. There has got to be a short word for this.

  2. OK I’ll take the bait using the bubble example from the linked paper. The CI is estimated from one pair of bubbles, so, thinking of the CI as just another parameter to estimate, of course we have little confidence in its value. If we repeated this thought experiment many times, the computed CI would be very unstable (high variance). But if we sampled and computed a frequentist CI from many locations of bubbles, the variance of this estimate would be much less. Shouldn’t we have more confidence in it?

      • But this already happens, just in a different way. A confidence procedure will (typically) make the interval from 2 samples wider than the one from many samples, so that the confidences can be calibrated consistently. It sounds like you want to somehow “downgrade” our confidence in, say, the 2-sample “50%” confidence procedure (to something less than 50%?) because of the low sample size – even though the interval width has already expanded somewhat to compensate for the lower information. I don’t quite know what you are suggesting concretely, but doesn’t it risk double-penalizing small sample sizes?

        • Hmm, yes, you corrected my fuzzy thinking, so let me try an alternative to “save the concept”! Assuming a uniform distribution across the space where bubbles might arise, two sampled bubbles arising in almost the same location would be unusual. So yes, if this happens to be the case, a frequentist confidence interval is a poor estimate of the location of the mid-point, and my confidence in the point estimate is misplaced. But how often would my confidence be that badly wrong? The example reminds me of Maxwell’s Demon.

        • Internet ate my previous comment, trying shorter version. This bubble argument is a bit dishonest. It basically says that people use CIs to answer questions that CIs technically are not created to answer. OK, we all know that. To show how bad it is, they invent an example where no sane person would condone using a CI from 2 bubbles for the question asked. Rather than fighting a straw man, they’d do better to come up with an example where using CIs off-label seems superficially reasonable but is in fact not just technically wrong, but completely inappropriate.

        • Who writes the labels for confidence intervals? How is this “off-label” or “dishonest”? We did not “invent” this example. It comes originally from Welch (1939), who deals with the setup from a frequentist perspective, inventing a thoroughly wacky confidence interval, and from Berger and Wolpert (1988) in their monograph on the likelihood principle. We use several confidence procedures in this setup to demonstrate that the technical definition of “confidence” has, by itself, no interesting implications for inference. The example serves that purpose quite well.

          I’m perfectly happy if no sane person would use these intervals. That’s our point, in fact. But do look up Welch’s example for another example (besides our CP2) of an interval no sane person would use. That one can justify, on frequentist grounds, weird intervals that no sane person would use is, in fact, our point.

        • Just a nit-picky note that the “50% CI” in example 1 is actually closer to a 75% CI because the underlying data are uniformly distributed. I don’t think I saw that mentioned in the paper, but I quickly skimmed it.

        • Sorry, I was not clear enough. The definition of a CI is given by Neyman (you repeat it) and is valid as such. A 50% CI means that if you repeat the procedure many times, the CI will catch the true value 50% of the time. It has nothing to do with figuring out how probable it is that your particular interval, for the data that you already have, contains the true value. You write as much in your paper. This is what’s on the label. Off-label, people use CIs to figure out the uncertainty of the measurement that they’ve already made. I don’t know for what purpose Welch invented this example in 1939, but the argument “don’t ever use this method because I have a wacky example where it does not work” is not persuasive at all. What is persuasive is a realistic example where it does not work.

        • D.O.:

          As far as I’ve ever seen in applied statistics, people only use CI to figure out the uncertainty of the measurement that they’ve already made.

          To call this an “off label” use of confidence intervals is like calling intoxication an off-label use of vodka.

  3. I’m a great fan of the authors’ collective body of work (Morey and colleagues), but I wish that statisticians would not say things like:

    “we recommend against X” or “we recommend Y”

    This contributes to making statistics a mysterious, esoteric subject, where the user is expected to follow whatever some expert thinks is the right way (and depending on whom you talk to, this can lead to radically different recommendations) rather than understanding the issue and making their own decision based on that understanding.

    It would be better to state the problem, and explain its causes, and try to communicate that, rather than handing down recommendations.

  4. Philosophically, I am fully on board with a Bayesian and decision-theoretic approach to statistical inference in my own life. Though I do not have a lot of experience with the challenge of eliciting priors from domain experts and the practical compromises one may have to make in methodology in order to get research published in journals.

    I do teach mathematical statistics courses often, and the lists of topics I am given are classical in nature. For someone who is aware of many of the problems with the underlying philosophy, common misconceptions, and misapplication in research practice, what approach to teaching topics like null hypothesis statistical testing and interval estimation do folks here think would represent “best practices”? How much time in the class should be devoted to the “nuance”?

    For a student that may be applying to your PhD program in statistics, what do you expect them to understand about the mechanics of these classical inferential methods and how deeply do you expect them to understand about the pitfalls? Would folks anticipate that there would be any kind of “push back” from folks outside your statistics department by folks that may like to happily continue using and misusing these classical inferential methods?

    • JD asked: “For a student that may be applying to your PhD program in statistics, what do you expect them to understand about the mechanics of these classical inferential methods and how deeply do you expect them to understand about the pitfalls? Would folks anticipate that there would be any kind of “push back” from folks outside your statistics department by folks that may like to happily continue using and misusing these classical inferential methods?”

      My perspective/experience has been that it is important to try to reach people beyond formal education, as well as those still in the formal education system. To that end, I developed and taught for several years a “continuing education” course through a program called a “Summer Statistics Institute”. The course, titled “Common Mistakes in Using Statistics – Spotting Them and Avoiding Them”, seemed to become popular with a lot of people using statistics. For example:
      i. The first time I taught it, a person fairly high up in the state Department of Health and Human Services took the course — and thereafter sent two of her employees each year to take the course.
      ii. The course has usually been the first to fill up each year. (I “retired” from teaching it after a few years, but passed it on to someone else whom I thought would be good for the job; she in turn passed it on to someone else when it went “online” last year because of the pandemic.)

      I have notes for the course posted at https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html . (There may be some broken links; sorry about that.)

      I think that the title of the course has been important: people come into the course realizing that they may be making mistakes in their use of statistics — and a lot of people dislike making mistakes, so are motivated by the opportunity to get help in avoiding making mistakes.

  5. Andrew’s example of the OR of 3 with a CI of [1.1, 8.2] hits the nail on the head. The use of flat priors, either implicitly (as when interpreting a CI as a credible interval), or explicitly in Bayesian analyses, is hardly ever warranted in the fields I work in (psychology/psychiatry/medicine). Standardized group mean differences (at least replicable ones) are usually in the range of .2 to .7 in my field, over an extremely wide range of research problems from clinical trials to behavioral experiments to basic bench biology. People find huge effect sizes with small samples in a pilot study, and wonder why they don’t hold up in the full study. Or worse, they are statistically significant, so they get published, and then the effect sizes end up being used as point estimates to inform power analyses for new (woefully underpowered) studies. A reasonable prior on the effect size, especially with small-N studies, would go a long way to limit spurious findings. (And integrating over the effect size distribution when calculating sample size estimates would yield much better frequentist power calculations.) But Bayesian methods are often taught using “noninformative” priors, perhaps in an attempt to persuade new users that they are being “objective”. What I see happen all the time is that someone tries out a Bayesian analysis with a simple model using a “default” prior, and ends up with a posterior mode and HDI that are identical to the mean and CI they get from a traditional analysis, and decide it’s just not worth the bother. I think we really need to be thinking a lot more about informative priors.
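
    As a sketch of the parenthetical point about integrating over the effect-size distribution, one can average power over a prior on the standardized effect rather than plugging in a single pilot estimate. The prior and sample size below are made up purely for illustration:

    # Hedged sketch: expected ("assurance"-style) power, averaging over a prior
    # on the standardized effect size instead of using one optimistic pilot
    # estimate. All numbers are illustrative.
    set.seed(1)
    n_per_group <- 50
    pilot_d     <- 0.7                                   # inflated pilot estimate
    prior_d     <- pmax(rnorm(10000, 0.35, 0.15), 0.01)  # assumed prior, truncated near zero

    power_at <- function(d) {
      power.t.test(n = n_per_group, delta = d, sd = 1, sig.level = 0.05)$power
    }

    power_at(pilot_d)                # naive power from the pilot estimate, roughly 0.93
    mean(sapply(prior_d, power_at))  # power averaged over the prior, roughly 0.4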

    • Tom M: “A reasonable prior on the effect size, especially with small-N studies, would go a long way to limit spurious findings.”

      Would it really?

      Advocating for a subjective Bayesian approach would yield many benefits, but I doubt this would be a consequence. I think being a “subjective Bayesian” means that we’d have to be open-minded about a very large class of “reasonable” priors.

      • I wasn’t thinking of subjective priors, but I guess it depends on what you mean by “subjective”. If you’re conducting a clinical trial for a treatment for depression, say, I think most researchers in that field would agree that it would be reasonable for a prior to place a very low probability on finding a treatment effect 2 or 3 times greater than any current treatment. Sure, there would be lots of disagreement about a particular prior, but I think everyone would agree that we do have *some* prior shared knowledge about expected effect sizes that is not captured by a uniform prior. And that shared knowledge is objective because it is based on a lot of other research studies, even if the exact study has never been done before. Formalizing it is hard, but I’d like to see more awareness and debate about the issue. Next time someone shows you means and t-tests from an N of 20 study, ask them why they chose a flat prior :)

  6. There is apparently a worldwide shortage of biostatisticians. This is puzzling because, if it’s true, it doesn’t appear to have affected epidemiological or clinical research output to any great extent, since thousands and thousands of peer-reviewed papers are still published every year.

    Here’s what’s happening. Some clinicians have some data, but they can’t analyze it, so they get someone who can run a statistical program (SPSS, SAS, whatever) and they tell them what they want. “We need tables of this variable by these variables.” “We need a logistic regression of variable Y to get an odds ratio for variable X and adjust for these other variables.”

    You see: they are doing the analysis and their statistician is simply processing the data. Whether the statistician is a Bayesian or a frequentist is immaterial.

  7. Authors should read https://hardsci.wordpress.com/2014/12/04/statistics-as-math-statistics-as-tools/
    You keep trying to convince people. I don’t think you know how that works.
    Very off-putting arrogance in that article.

    I agree with its drift, but in what world do you think anybody will change their ways after reading about those contrived examples?
    Acknowledge that many people know that the outer edges of a confidence interval in small data aren’t plausible and temper their interpretations, just like their readers do.
    Acknowledge that some people are still working among those who deny uncertainty and variation (posted here two days ago for Christ’s sake). For them, an interval (as opposed to a *) is a big improvement. These are the people (the choir) who might pick up your paper and you’re alienating them.

    Compliment them for already doing something better than most. Show them how confidence intervals may mislead in their research.
    Design software that allows people to quantify their prior belief and put it in the right form.
    Make an R package that supplements frequentist functions with Bayesian ones, so that people can see side-by-side how things would change.
    Make the right thing easy. Not everybody can be a specialist or hire one.
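
    On the R-package suggestion just above: here is a minimal sketch (my own toy example, a one-sample mean with an assumed weakly informative normal prior) of the kind of side-by-side output such a package could print:

    # Hedged sketch: a frequentist CI next to a Bayesian interval for a one-sample
    # mean, treating the sample SE as the likelihood SD. The N(0, 1) prior is an
    # assumption; with a prior this weak the two intervals are nearly identical,
    # and a tighter prior would pull the Bayesian interval toward prior_mean.
    compare_intervals <- function(y, prior_mean = 0, prior_sd = 1, level = 0.95) {
      freq <- t.test(y, conf.level = level)$conf.int

      se        <- sd(y) / sqrt(length(y))
      post_prec <- 1 / prior_sd^2 + 1 / se^2
      post_mean <- (prior_mean / prior_sd^2 + mean(y) / se^2) / post_prec
      post_sd   <- sqrt(1 / post_prec)
      z         <- qnorm(1 - (1 - level) / 2)

      rbind(frequentist = as.numeric(freq),
            bayesian    = post_mean + c(-z, z) * post_sd)
    }

    set.seed(123)
    compare_intervals(rnorm(20, mean = 0.4, sd = 1))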

    • Geno:

      Without commenting, positively or negatively, on the tone of Morey’s article, let me say that there’s research and there’s rhetoric. The two can sometimes be combined—and I agree that, all other things equal, it’s good to be as persuasive and convincing as possible. But there’s something to be said for bluntness when a topic is still up in the air and we’re still trying to figure things out. I think it’s just fine for a researcher to state his or her position without trying to get anyone to change their ways. Once we have a better take on the problem, someone can always come back with the user-friendly version, complimenting people and all the rest. There’s definitely room for that sort of persuasive presentation, and there’s also room for blunt clarity. And, also, as you say, writing an R package is a good idea. Someone should do that too.

        • genobollocks and sanjay are SPOT ON. and your response is, “Someone should do that too.”? if you really want people to start being better scientists and data analysts then you’d do well to spend less time on this blog and put time into giving people concrete tools they can use to implement the methods and approaches you advocate. sure, there is no simple ‘cook book’ for good scientific and data analytic practice, but if you want people to use certain tools and avoid others, you have to do more than just describe how great the good tools are and how deficient the bad tools are. you have to give them the tools. and books/papers/blogs are not tools, they are instruction manuals. you can’t hammer a nail with a piece of paper that describes the best technique for hammering a nail. you need an actual hammer.

        • Sentinel:

          You seem to have a pretty good idea of what you’re looking for, and you seem to be pretty clear that it’s not what I’m doing, so maybe you should develop these tools yourself. Blog commenting is fun, but, hey, go make some tools already! In the meantime, I think I’ll continue writing and coding, developing methods, working on examples, reflecting on the work of others, and generalizing where possible. We each have to do what we can.

          Best of luck with the hammer construction.

        • Yeah that was a bit funny. Apart from Stan, I think what AG does here is one of the most educative experiences in statistics for me (mostly because I actually read and get it as opposed to more condensed texts).
          Morey also developed an R package, BayesFactor, that makes things more user-friendly. I hadn’t checked it out last night (off-putting tone…), but I now did and actually they do give functions called ttestBF, anovaBF, regressionBF.

          > when a topic is still up in the air and we’re still trying to figure things out
          Maybe this statement is true for some people. Maybe I was the beneficiary of a particularly accurate frequentist statistics education. At my uni, you can wake any psy student at night and they can regurgitate “a 95% confidence interval does not contain the true value with 95% probability; if we were to make a CI hundreds of times, it would overlap the true value 95% of the time”. They can even explain this.
          This sort of teaching definitely contributes to people perceiving statistics as “not for normal people”, so I was happy when I found out how much more intuitive things are in Bayesian statistics.
          So, my position is: some people know about these deficiencies and have known about them for a long time. I think these people are a large group.
          They will be put off by that article and say stuff like “well I’m a real world psychologist and my data are normal, I don’t care about your stupid submarines”.
          And they will be wrong, because they’d benefit from switching. See what Shravan wrote below, for example they wouldn’t have to go through mental contortions every time they explain what a CI really is. It can help them draw more accurate conclusions from small data (some people seem to loove small data).

          For that matter, I don’t think the problem statement was blunt or clear. Don’t show how Bayesian statistics can save submariner’s lives. Psychologists don’t save submariners. We torture data and sometimes people too ;-)

          There’s a section there called “Student’s t interval”. It’s a mess. Unreadable, really. They say:
          1. now a real example because people will say our submarine example was unrealistic
          1a. actually these people (you, the reader) are idiots, because showing that CIs break in some cases is totally sufficient to advocate switching
          1b. actually these people (you, the reader) are even worse idiots, because they misinterpret CIs (look, we found a guy, Cumming, 2014).
          2. but ok, here’s a real-life example for you fools: N=2 drawn from a normal, 50% CI.
          3. oh you were expecting something drawn from real-life research, related to what people actually do? Maybe you were expecting us to tell you a vivid story to engage you? Didn’t you read 1a and 1b? You’re an idiot.

          On a different note, and to show that I don’t think I’m above learning from someone a little more polite, I really would like some good education on the matter. I’ve been using lme4::glmer a lot recently (empirical Bayes?!). What mistakes am I making if I interpret the confidence intervals I get as “uncertainty intervals”? Does that change depending on whether I use bootstrapped, profile or Wald intervals? I actually have clear prior information about how large I expect the effect size for the main predictor to be, but I also have a 6-digit N. Should I bother?
          I think none of what Morey et al. wrote matters with regard to these questions, but you can see that I’m clearly not educated enough to really know. For example they talk about non-normal data violating some sort of assumption. I have non-normal data, but this doesn’t make my uncertainty (?) intervals off as far as I can tell.
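
          (As a minimal sketch of how the three interval types can be compared, using lme4’s built-in sleepstudy data and lmer as a stand-in for whatever glmer model was actually fit:)

          # Hedged sketch: Wald vs. profile vs. parametric-bootstrap intervals in lme4.
          library(lme4)

          fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

          confint(fit, method = "Wald")              # fast normal approximation (NA for variance components)
          confint(fit, method = "profile")           # profiles the likelihood
          confint(fit, method = "boot", nsim = 200)  # parametric bootstrap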

        • Re: “off-putting tone” and “more polite”, I think these perceptions are interesting. The tone in our manuscript is actually not different from what one might often find in an academic talk. Colorful quotes, examples, etc; the perception that it is “impolite” is driven largely by the context: it is a manuscript. When writing this manuscript, I aimed for a more talk-like style to try to engage readers who might not otherwise read an entire paper about CIs. Besides, readers are not monolithic. What doesn’t work for you, works for others. Writers have to make choices for a diverse readership. YMMV.

        • Maybe I am in the subset of readers who know what the proper contorted definition of a CI is, but still don’t feel like they’re part of the Bayesian crowd, and hence manage to feel offended where few others would be.
          I’d hazard that this subset is actually quite large. Another even larger subset, those who are actually using CIs as uncertainty intervals would probably be embarrassed and on the defensive after reading your manuscript.

          In a talk, you know your audience a bit better and it’s clearer to them whether you’re just a huge Princess Bride nerd using that quote tongue-in-cheek or if you’re that annoying guy who corrects people when they say literally for figurative things.

          I meant the above as constructive feedback. You obviously have both in you, making things easy for people to use with BayesFactor, and making truths hard to accept to those who don’t embrace them yet with a manuscript that panders to those that already agree with it.

          I really think the Student’s t section would be much improved by avoiding the repetition of the argument (CIs being wrong sometimes is a reason not to use CIs as a matter of principle).
          If you’re going to provide a real research example (you obviously feel like you have to), use a proper one and use it as an additional argument. Now you have the argument from principle and the argument from being misled in real research.
          The way you discuss it still feels like I could just get away with using CIs if my data are normal and I don’t care about your principles, just about being right in the application I have.

        • I’m just going to comment here that “(look, we found a guy, Cumming, 2014)” seems to ignore the fact that Cumming (2014) is essentially the APA’s new statistical guidelines.

  8. These very, very old misinterpretations and rigged examples (e.g., from Welch, and Berger and Wolpert) have been shown for the howlers they are in many published papers not cited. (Berger will quickly back off them once shown the stipulations that are violated.) On the confidence fallacy (fallacy of probabilistic instantiation)–which is the basis for Bayesian epistemology–it’s Fisher who explicitly commits it (check Fisher 1955). Mistaken interpretations throughout this post and paper are too numerous and basic to delineate. But sure keep cranking more publications repeating them. Too bad (as Neyman would say).

    • Mayo:

      It’s sad but true that various errors are well known but keep being made. Like it or not, people do interpret conf intervals as uncertainty statements, and people do use the “conf interval excluding 0” rule as a way to make decisions and claim that an effect is true. So I think it’s valuable for Morey et al. to point out these sorts of issues, as long as there are people who think that the use of conf intervals rather than p-values solves the problems of null hypothesis significance testing.

      And, as I’ve said many times, I don’t think Bayes factors or noninformative Bayesian inference solve the problems of null hypothesis significance testing, either. I have the feeling that we’ll be having these problems as long as people try to prove pet hypothesis A by rejecting straw-man hypothesis B.

      I like how Kaiser put it, with the man in the gray flannel hypothesis.

      • I take issue with this “straw-man” hypothesis stuff, as it is a straw-man itself.

        A straw-man null is a symptom of bad research design not of hypothesis testing per se.

        • Fernando:

          The straw-man is something built in order to be destroyed (dictionary definition: “a sham argument set up to be defeated”). The “straw-man hypothesis stuff” which I (and others) discuss is, unfortunately, done all the time. It is, indeed, a standard tool in data analysis.

          But, just to keep things clean, simply remove the phrase “straw man” from my comment above and everything still holds. That is, I have the feeling that we’ll be having these problems with null hypothesis significance testing as long as people try to prove pet hypothesis A by rejecting hypothesis B.

          That is, the problem as I see it arises from the idea that you can prove a very general hypothesis A by rejecting a very specific hypothesis B. For the purpose of this discussion you can forget about B being a straw man.

        • Fernando,

          Point to a paper that uses NHST but not a strawman null hypothesis, i.e., where the null hypothesis is predicted by the researcher’s theory. This is extremely rare.

        • Fernando,

          Do you ever expect two different treatments to have the exact same effect? Or for the two groups to really come from the exact same population and stay that way throughout the study? No.

          We only care about estimating parameters, whether they appear to be different in some way relevant to practice, and whether there is reason to doubt that the treatment is the primary cause of the observed difference (ie explore the data and look for alternative explanations such as procedural artifacts, etc).

          The whole null hypothesis thing is superfluous to our goals and only serves to confuse people (because why were they taught to do it if it’s pointless?). An additional aspect that doesn’t necessarily follow from NHST, but occurs in practice, is that it seems to facilitate hiding possible conflicting information behind the averages they used to do the t-test or whatever.

        • Who says they have to be exactly the same? You could test the hypothesis that treatment A is at least twice as effective as B.

          Typically that is what you care about, not whether two parameters are exactly the same, or exactly zero. You can be more creative.

        • “Who says they have to be exactly the same? You could test the hypothesis that treatment A is at least twice as effective as B.

          Typically that is what you care about, not whether two parameters are exactly the same, or exactly zero. You can be more creative.”

          Yes, that would be a very simple way to increase the utility of NHST, set the null hypothesis at the level of practical/clinical significance. But can you find one example of someone doing that? In the case of testing medical treatments I would guess zero examples are published.

        • I agree. I just think we need to distinguish between what NHST can do, and how it is actually used in practice.

          My sense is that anything that is used _en masse_ will be misused, including credible intervals.

          In general I prefer a combination approach that starts with a sharp null, data plots, modeling, estimation, and simulation.

        • Does anyone go into a comparative drug trial with the precise and scientifically based hypothesis that the two drugs are exactly equally good??

          I think question is thinking of the kind of thing where you have some standard model of particles and interactions, you predict that in a lot of particle collisions you will occasionally produce exotic particle FOO and the theory predicts that the mass of FOO should be EXACTLY equal to the mass of well known particle BAR.

          In such a situation, you have a precise sharp and fundamental hypothesis that M(FOO) = M(BAR) otherwise the model is broken, and if you can show that the data makes this prediction unlikely you can topple a whole bunch of assumptions about how the world works… that’s pretty darn rare as he (she?) mentions.

        • In the context of drug efficacy, you might have this if you’ve got some known cocktail of N compounds, and you hypothesize that compound X is the only active ingredient, so you give the cocktail to one group and the compound X purified to another… but this is again, very rarely the kind of thing we’re doing in most real-world usages of hypothesis testing.

    • I am trying to follow the work by Wagenmakers, Kruschke and others on NHST, confidence intervals, star gazing, etc.
      Many of these articles suggest that the problems were described a long time ago but remain underappreciated. Some articles suggest easy ways out or state how things should be done.

      As someone not really involved and with only superficial knowledge, I would like to add that people underestimate the problems outsiders have in understanding the subtleties. Add to that standards in academic fields and journals which make it hard to break out of certain habits. As an example: recently, I (as a very junior researcher) invested substantial time in avoiding displaying estimation results in overly long tables. I made graphs to show the results and made model-check visualizations. I heard back from a journal that they had issues with the way I presented the results. One of the graphs resembled a graph from a journal article I referenced (to avoid plagiarism and give credit for the inspiration of the graph), but in general they just wanted to see long regression tables.
      I cannot imagine what will happen once I start abandoning some other practices.

      Maybe I would try to, but first I would need a clear idea of how to do it. I miss applied articles, published in non-methodological journals, where the authors show novel (not replication) research using the practices they preach. That would help clear the path.

      A last remark to Mayo: Maybe it is because I am a non-native speaker, but your post is hard to grasp. If you would like to have a wide audience, it would help to be a bit more transparent. I mean this in the best possible way. The only thing I could deduce from your comment is that what they wrote is wrong, and something about cranking publications.

      • “the problems were described a long time ago but remain underappreciated.”

        I don’t think anyone has been able to describe the primary problem better than was done here:
        Meehl, Paul E. (1967). “Theory-Testing in Psychology and Physics: A Methodological Paradox”. Philosophy of Science 34 (2): 103–115. doi:10.1086/288135

      • I think there is a basic issue that the subtleties of the reasoning behind NHST are very difficult for either textbook writers or instructors to explain in simple English sentences. That, combined with the way people are taught in the introductory classes where NHST is introduced, means that in a lot of cases wrong ideas are learned, and they are only possibly undone for people who take much more statistics later. If I am looking at a possible textbook, this is usually the first section I check.

        • Agreed. In fact, these subtleties may even be impossible to explain in “simple” sentences in any language.

          This dilemma comes up in another situation: people who do statistical analyses and need to explain their results to their boss, who will not accept anything that is not simple. I sometimes tell my students, “You need to understand the precise definition, but you may need to use something that just gives the general idea to communicate to your boss,” and for that purpose, I think something like “a range of plausible values” is an acceptable description of a confidence interval. But I’m open to other suggestions.

          There was recently a discussion in the Statistics Education Section of the American Statistical Association on ways to describe confidence intervals — not just precise definitions, but also metaphors/analogies that can help give the idea (e.g., in intro stats — or to one’s boss). It’s at http://community.amstat.org/communities/alldiscussions/viewthread/?GroupId=1783&MID=21758. I don’t know if you need to belong to the Stat Ed section or just to the ASA to see it. I found it interesting to read different people’s ideas, and the back and forth of correcting misinterpretations, and sometimes modifying analogies to give better explanations.

        • Sounds similar to some of the points Feynman made about the unusual efficacy of math in describing physics. Some of the concepts in physics you can translate into words, & analogies, & try to make it *seem* as if you are describing the physics itself in sentences.

          For simpler concepts & situations it may indeed work. But there’s no mistaking that the *real* physics is described by a precise math equation. And whatever translation into sentences you do is only an approximation and in some limiting case it always fails.

          So also, for NHST, the biggest problems seem to arise when someone tries to translate a complex situation into “simple” sentences. That’s where the biggest errors & pitfalls lie.

        • The last I looked at intro textbooks (2 or 3 years ago), the one that seemed the best of those I looked at was DeVeaux, Velleman, and Bock, Stats: Data and Models, 3rd ed. I’ve got some comments on using it in a particular course at http://www.ma.utexas.edu/users/mks/M358KInstr/M358KInstructorMaterials.html . However, that course is for math majors, so includes more mathy stuff than the usual intro stat course. But possibly one of the other books by the same authors might be good for a more standard intro stats course.

          For many years, Moore and McCabe’s Introduction to the Practice of Statistics was pretty good, but it seems to have gone downhill since Moore retired and a third author was added.

    • Mayo:
      I am not surprised that you disagree, but I think it would be interesting to understand the nature of your disagreement. Do you think that we are wrong that the three errors we identify are, in fact, errors (ie, they are not logical implications of “confidence” procedures as Neyman defined them)?

      Echoing what I said above, I think Neyman did good, careful work and I think he was very clear and principled. I personally don’t have any use for confidence intervals, but regardless, surely you agree that the first step to having a *good* debate about confidence is delineating its logical implications, and ensuring that people don’t confuse them with, say, fiducial or objective Bayesian intervals?

      • Since Mayo has not responded, perhaps I can answer part of my own question through Mayo’s work. Mayo (1982; available from Mayo’s website) writes:

        “It must be stressed, however, that having seen the value x, NP theory never permits one to conclude that the specific confidence interval formed covers the true value of θ with either (1 – alpha)100% probability or (1 – alpha)100% degree of confidence. Seidenfeld’s remark seems rooted in a (not uncommon) desire for NP confidence intervals to provide something which they cannot legitimately provide; namely, a measure of the degree of probability, belief, or support that an unknown parameter value lies in a specific interval. Following Savage (1962), the probability that a parameter lies in a specific interval may be referred to as a measure of final precision. While a measure of final precision may seem desirable, and while confidence levels are often (wrongly) interpreted as providing such a measure, no such interpretation is warranted. Admittedly, such a misinterpretation is encouraged by the word ‘confidence’.” (p 272)

        Mayo here is accusing Seidenfeld of committing what we call the fundamental confidence fallacy. (In our article, we also make the point that the FCF is encouraged by the word ‘confidence’.) Mayo’s remarks about CIs being inappropriate for “final” precision can of course be extended to what we call the likelihood and precision fallacies, since those are “final” judgments as well, and CIs are just as inappropriate for these. So it seems that Mayo agrees with us on our three fallacies (unless I misunderstand her, or she has changed her mind).

        In response to Seidenfeld’s counter-intuitive example that showed a similar disconnect between “confidence” and “final precision”, Mayo writes:

        “To this the NP theorist could reply that he never intended for a confidence level to be interpreted as a measure of final precision; and that he never attempted to supply such a measure, believing, as he does, that such measures are illegitimate. It is not the fault of NP theory that by misinterpreting confidence levels an invalid measure of final precision results.” (p 273)

        This echoes (or, rather, we echo this, since she was first…) our statement that “Confidence procedures were merely designed to allow the analyst to make certain kinds of dichotomous statements about whether an interval contains the true value, in such a way that the statements are true a fixed proportion of the time *on average* (Neyman, 1937). Expecting them to do anything else is expecting too much.” As Bayesians, we disagree with Mayo on the value of measures of final precision, of course, but we definitely agree on the problem of misinterpreting confidence intervals.

        Finally, Mayo also writes:

        “In my own estimation, the NP solution to the problem of inverse inference can provide an adequate inductive logic, and NP confidence intervals can be interpreted in a way which is both legitimate and useful for making inferences. But much remains to be done in setting out the logic of confidence interval estimation before this claim can be supported — a task which requires a separate paper.” (p 273)

        Mayo, almost 50 years after the theory of CIs was laid out, is saying that “much remains to be done” before one can support the statement that NP confidence theory is “both legitimate and useful for making inferences.” I found this to be a fairly staggering statement from a frequentist, to be sure. We disagree with Mayo’s estimation that this can be done — and are unaware of any work in the 32 years since that paper that did it — but it would certainly be valuable work to try.

  9. I have some comments on this paper. (I should say that I do understand the problems with CIs, and I report 95% credible intervals in my own published work. So I didn’t need to be convinced to switch to credible intervals.)

    I think that the submarine example is just a distraction. As a reader I don’t want to have to absorb a “real life” example to see the point. Just show the widely separated data points vs narrowly separated data points (the two bubbles). When I saw this example, I was distracted by the premises. Why am I allowed only one attempt? Why do we see only two bubbles (is that even physically realistic, that only two bubbles come up)? Why don’t the bubbles come from near the hatch? How are these bubbles even being generated (how can they be evenly generated from the submarine’s entire length; what about marine life generating false positive bubbles)? Also, in real terms, I never have such sparse information. The fact that one has to devise such an artificial example to make the point makes the reader suspect that one is scraping the bottom of the barrel, as it were.

    The authors say that they want to keep the example simple, and therefore have N=2. But I never have N=2. I don’t think it helps me to understand the point with this artificial example. Even worse, in the second example (where the bubbles are spread apart) the credible interval is computed using special knowledge about the problem (the width of the submarine limits the range of possible distance between the bubbles); I have never been in that situation in my life where I could say, yeah, I know the lower and upper bound of possible values.

    Related to this, in all the research I do, I always plot my confidence intervals (frequentist) and Bayesian credible intervals side by side. They are always nearly identical, with Bayesian CrIs often a bit wider. In my two years of fitting Bayesian models, I have never been in the situation that the authors describe, where the two would be so wildly different. So, if I were a frequentist that needed convincing to start plotting CrIs, this paper, especially Figure 2, would not convince me, simply because it doesn’t relate to anything that ever happens in my life as a researcher.

    The way I justify the use of CrIs rather than CIs in my classes is to tell students that what they *think* that CIs are giving them is what CrIs are giving them, and so the latter are the more sensible thing to show. Surely that’s enough justification? Also, my justification for using Bayesian models (in my classes) is that they allow for more flexibility. One has to concede, however, that when one has a lot of data (which I usually do), the statistically justifiable *decision* we can make based on a frequentist model vs a Bayesian model is not going to be that different. It’s an unfortunate fact that frequentist usage is so distorted (a misunderstanding about what p-values tell you, arguing for null results, not checking model assumptions, etc. etc.). But that’s not the fault of the frequentist method; it’s the fault of the user, or perhaps I should say, the mis-user.

    • Shravan:

      Here I am responding to blog comments at 2:30 in the morning . . . it’s because I’m using it as a distraction from some real work, I have a bunch of stuff due in the next couple of days . . .

      Anyway, I just want to say it’s very easy to get a Bayesian posterior interval that’s much much narrower than the classical confidence interval. All you need is a case where you have strong prior information.

      For example, let theta be the difference in Obama support for some category of U.S. women of childbearing age, comparing two different parts of their monthly cycle. My prior, based on political science knowledge, is something like N(0,sigma^2) (or, in Stan notation, normal(0,sigma)), with sigma some small value such as 0.001 (that is, 1/10th of one percent) or, hey, let’s go wild, 0.01 (in the unlikely event that the effect is that large). Meanwhile the data from a study published in Psychological Science gives a 95% confidence interval of something like [0.03, 0.31]. Put these together and you get a posterior interval that’s pretty close to normal(0,sigma): the posterior sd is a little bit less than sigma and the posterior mean is a little bit more than 0. The point is that the posterior interval is much much different than the confidence interval.
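
      In code, this is just a normal-normal precision-weighted update, treating the published CI as a normal likelihood (a sketch using the rounded numbers above and the wilder sigma of 0.01):

      # Hedged sketch: combine a N(0, 0.01^2) prior with a likelihood summarized
      # by the published 95% CI [0.03, 0.31]. Numbers are approximate.
      prior_sd <- 0.01
      lik_mean <- (0.03 + 0.31) / 2           # 0.17
      lik_sd   <- (0.31 - 0.03) / (2 * 1.96)  # about 0.071

      post_prec <- 1 / prior_sd^2 + 1 / lik_sd^2
      post_sd   <- sqrt(1 / post_prec)                                 # about 0.0099
      post_mean <- (0 / prior_sd^2 + lik_mean / lik_sd^2) / post_prec  # about 0.0033

      post_mean + c(-1.96, 1.96) * post_sd  # roughly [-0.016, 0.023], nothing like [0.03, 0.31]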

      And this is a real example. Maybe I should blog it. After all, who reads the comments?

      OK, now back to work.

      P.S. My new preferred term for all these is “uncertainty interval.” This makes sense, no? The wider the interval, the bigger the uncertainty. “Confidence” or “credible” interval goes in the wrong direction.

      • But I agree that they are two very different things. Perhaps I should have been more precise: If you fit a Bayesian hierarchical model, vs a frequentist linear mixed model, and if you have lots of data, and if you use uninformative or mildly informative priors, then you will get very similar intervals in the frequentist and Bayesian models. Once informative priors are brought into the game, of course the story would change, but so far I’ve never used informative priors when doing a Bayesian analysis (except in one Bayesian meta-analysis). So, my point is that in the usual case (and what I do is fairly typical stuff in psychology and psycholinguistics), the intervals will be very similar (under the conditions above). They still have rather different interpretations, sure, and sure, I would prefer uncertainty intervals (I like that term, much more descriptive; I’m going to use it in my next paper, if I may).

      • I guess I prefer the other term you use, ‘posterior interval’, because your uncertainty will not be mine if I hold a different prior. ‘Posterior’ focuses attention to the model. Though I could live with ‘uncertainty interval’ too. And hey, who doesn’t read the comments, this discussion is great!

      • You might have heard Don Rubin recalling his conversation with Neyman, where Neyman said he was just trying to get an interval that would be OK no matter how bad the prior was, and he achieved this in some simple examples by not using a prior at all. Larry Wasserman made this precise on his blog by indicating the requirement that coverage be uniform over the whole parameter space for an interval to be a confidence interval.

        As you know, credible intervals, especially with informative priors, will have varying and poor coverage in the tails. Uniform coverage is nice (as you can be wrong about the prior) and certainly worth it if there is little cost involved in getting it. When you have credible informative prior information the cost can be unacceptably high. But if you don’t (i.e. using uninformative or weakly informative) there is little cost and the credible intervals you get are approximate confidence intervals anyway (but missing the analytic guarantees of uniform coverage).
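
        A small simulation sketch of that coverage point (the N(0,1) prior and unit-variance data are just assumptions for illustration): the credible interval has roughly nominal coverage near the prior mean and poor coverage in the tails.

        # Hedged sketch: frequentist coverage of a central 95% credible interval
        # built from a N(0, 1) prior and one observation y ~ N(theta, 1),
        # evaluated at several true values of theta.
        set.seed(42)
        coverage_at <- function(theta, nsim = 20000) {
          y <- rnorm(nsim, theta, 1)
          post_mean <- y / 2          # conjugate update: posterior is N(y/2, 1/2)
          post_sd   <- sqrt(1 / 2)
          lo <- post_mean - 1.96 * post_sd
          hi <- post_mean + 1.96 * post_sd
          mean(lo <= theta & theta <= hi)
        }

        sapply(c(0, 1, 2, 3, 4), coverage_at)
        # roughly 0.99, 0.96, 0.78, 0.41, 0.11 -- far from uniform 95% coverage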

        So “uncertainty intervals” with footnotes about whether you have taken on some well considered risk of non-uniform coverage or simply defaulted to uniform coverage either with or without analytic guarantees on that.

        OK, I’ll get back to work now.

  10. I think it is generally very important that we understand shortcomings of every type of statistical inference in any of the settings we might encounter. I therefore welcome efforts of scholars to elucidate the behavior of our inferential methods (I may digress but I don’t understand the need of people who point out perceived flaws in methods to lean towards taking partisan stances – no type of statistical inference is always better in every setting, so the effort should go into finding out when each of them works well. I find this horserace attitude crippling to our discipline.). It is therefore vital to know when and how CIs work well and when they don’t.

    But anyway, whenever I read something along the lines of that it is not true that “the parameter values contained within a CI are thought to be more plausible than those outside the interval; and the confidence coefficient of the interval (typically 95%) is thought to index the plausibility that the true parameter is included in the interval” and that “CIs generally lead to incoherent inferences”, I always like to suggest the following game which is cast in a decision framework (and – with tongue firmly in cheek – I invite Morey et al to it):

    If Person 1 says that a CI says nothing or little about what may be plausible values for the real parameter, then we can make the following bet: We draw a random sample with known parameter. We then calculate the, say, 99% CI. Person 1 – believing the CI says nothing about the plausibility of values of the parameter – gets a certain payout if the CI does not contain the real parameter. I (Person 2) get the same payout if the CI contains the real parameter. The question I pose to Person 1 is: Are you game?

    Variants include playing the game repeatedly or asking Person 1 how much money she would want to get as payout for both Person 1 and Person 2 for her playing the game.

    Here is the R code for trying it out. The payout is 100 dollars each way, and we play 1000 times.

    ci_game <- function(mu, sd, n = 100, conf.level = 0.99,
                        payoutInCi = 100, payoutNotInCi = 100,
                        verbose = FALSE, reps = 1)
    {
      run <- 1
      payoutProCI <- 0
      payoutConCI <- 0
      while (run <= reps) {
        # draw a sample and compute the t-based CI for the known true mu
        x <- rnorm(n, mean = mu, sd = sd)
        game <- t.test(x, conf.level = conf.level)
        if (game$conf.int[1] <= mu && mu <= game$conf.int[2]) {
          payoutProCI <- payoutProCI + payoutInCi
          if (verbose) cat("Mu is in the CI.", payoutInCi, "dollars go to the one who bet on the CI.\n")
        }
        if (mu < game$conf.int[1] || mu > game$conf.int[2]) {
          payoutConCI <- payoutConCI + payoutNotInCi
          if (verbose) cat("Mu is not in the CI.", payoutNotInCi, "dollars go to the one who bet against the CI.\n")
        }
        run <- run + 1
      }
      list("PayoutPro" = payoutProCI, "payoutCon" = payoutConCI)
    }

    mu <- 2.7
    sd <- 1.5
    n <- 100

    set.seed(100)
    ci_game(mu,sd,conf.level=0.99,reps=1000,verbose=FALSE)
    $PayoutPro
    [1] 99300

    $payoutCon
    [1] 700

    If Person 1 would now like to wire 98600 to me, I can give you my paypal contact.

    • In the quotes in the second paragraph above I overlooked the “thought”, so it does not read as I intended. It should be this: But anyway, whenever I read something along the lines of it not being true that “the parameter values contained within a CI are more plausible than those outside the interval; and the confidence coefficient of the interval (typically 95%) is the plausibility that the true parameter is included in the interval” and that “CIs generally lead to incoherent inferences”.

      • It would be more realistic if you had it randomly choose a distribution (or mixture) to sample from (but always assume normal). Also, add some random amount of systematic error.

        • Furthermore, play the game exactly once, and choose a method for generating the confidence interval that is a valid Confidence Procedure randomly and uniformly from among about 20,000 different methods.

        • Thanks question & Daniel Lakeland. Let me be clear: I’m not talking about robustness of CI procedures, nor the problem of the model having to be true nor any of the assumptions that one must take nor the problem of selecting the proper model from an inifinite space of possible models.

          The game is unrealistic precisely because it allows to get these problems out and let’s one evaluate what a CI does. I was talking about the often uttered claim that a confidence interval does not give an interval of plausible values for the real parameter (however one would want to precisely define “plausible”).

          Daniel: If you use ci_game(mu,sd,conf.level=0.99,reps=1,verbose=FALSE) it is just one trial. I did just that (the seed is only for reproducibility).

          set.seed(666)
          ci_game(mu,sd,conf.level=0.99,reps=1,payoutInCi=1000,verbose=FALSE)

          $PayoutPro
          [1] 1000

          $payoutCon
          [1] 0

        • with a bad choice of confidence procedure, it’s possible to get an interval that is entirely contained in a region of logically impossible values. If the confidence interval construction method is correct, it can’t do this more than 1-alpha of the time… but in a particular case it can. Bayesian intervals from models that give zero density to logically impossible values can NEVER do this.

          I think the proper statement is “confidence interval procedure does not necessarily give an interval of plausible values in any given case, and almost never gives an interval of equally plausible values”
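
          A standard textbook-style illustration of the first point (my toy example, not one from the paper): suppose theta is known to be non-negative, y ~ N(theta, 1), and we use the ordinary interval y ± 1.96. The procedure has 95% coverage, yet some realized intervals lie entirely below zero, while a posterior whose prior puts no mass below zero can never do that.

          # Hedged sketch: a valid 95% confidence procedure whose realized interval
          # sometimes contains only logically impossible (negative) values.
          set.seed(7)
          theta <- 0.1                     # true value, known to be >= 0, near the boundary
          y     <- rnorm(100000, theta, 1)
          lo    <- y - 1.96
          hi    <- y + 1.96

          mean(lo <= theta & theta <= hi)  # long-run coverage is still about 0.95
          mean(hi < 0)                     # about 2% of intervals are entirely below zero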

        • Daniel,

          If across repeated uses of the procedure in a given situation it doesn’t capture the true value 95% of the time then by definition it isn’t a confidence procedure for that situation. So that’s a pretty specious argument. The procedure for getting a proper credible interval can’t be misapplied either.

        • He’s talking about a correct confidence procedure for that situation. The CI’s do get the true value 95% of the time, it’s just sometimes the CI’s lie entirely in regions which are impossible (according to the same assumptions as used to construct the CI).

          What’s happening is that the CI’s are chasing after that 95% coverage. In order to get it in the long run, the normal, unexceptional CI procedure will generate absurd intervals for certain data points. The insane part of this is that, given the actual data, you know whether you’re in one of those bad cases. The Bayesian credible intervals automatically take this into consideration and give the optimal interval estimate for the data actually seen.

          More to the point, if you’re allowed to vary the bet sizes in Momo’s bet above, a Bayesian can use this fact to bankrupt the Frequentist very quickly (unless of course the CI’s are always equal to the equivalent Bayesian Credibility Interval).

          Chalk one up for Bayes.

    • I think your bet would be a real eye-opener to someone who is fundamentally mistaken about what a confidence procedure guarantees – i.e., its very definition – because your bet (by design) tests this fundamental guarantee, and bleeds money quickly from any "person 1" who doubts it.

      While I can only assume that you encounter people in "person 1"'s state frequently, isn't it potentially insulting to raise it here? It could be taken as suggesting that Morey et al. are similarly situated to person 1, which is surely bizarre – clearly they understand perfectly well what a confidence procedure gives you and why your bet merely illustrates this strength.

      Morey et al. are very clearly warning against errors some people make when they take a concrete confidence interval – some specific [a, b] given by one sample – and read too much into that specific interval (e.g. 'look, "a" is close to "b", so my estimate is a particularly precise one'). The warnings seem to make perfect sense to me, and are 100% consistent with understanding and acknowledging the merits of a confidence procedure.

      If you have doubts about the relevance/correctness/usefulness or whatever of their warnings and examples, do you have a way of challenging this (the “bet” form is nice) which isn’t tantamount to “they don’t even know what a confidence procedure does”?

      • I'm not suggesting that Morey et al. or any of the other people reading this are not aware of what a CI is. I'm pretty sure they know more about it than I do. Nor do I intend any insult. I also welcome their warnings and examples and find them very valuable for pointing out when CIs may not work as widely believed (as my first paragraph indicated).

        But the often implicitly or explicitly stated claim that a CI is worthless for inferring a range of plausible values for the real parameter (given it is the "correct" CI) imo simply does not hold up, and the game illustrates that.

        • > But the often implicitly or explicitly stated claim that a CI is worthless for inferring a range of plausible values for the real parameter

          Can you give any citation? (Since this is an "often" claim, this should be easy.) Bottom line: you are introducing some people who "often" state outlandish claims, ignorant of what confidence procedures _do_ give you, into a discussion of people who are _not_ making these outlandish claims, are not subject to this error, and are trying to say something more subtle. Let's be clear: your "person 1" is making a fairly clear, fairly objective, provable mistake. But you suggest that there are many such people? I'd like some evidence, since that seems crazy.

          N.b. IMO in discussions like this you would be well served to use Morey et al.'s distinction between a confidence procedure and a confidence interval. I've never heard anyone credible, with basic stats training, say that the results of a CP are worthless. I've heard people say that sometimes the particular CIs you get aren't so informative, and there are well-known examples where a particular CI you might obtain can convey minimal or even misleading information. Are you really sure you aren't confusing "sometimes particular CIs are low/zero information" (true) with "some CPs are worthless" (rather silly)? Again, it would be good to see an example to show that this is not a straw man.

        • bxg,

          The backwards conclusion Morey et al. make at the end of their paper is, "we have shown that confidence intervals do not have the properties that are often claimed on their behalf," and it's not subtle at all. That's not about the specific intervals they looked at. They're making a sweeping generalization, using logic as bad as the FCF itself (to use their terminology).

        • Thanks bxg for giving me a chance to clarify my position.

          I was reluctant to start pointing fingers at people making "outlandish claims", and I debated whether to start doing that now (as I don't think it helps much for the overall discussion). But since I consider myself an observant person who likes basing her statements on some evidence, I want to counter the "straw man argument" accusation. Here is a great example of what I'm talking about: http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/

          That person is clearly credible, clever, and has more than basic stats training. Still he claims things like (I just take some statements): "Frequentism [in the context of a CI – added] and Science do not mix" or "The frequentist confidence interval […] is _usually_ [emphasis added] answering the wrong question" or "[…] if you follow the frequentists [idea of a CI – added] in considering "data of this sort", you are in danger at arriving at an answer that tells you _nothing meaningful_ [emphasis added] about the particular data you have measured." or "'Given this observed data, I can put no constraint on the value of \(\theta\)' If you're interested in what your particular, observed data are telling you, frequentism [in the context of a CI – added] is _useless_." Morey et al. also claim "confidence intervals do not have the properties that are often claimed on their behalf", and the blog post here is entitled "fallacy of confidence in CI" (I think both of these latter statements were chosen for rhetorical rather than substantive reasons).

          To wrap up my position, I'm convinced that
          a) CIs can be useful procedures for scientific inference about plausible values of the unknown parameter, and
          b) it is very important that we know when, why, and how they do and do not work well (just as for every other inferential procedure).

          I think we (and most if not all others on this blog) will agree with a) and b). For those who do not agree with a) (like the blogger I cited here), I suggest the game above.

    • Larry:

      That might be how you construct confidence intervals, but it's not the only way. As discussed in various places, confidence intervals are commonly interpreted as statements of uncertainty. But, as Morey et al. discuss in their article (and as I illustrate with my graph above), confidence intervals formed by inverting hypothesis tests cannot in general support this interpretation.

      There are different things going on, but fundamentally the problem is that the confidence interval and the hypothesis test are model-based, and to the extent that the model is not correct, the procedures can fail in various non-intuitive ways (for example, the case shown above, in which, as the model becomes less and less supported by the data, the interval becomes increasingly narrow and thus in practice will be taken as implying stronger and stronger evidence).

      The theory of constructing interval estimates by hypothesis-test inversion is important, but it's not the whole story. It's an approach that works in some settings but not in others.

      • The model-based critique works equally well for Bayesianism, unless you are modelling model uncertainty by putting priors on all possible models. The latter is impractical and seldom done.

        Bayes' formula is all conditional on assumptions, as you explained in class. Yet these assumptions are seldom shown explicitly.

        PS just playin Devil’s advocate.

        • “Unless you are modelling model uncertainty by putting priors on all possible models. The latter is impractical and seldom done.”

          I don’t think that “all possible models” can be made to form a probability space, but Bayesian nonparametrics provide methods that, functionally speaking, come pretty darn close. The practicality of these methods is open to dispute, of course, as is what frequency of use counts as “seldom”; still, I feel that you might be overstating your case.

          I’m curious as to what assumptions you see as unstated — did you mean the assumptions that ground the model, or the assumptions that ground the entire Bayesian approach, or something else entirely? (Not disputing that both of these often go unstated; just trying to figure out what you meant.)

      • I am just saying that, mathematically, they are in one-to-one correspondence.
        Every test defines a confidence interval and vice-versa.
        So saying you don’t like confidence intervals “obtained by inverting a test” doesn’t make sense.
        All confidence intervals can be written as the inversion of some test.

        We have had this conversation before.
        I think what you might mean is:
        You don’t like confidence intervals obtained from goodness of fit tests.
        Is this what you mean?

        • Larry:

          What I mean is that there is more than one way to define and construct a confidence interval. Inverting a hypothesis test is one way, but it is not the only way. For example, when we run logistic regression and get estimates and confidence intervals, these intervals are not formed by inverting a test.

          I'm also making the point that you picked up on: that inverting a hypothesis test is sometimes not a good way to construct a confidence interval. It's not that I dislike all such intervals, it's just that these intervals can have problems. All statistical methods can have problems; indeed, I've written many times about Bayesian methods that lead to problems. So my claim is not that inverting hypothesis tests is always bad; I'm just echoing Morey et al.'s point that people should be careful about natural-seeming but sometimes incorrect reasoning, such as interpreting the width of a confidence interval as a measure of uncertainty.

        • Larry, and Andrew, it feels like you are talking past each other, can I try an interpretation and see if it makes sense:

          Andrew: thinking of a known/standard/modified test procedure, and then inverting it to form a confidence interval isn’t the only way to form confidence intervals, you can also think up some other procedure than what’s typically discussed in the standard hypothesis testing literature, and form valid confidence intervals.

          Larry: Every confidence interval that has the confidence property (ie. N% of the time it’s performed it will contain the true value of the parameter when the assumptions hold) can be constructed using a confidence procedure (terminology from the above paper) and such a confidence procedure can always be *interpreted as* a possibly nonstandard or unusual but still valid hypothesis test. Therefore, every confidence interval is generated from SOME valid testing procedure. Every confidence procedure can be inverted into a testing procedure.

          With these two views on the table, my impression is that Andrew is mainly objecting to using off-the-shelf or slightly modified testing procedures to invert into confidence intervals… in other words, if you want a confidence procedure, you might as well think one up that is specialized to your particular needs.

          Is that a fair characterization?

        • I cannot speak for Andrew but, yes, you are correct.
          Every confidence interval is the inversion of some test.
          So saying one does not like confidence intervals obtained by
          inverting a test logically rules out every confidence interval.

        • Larry:

          1. Really? When you get a confidence interval from logistic regression, is it the inversion of a test? I suppose it would be possible to invert the confidence interval and back out what the implicit test would be, but that’s not how the confidence interval is actually constructed, right? It’s constructed by taking the point estimate +/- 2 standard errors.

          2. To repeat what I wrote above, I’m nowhere saying that I “do not like confidence intervals obtained by inverting a test.” I’m just saying that I don’t think inversion of tests is in general a good way to obtain such intervals. I do think it can work in some important special cases.
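
          For concreteness, a quick R sketch of the contrast (simulated data; my own example, not anything from the thread): the routine interval for a logistic regression coefficient is the Wald-style estimate +/- roughly 2 standard errors, while confint() on a glm inverts a profile-likelihood test, and the two constructions need not agree.

          library(MASS)                          # provides profile-likelihood confint() for glm in older R versions
          set.seed(1)
          x <- rnorm(100)
          y <- rbinom(100, 1, plogis(0.5 * x))   # simulated binary outcome
          fit <- glm(y ~ x, family = binomial)

          est <- coef(fit)["x"]
          se  <- summary(fit)$coefficients["x", "Std. Error"]
          c(est - 2 * se, est + 2 * se)          # the point estimate +/- 2 SE interval described above
          confint(fit, "x")                      # the interval obtained by inverting a profile-likelihood test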

  11. The paper seems like a mixed bag. A few comments:

    1. “three common myths about confidence interval that have been perpetuated by proponents of confidence intervals” (p. 1, second column). I could agree with this if it said “some proponents”

    2. (p. 2, top) Did Neyman really say “State … that Theta is in the interval”??

    3. “All of the points we make in this paper extend to 95% CIs and N greater than 2”. (footnote 2, p. 2) I’m not convinced of this — when you’re dealing with N = 2 in the example, your sampling distribution is t(1) — i.e., Cauchy, which is a pretty unusual distribution. For larger N, you’re not going to get as wide variability in standard errors, so the confidence intervals will not show the wide variability in width that is crucial to the argument here.

    4. However, the submarine with N = 2 does make the points that (a) very small sample size problems need to be approached very carefully, and (b) additional relevant information (such as the constraints in the example) is important to take into account — and may (as in the example) make frequentist methods inappropriate, but can be taken into account using Bayesian methods.

    5. The discussion on p. 3 fails to take into account the difference between the “everyday” and technical uses of precision. In statistics and scientific fields, precision and accuracy are two different concepts, whereas in every day use, precision might refer to either concept. The width of a CI can legitimately be considered a measure of the precision (in the technical sense) of the interval estimate, but not of the accuracy (which is what the “Myth” assumes).

    6. The discussion of Myth 2 similarly needs to distinguish between everyday and technical uses of words — e.g., likely, plausible, credible, reasonable.

    7. In my opinion, we’d all do better not to use phrases like “I’m 95% confident,” because they tend to be interpreted intuitively rather than technically.

    8. I believe that Myth 3 would better be addressed by using the concept of "reference category" — that is, what is it that is random in talking about a probability. In the correct interpretation of a CI it is the interval that is random; in the incorrect interpretation, it is the parameter. The concept of "reference category" is useful in other places in discussing probability confusions. (e.g., an example based on one of Gigerenzer's: If a physician says to a patient that taking a certain medication has a 40% probability of producing a certain side effect, the physician may mean that 40% of patients taking it experience that side effect, whereas the patient may interpret it as meaning that he will experience that side effect 40% of the time if he takes the medication.)

    9. "50% certainty" (p. 5) sounds weird to me.

    10. p. 6 — another problem with ambiguity of language: “generally apply to any CI” could mean apply to every CI, or could mean apply to most CI’s

    • 1. We did a survey of the literature in psychology and textbooks. You would, perhaps, be shocked at the pervasiveness of these claims. We will publish the results of this survey in another article, but our statement in this article is accurate as is (we could perhaps add “many” or even “most” and be satisfied that we are right, given our review).

      2. Yes (Neyman, 1937, bottom of page 348: http://www.jstor.org/stable/91337)

      3. You can extend the proof we offer in the supplement (https://www.scribd.com/doc/237420306/Supplement-to-The-Fallacy-of-Placing-Confidence-in-Confidence-Intervals) to larger N (the degrees of freedom on the chi sq distributions change). The key element in the proof is that Xbar and s^2 are independent, and that doesn’t change for larger N or different confidence coefficients.

      4. The small N is not the point here; if the model is true (and it is by stipulation), then the CI should work just fine, regardless of N. I agree that small sample sizes are problematic for other reasons, but the small N is a non sequitur here. We’re critiquing the *reasoning* behind common uses of CIs. That doesn’t change when N is small (assuming the model, and we are).

      5. We use “precision” in the way commonly used by proponents of CIs in the literature we work in.

      6. One problem is that proponents of CIs slip back and forth between the technical definition and the lay definition. We explicitly say this in the paper in Myth 3, but we could mention it in Myth 2 as well…

      7. Agreed. I think this is part of the allure of CIs; the very word “confidence” evokes epistemic ideas, but of course they cannot be used that way.

      8. I’m not sure what you mean. The relevant/recognizable subsets problem is a reference class issue. We explicitly say this in the paper (perhaps I’m misunderstanding what you mean)

      9. Not sure why.

      10. In context, it sounds like Cumming is saying that these apply to any CI, and in my conversations with him about this article he did not object to my characterization of his words.

      • Richard,

        Thanks for your point-by-point reply to my comments. Here are some comments on some of your replies:

        1. I hope you will supply a preprint of the lit survey for Andrew to make available to blog readers; however, I think my threshold for shock in what is in the literature by psychologists (and in what is in many textbooks) is gradually decreasing.

        2. Thanks for the precise reference. It sounds like Neyman could have been more careful in his wording.

        3. I now see from the supplement and comments you have made to others that (as you have indicated in another reply) the phrasing in your draft was confusing. However, I don't think using "atypical" examples of confidence intervals is the best means of argumentation – it would be better to use examples consistent with the procedures that have been used where mistaken interpretations have occurred.

        4. This is moot since you had not made the model clear.

        5. Still, I think it is important to address different usages explicitly – I find it helpful in addressing misunderstandings.

        6. Again, I believe that explicitly pointing out different definitions/usages is important — and needs to be done repeatedly to make the point.

        8. I think we may mean (at least close to) the same thing by “reference category” (my term) and “reference class” (your term). What I was talking about is whether the reference class/category in the definition of confidence interval is “interval” (or, equivalently, and for some purposes important to emphasize, “sample”), or “possible value of parameter”. The first is the correct interpretation of a (frequentist) CI. The latter is correct for a Bayesian credible interval.

        9. See if this helps: 50% probability of an event expresses maximal uncertainty (equivalently, minimal certainty). So saying “50% certainty” seems like an oxymoron (hence, “weird”)

        10. If he did not object, then that does sound like your interpretation is indeed what he intended.

        • As to the point 2 quote from Neyman: he prefaces the whole thing by essentially conditioning it on this confidence procedure being applicable to the situation, in which case he's correct. So one might argue the quote is out of context given how it's being used.

        • I’m unclear how one might argue that. We merely state Neyman’s definition of a confidence procedure. There’s nothing “out of context” about it.

        • If that counts as relevant context, isn’t the statement a tautology? “The mean is 5 is a correct statement*” (…) “*: Where applicable.”

  12. A couple of things about this paper really confused me:

    – The authors quote a couple of statements about interpreting confidence intervals that include phrases like “in the absence of additional information”, then go on to argue that these interpretations are not correct because they don’t use such-and-such additional information, and in their arguments they seem to be acting as if that caveat were not there. I don’t see anything in this paper that contradicts that interpretation-with-caveat. Maybe most people forget about that “in the absence of additional information” bit in practice (this would not surprise me in the least) but that doesn’t seem to be the point they’re trying to make.

    – Why are the statisticians in the submarine example assuming a normal distribution for the bubbles when they know it’s uniform? The entire procedure of t-based CIs depends heavily on a normal population, which is obviously violated here. Is it generally assumed that “frequentist confidence interval” always assumes normality by definition? It seems to me that the “frequentist CI” procedure would use the actual sampling distribution for the mean of an n=2 sample drawn from a uniform distribution 10 m wide. The width of the 50% CI would be the width of the middle 50% of the sampling distribution, as always. One could calculate the width of the sampling distribution as a function of the sample width, and get a completely correct 50% confidence interval. (One could even come up with a 100% confidence interval using the same methods, obviously.) This example is an excellent illustration of why one should check their assumptions rather than blindly calculating things. But the authors seem to use this as an argument against CI’s in general. What am I missing?

    • Hooray, after 66 comments finally someone spotted that the submarine example is completely silly. :)

      Morey et al. demonstrate confidence intervals and credible intervals on the submarine problem by pairing these two interval procedures with two different models. The silly normal model is paired with the frequentist CI while the appropriate uniform model is used with the Bayesian credible interval. Of course the normal model fails miserably. They then claim that the Bayesian interval is superior to the frequentist CI.

      If you are interested, I wrote a blog post which gives the complementary two pairings – Bayes with normal and frequentist with uniform. Of course, you get the reverse pattern: Bayes fails and the frequentist CI gets it right.

      • This is not accurate. A “normal model” was not used for the frequentist CI. CP1 is a “nonparametric” interval that works for the median when N=2 (that’s why it is the same as the Student’s t interval, NOT because a normal model was used). I agree that the line about the Student’s t interval is confusing in that section (and we’ll remove it to avoid further confusion), but the intended meaning there was simply that it is not an obviously “silly” interval. And it isn’t. You can see in the supplement that in frequentist terms, it performs as well as the Bayesian interval.

        Your blog post is simply wrong.

        • I'm sorry, but this is not a matter of opinion. Derivation of confidence intervals is an exact science. Spare yourself (and your coauthors!) the embarrassment and look up the definition of a CI in a textbook. (If you don't know any, I suggest Wasserman's All of Statistics. The definition of a CI is in section 6.3.2.) You will see that a CI is defined in terms of the probability distribution indexed by the parameter. The uniform distribution is then a valid and, in your example, the most reasonable option. Once you plug in the uniform distribution you will get a result that is identical to the Bayesian interval for a uniform likelihood and flat prior.

        • You’re just plain wrong; there is not one way of generating confidence intervals. I cited the actual theoretical literature on CIs – some of which deals with the precise example we discuss – and you suggest that I’m embarrassing myself (you know Wasserman didn’t invent CIs, right?)? Read Welch (1939); are you going to tell me that he didn’t understand Frequentist CI theory, but you do? Welch explicitly says that the Bayes/likelihood interval is not the best way to generate a CI in the uniform example, from a frequentist point of view. He then gives an interval that dominates it in frequentist terms. Our CP2 also dominates the Bayes interval.

          Under CI theory, there is *not* one way to build a CI. What matters is coverage of the true value and exclusion of false values (in long-run terms). There are sometimes several ways to approach this, leading to the counterintuitive results we discuss.

          You should read Neyman (1937) and Welch (1939) before you post again.

          (I also noted with some amusement that you and Erf suggested that we didn’t use the “obviously correct” way of generating the interval, and then you two suggested two *different* intervals as obviously correct, both of which we mention in the paper.)

        • For your argument it is immaterial who invented the CI and what his opinion on this matter was – especially since this was more than half a century ago. You need to engage the modern definition of a CI. Recent statistics textbooks are where you want to start.

          Further, for your argument it is immaterial that statistician X derived a CI (for the submarine example) by assuming Y, because you are not arguing against X and his derivation but against the concept of a CI. If you want to pit CIs against credible intervals you need to provide a CI derivation where your assumptions are close/identical to the assumptions made in the derivation of the credible interval. Your CP1 and CP2 fail to assume a uniform distribution for x, as your derivation of the Bayesian interval does.

          (Btw you won't scare me by quoting dead people. I have no problem stating that some of the work by Fisher, Neyman, Welch etc. was mistaken, wrong, confused, misguided, ignorant and naive. If they made mistakes in a derivation or made implausible/unstated assumptions, too bad for them.)

        • I’m not trying to “scare” you by quoting from the actual statistical literature on this topic. If you are unwilling to interface with this literature — instead relying on an intro textbook that does not address the details of CI theory, and doesn’t even support your points insomuch as it discusses many different ways of constructing CIs — there’s not much else to be said.

        • You might also be interested in Fisher’s opinion on the matter. This is from the discussion on Neyman’s 1934 paper, where he first introduces the idea of the confidence interval (http://www.jstor.org/stable/2342192):

          “In particular, [Fisher, as opposed to Neyman] would apply the fiducial argument, or rather would claim unique validity for its results, only in those cases for which the problem of estimation proper had been completely solved, i.e. either when there existed a statistic of the kind called sufficient, which in itself contained the whole of the information supplied by the data, or when, though there was no sufficient statistic, yet the whole of the information could be utilized in the form of ancillary information. Both these cases were fortunately of common occurrence, but the limitation seemed to be a necessary one, if they were to avoid drawing from the same body of data statements of fiducial probability which were in apparent contradiction.

          “Dr. Neyman claimed to have generalized the argument of fiducial probability, and he had every reason to be proud of the line of argument he had developed for its perfect clarity. The generalization was a wide and very handsome one, but it had been erected at considerable expense, and it was perhaps as well to count the cost. The first item to which he would call attention was the loss of uniqueness in the result, and the consequent danger of apparently contradictory inferences.” (pp. 617-618)

          Fisher also understood that there is not *one unique* way to build a confidence interval. [It is worth noting that in the submarine/uniform case, the Bayes/likelihood interval can be obtained by conditioning on the ancillary statistic, and hence Fisher would identify the objective Bayes interval as the unique fiducial interval. It is also worth noting that Fisher explicitly notes there that Neyman’s theory *does not require this*.]

        • Matus:

          You write, “Derivation of conf intervals is an exact science.” I’m not quite sure what is meant by “exact science” in this context but I don’t think your description is accurate. What, for example, is the exact science behind the derivation of confidence intervals for logistic regression coefficients?

          The derivation of confidence intervals is an exact science in some simple examples but not in general.

        • Exact science: if you give me P_theta(D|theta) and D and specify alpha, I will give you the alpha CI (a, b). You may fail to provide a closed-form solution for P_theta(D|theta) for a particular model (and take recourse to an approximate solution), but that does not make the particular step where we derive the CI from P_theta any less exact. Similarly, you would not describe the second law of motion as approximate or not exact just because you can't measure mass with infinite precision or because you can't derive a closed-form solution for a complex model.

        • Matus:

          No, you misunderstand. I’m not talking about a lack of a closed-form solution, I’m saying that there is no solution at all.

          Here's a simple example: you have data y_1,…,y_n from the logistic regression model, Pr(y_i=1) = invlogit(X_i*b), where X_i is the i-th row of an n*k predictor matrix X. Let's say n=100 and k=10, just to be specific, and let's also specify that X has no multicollinearity. Also, just to be specific, suppose we want a 95% confidence interval for b_1. There is no general procedure for defining such an interval, not in the sense that you mean, in which the interval is obtained by inverting a hypothesis test.

          Nonetheless, practitioners want such intervals (in part, I’d argue, because they’ve been misled by the confident tones of statistics textbooks, but that’s another story). So procedures exist. But it’s not an exact science, it’s a bunch of rules and approximations.

          That’s ok, not everything has to be an exact science. Even if something is an exact science, it depends on assumptions that we don’t in general believe (in your example, you want me to give you P_theta(D|theta) but in any real example I’ve ever seen, this probability distribution is only a convenient approximation).

          So I don’t think it’s a devastating criticism on my part to say that hypothesis-test-inversion confidence intervals are not an exact science. They represent a statistical method that works in some important special cases, as well as a principle that can be applied with varying success in other situations. And that’s fine. The mistake has not been in people coming up with and using this method, the mistake is when people treat it as a universal principle for interval estimation.

        • Andrew,

          I just split what you call CI derivation into two parts: I. the derivation of P(D|theta) and II. the derivation of the CI from it. The former part is difficult for complex models, while the latter is trivial and exact once we have P(D|theta). I called the latter part CI derivation. My use of the term is narrower. I now realize this is confusing, since frequentists will work on both I. and II. when they try to improve the calibration of their CI procedure. However, I. is not only about long-run performance. Assumptions and background knowledge about the data-generating process enter here. I wanted to focus the discussion with RM on part I and put part II aside.

        • The post may be incorrect about the specific assumptions of your CP1 and CP2, but it's not incorrect that there are perfectly valid CIs for the situation, and your paper only works if the CI researcher is an idiot using inappropriate CIs and the Bayesian isn't. matus just reversed the idiots.

        • Our whole point is that yes, the CIs are inappropriate. We actually discuss the "valid" CI for this example in the paper. Its validity rests on its likelihood/Bayes properties, NOT frequentist properties, because the "valid" interval has suboptimal frequentist properties. If you think anyone who uses CP2 over the objective Bayes interval is an idiot, you're calling Neyman and Welch (and many other frequentists) idiots.

        • You’re making a pretty bold assumption that Neyman would use either of those CIs out there on the high seas. I think you’re probably wrong on that.

    • Hi Erf, in response to your first point, in all the examples we give the necessary “extra” information can be obtained from the confidence interval itself. So the “in the absence of additional information” hedge doesn’t work. In response to the second point, they’re not assuming a normal distribution. If I understand what you’re suggesting, you’re suggesting what we call CP2. As described in the paper, this confidence procedure has strange properties (includes impossible intervals, doesn’t track precision of estimate; see also the supplement).

      In none of the examples were any assumptions violated; all confidence intervals were valid confidence intervals for the probability model. The point is that the “confidence” property does not imply that the resulting interval has any good use. Even in cases of “good” frequentist intervals (as Neyman would define them), you get counter-intuitive weirdness (see the supplement; CP2 dominates the likelihood-derived interval in frequentist terms, but CP2 leads to absurd intervals).

      • Ah, I see what you’re saying about “additional information”. I think you’re misinterpreting what’s meant by that, though. If all you tell me is that you have a 50% CI, then from my point of view there is a 50% probability that your CI contains the true value. If you also tell me other information, such as the width of the CI, or known flaws in the model you used to produce it, etc, then I can use this “additional information” to come up with a conditional probability that your CI contains theta _given_ that information. Which, I agree, should be done when possible. But this doesn’t mean the “FCF” is a fallacy. (Your example of a 9-m CP1 interval on p.5 is a case of additional information: the size of the submarine and the shape of the bubble distribution, neither of which are used by CP1.)

        (I wonder how much of this disagreement is due to a semantic or philosophical difference in the definition of “probability”. I assert that e.g. “in the absence of any other information, there is a 95% probability that the obtained confidence interval includes the population mean” is completely correct and consistent as a statement about frequentist probability.)

        Related to this, you’re very concerned in your paper that there’s more than one “valid” CP, and each one gives a different CI; you seem to think this invalidates the statement that a given CI has a 50% chance of containing theta. But if you perform two Bayesian analyses using different priors and different probability models, obviously you’ll get different posteriors; wouldn’t that mean that you can’t interpret posteriors as likelihoods either?

        Thank you for clarifying where CP1 and CP2 come from. This example still doesn’t show what you claim, though, because the biggest difference between the three methods (CP1, CP2, and CredInt) is not in whether they’re Bayesian or frequentist but in how much information each one uses.

        The Bayesian credibility interval (CredInt) that you give uses the length of the submarine, the fact that the bubble distribution is uniform along that length, and the separation between the bubbles; basically all of the available information.

        It's easy to construct a frequentist confidence interval procedure that uses all this as well (and it basically follows the argument you describe in your supplement for CredInt): Let dx=|x1-x2| be the separation between the bubbles. The farthest the mean of this sample could be from theta is (10-dx)/2, so the sampling distribution of the mean, conditional on dx, is a uniform distribution centred at theta with width (10 - dx). Thus the 50% CI is x-bar +/- (10-dx)/4. Which of course is the same as your CredInt, but arrived at using purely frequentist methods.
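
        (A quick simulation sketch of this, my own rather than anything from the paper or the thread, using the 10-m setup: both the [min, max] interval — the 50% Student-t procedure the thread calls CP1, if I follow correctly — and the conditional interval x-bar +/- (10-dx)/4 are 50% confidence procedures, but only the latter's width tracks what the data can tell you.)

        set.seed(1)
        theta <- 0                                   # true hatch location (arbitrary)
        reps  <- 1e5
        x1 <- runif(reps, theta - 5, theta + 5)      # bubbles uniform along the 10-m craft
        x2 <- runif(reps, theta - 5, theta + 5)
        xbar <- (x1 + x2) / 2
        dx   <- abs(x1 - x2)

        cover_cp1  <- pmin(x1, x2) <= theta & theta <= pmax(x1, x2)   # the [min, max] interval
        half <- (10 - dx) / 4                                         # conditional interval half-width
        cover_cond <- xbar - half <= theta & theta <= xbar + half

        mean(cover_cp1)    # ~0.50
        mean(cover_cond)   # ~0.50 as well, but its width reflects the bubble separation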

        CP2 uses the submarine length and the uniform distribution information, but throws away the bubble separation, so obviously it’s going to perform more poorly. Could Bayesian methods do any better without using the bubble separation?

        CP1 does use the bubble separation, but it doesn’t use any information about the submarine at all! If you don’t know whether the length of your submarine is 10mm, 10m, or 10km, it shouldn’t be a surprise that the CI you get from 2 data points is not that useful — and I don’t see how any method could give you a better idea of your measurement precision in that case! This seems to me to illustrate a serious problem with trying to use non-parametric methods with tiny sample sizes, but it doesn’t say anything about CIs in general. What would an _equivalent_ Bayesian credibility interval look like if the ONLY information it’s allowed to use is x1 and x2 (nothing about the bubble probability distribution, submarine size, etc)?

        There are certainly many situations where Bayesian methods are the easiest and/or best ways to incorporate information. But I think the reason your submarine example strikes people as silly is because in this case there’s a very straightforward frequentist CP that you’re ignoring, which undermines your entire argument.

        • You’re missing the fact, which I’ve repeated several times here, that CP2 (and another CP derived by Welch) have better frequentist properties than the credible interval and would be preferred by a frequentist. We explicitly say that the objective Bayes interval is a 50% CP. But it isn’t preferred. The preferred intervals lead to absurdities. I dislike being told over and over how obvious it is that frequentists would use the likelihood/Bayes interval when I’ve cited one of the most important frequentists of the 20th century saying otherwise, and giving explicit frequentist reasoning.

        • Suppose I’m a frequentist. You come to me and say, “hey, we have two models we can consider for the process that generated these bubbles. One of them is completely implausible given everything we know about our lost submarine, but has nice frequentist properties if we completely ignore all of the specific information we have here; the other is almost certainly the correct model, but would produce some very strange results in some hypothetical situations that are nothing like the current one.”

          It seems you’re claiming that, as a frequentist, I’m obligated to choose the latter model because frequentists aren’t allowed to think about, or take into account, the likelihood of different data-generating processes. This hardly seems fair; as matus pointed out in his/her blog post, you’re basically pitting a dumb frequentist against a clever Bayesian. Obviously, if you stipulate that only Bayesians are allowed to take into account any kind of contextual information about a problem, then frequentism is going to get it wrong much of the time. But that seems like a rather odd view. Surely one doesn’t have to be a Bayesian to realize that it’s a bad idea to privilege a model that makes no logical sense over one that does, right? (Or, to put it differently, if you think that one does have to be a Bayesian for that, then I think you will be surprised at the number of people who are happy to be called Bayesians even though it apparently has no implications whatsoever for the way they do their analysis.)

        • Tal:

          I can't speak for Morey et al., but I think the key confusion here is that they are not saying that all confidence intervals are bad (certainly not that all frequentist methods are bad, given that any Bayesian procedure can be interpreted as "frequentist," as all that this means is that various theoretical properties of a method are evaluated). What they are saying is that "confidence intervals" do not represent a general principle of interval estimation. This is a point that may well be obvious to you but is not always clear in textbooks. The issue is not that they are picking a bad frequentist method; the issue is that they are pointing out that a procedure that is sometimes recommended as a good general principle actually can have some big problems. From the perspective of the user, what's important is to understand where these methods have such problems.

          I find this general approach to inquiry—take a generally-recommended principle and explore simple special cases where it fails miserably—to often be a helpful way to gain understanding. Indeed, I apply this approach myself in criticizing the noninformative Bayesian approach (which I, to my embarrassment, recommend in my textbooks) in the second-to-last paragraph of my above post.

        • I agree completely with the conclusion you ascribe to Morey et al, Andrew, but I think they clearly go well beyond that in the paper. For instance, even in the abstract, they claim that “CIs do not necessarily have any of these properties, and generally lead to incoherent inferences”. That’s a very strong claim, and doesn’t seem accurate to me. I think it would have been better to say that in most cases naive CIs will lead to coherent inferences, but that there is a non-negligible set of cases where one is liable to draw very wrong conclusions if one is not careful. Of course, this is true of just about any model, whether frequentist or Bayesian.

          I think the point you’re making could have been much more simply demonstrated in the paper without introducing Bayes at all–e.g., by comparing the naive frequentist CI to the “correct” one (which could just as easily be presented in its frequentist flavor). It strikes me as quite misleading for the authors to claim that Bayesian approaches solve this particular problem, when as others have noted above, one could have come to the right inference using a different frequentist CI (or, conversely, arrived at the wrong inference with a different Bayesian model).

        • Andrew, have you read the paper? They are actually saying that all CIs are bad, that they should be abandoned and that bayesian intervals should be used instead. Here is their conclusion:

          "We have suggested that confidence intervals do not support the inferences that their advocates believe they do. The problems with confidence intervals – particularly the fact that they can admit relevant subsets – shows a fatal flaw with their logic. They cannot be used to draw reasonable inferences. We recommend that their use be abandoned." (p.9)

          Is this what you subscribe to?

          “I find this general approach to inquiry—take a generally-recommended principle and explore simple special cases where it fails miserably—to often be a helpful way to gain understanding.”

          How about the following inquiry into the submarine problem:

          It is highly unlikely that the lost submersible has been carried 1000 miles away from the initial submersion point. Therefore a Gaussian prior with mean at the initial submersion point is a much better choice. Then we choose the likelihood to be a conjugate distribution (i.e. Gaussian), as is advocated in your textbook (BDA 2ed, chap 3.3). We add a prior for sigma and marginalize over it. Then we obtain a Student's t distribution for the posterior. From this posterior we obtain a 50% credible interval that is similar (up to the prior) to CP1 in Morey et al.

          Does this show that Bayesian credible intervals should be abandoned? Do you feel the need to rewrite BDA, such that it says in chapter 3.3 that a uniform likelihood should be used instead of a normal when estimating the position of submarines, whales and farting divers? Do you find this inquiry illuminating? I do not. I find it silly. CP1 is similarly silly, except it has a frequentist flavour.

          While you point out in your post that Bayesian intervals have their own problems, Morey et al. will have none of it. They don't find any fault with the Bayesian intervals – they certainly don't point out any problems in the paper. Instead we learn that "by adopting Bayesian inference, [researchers] will gain a way of making principled statements about precision and plausibility." (p.9) And you were complaining about the overconfident tone of statistics textbooks when they describe CIs…

        • Matus, Andrew did read the paper, and is correctly characterising our argument. You should read the rest of the paper, including the part where we discuss cases where a proper Bayesian interval will be numerically the same as the confidence interval. Anything but Andrew’s interpretation would be nonsensical given the rest of the paper, though the paper is a draft so we can make it clearer to avoid misunderstandings…

        • @RM: I did read an earlier version of the manuscript (linked in my blog post). I have spent more time on that manuscript than I’m willing to admit – double-checking and triple-checking the calculations. Now, you seem to say that your conclusion in the manuscript (which I cited, and TY points out another section) does not characterize the paper’s argument. I guess I will just wait until a coherent version of the paper gets published…

        • Frequentist theory is a principled theory of inference; it is not “whatever makes sense to you”. It has implications, which we are highlighting. In order to say that the frequentist in our example is “dumb” you have to call Welch and Neyman “dumb”. Either they are “dumb frequentists” or you misunderstand the theory.

          Frequentism, as a theory, cares about different things than you might expect. The implications of this may not make sense to you, but the answer is not to deny that those are in fact, implications. This stuff has been known for a long time, but has gotten little attention outside theoretical statistics. We aim to change that.

        • Richard:

          I almost agree with you here. But let me emphasize that one key aspect of frequentist theory is that it includes many different principles which can contradict each other. Frequentist principles include unbiasedness, efficiency, consistency, and coverage. Careful frequentists realize that no single principle can work, and they recognize that different principles are more or less relevant in different settings, even if this is not always clear in textbook presentations.

        • Well, I’m happy to take your word for this, but then the upshot is that you’re arguing against something that almost nobody actually cares about in practice. If you really believe that anyone who thinks it’s a good idea to take into account the length of the sub and the uniformity of the bubble distribution can’t possibly be a frequentist, then it seems to me that you’re rendering the term ‘frequentist’ largely useless in modern discourse. I think you’ll have a very hard time finding people to endorse the view you describe (i.e., that anyone who’s a frequentist must reject the obviously correct “frequentist” CI here in favor of suboptimal alternatives).

          If anything, I think you will probably make many people very happy, since as I understand it, you’re basically saying that anyone can be a Bayesian without ever formally integrating a prior–all one has to do is occasionally think about the plausibility of the models one is testing, and then it doesn’t really matter how those models are formalized beyond that.

          Or is there some third label you think we should apply to someone who has no problem using a confidence (and not credible) interval that doesn’t have the “preferred frequentist properties” even if it’s clearly the most sensible model?

        • Andrew: Yes, it is true that “frequentism” can be a bit of a grab bag (particularly if one counts Fisher as a frequentist, which is at least debatable). In the context of our paper, however, by “frequentism” we mean Neyman’s theory, since that is the theory that birthed CIs. Thankfully, that is much less of a grab bag since Neyman was so careful and principled (re: your “foxhole” comment, if anyone would use a counter-intuitive CI on the high seas, it would be him! Of course, Fisher would just say that Neyman didn’t get out on the high seas enough…)

          Tal: Bayesianism and frequentism aren’t exhaustive. “Fiducial” or “likelihoodist” may be the third label you’re looking for (depending).

        • From what I can get out of Welch (1939) in a quick read (I don’t have a lot of time today), CP2 would be preferred over “CP3” (which can be derived using purely frequentist methods) because, for any given “false” value (theta+delta), CP2 would be (very slightly) less likely to include this false value than CP3 would. Is that what you’re thinking of? Why are you assuming that this is the only valid “frequentist principle” by which one may choose a CP?

          What it comes down to is that your paper is arguing that frequentist confidence intervals are completely broken and should never be used, but the examples you use to support this don’t support it at all. If instead you were making the point that some of the recommended criteria used to choose a particular CP can sometimes ignore important information and lead to poor (or even “absurd”) CIs, or that one must be careful to consider assumptions and available information when interpreting CIs, then I don’t think most people would be complaining. (The same is true of Bayesian inference!) But nowhere _in your paper_ does it say why a frequentist statistician would necessarily come up with CP1 or CP2 but never CP3. Instead, it argues that because this particular Bayesian CredInt performs so much better than these two particular frequentist CIs, therefore CIs are all fundamentally broken and should never be used. This makes absolutely no sense.

          The problems your paper ascribes to CIs could all apply to Bayesian inference methods, if those methods were restricted to use (or throw away) the same information. (For example, if you build a Bayesian credibility interval based on a non-parametric probability model and use no information at all about the submarine, wouldn’t the result have all of the same problems you ascribe to CP1 throughout the paper?)

          The issues you address are real problems, obviously. But they are not problems with CIs or CPs, but with applying and interpreting statistical procedures without thinking.

        • Erf: It is similar to Simpson's paradox being used to draw attention to confounding, where the reversal is dramatic but the important point is just any confounding.

          In Morey's paper (which does a good job of making obscure technical material accessible) the important point is relevant subsets – the submarine example just makes the point dramatic. Relevant subsets are a very serious problem for frequentist methods; they stopped Fisher in his tracks, and, for instance, Mike Evans in his rejoinder to Mayo's recent likelihood principle paper raised them as a serious unresolved problem.

          Interestingly, George Casella, in his paper on relevant subsets, suggested they were not likely to be of much practical importance. I raised this with him in 2008, as I have found them very important in my practical work (and so has Stephen Senn), and he assured me he had changed his mind.

          Now formally, Bayes does avoid the problem – but only if one never questions the joint model, prior, or likelihood. See, for instance, Box's and Rubin's work on model checking and calibrating Bayes, or Steve MacEachern and John Lewis (2014) (with Yoonkyung Lee) on Bayesian restricted likelihood methods (not conditioning on all the data, for good reasons).

        • Oh, I agree that relevant subsets are a serious problem. (And thanks for the additional background!) What I’m having a problem with is the way the paper leaps from that to a blanket condemnation of CIs as a methodology. If the problem is fundamental to CIs in a way that Bayesian inference is somehow not, the paper does not illustrate or support this; the examples given ignore available frequentist solutions to the relevant subset problem and ignore possible Bayesian solutions that would have the same problems.

  13. RE Confidence intervals and inversion of testing procedures. Comments above: http://statmodeling.stat.columbia.edu/2014/12/11/fallacy-placing-confidence-confidence-intervals/#comment-202608 ran out of nesting-depth. Here’s my take on that:

    Andrew mentions an interval for logistic regression coefficients constructed as +- 2 standard errors from the point estimate. I haven’t looked at the math there, but my intuition is that this is an *approximate* interval made by assuming approximate normality of the coefficient’s sampling distribution.

    Andrew also says: “I don’t think inversion of tests is in general a good way to obtain such intervals”

    But, logically, and mathematically, every interval corresponds to a test: We can see this as follows:

    Suppose your hypothesis is that a parameter value is x, take a sample, form a confidence interval using a confidence procedure that has N% coverage, reject the hypothesis if x is not contained in this interval. This is always a valid hypothesis test, because it rejects the hypothesis incorrectly only N% of the time. (NOTE: all of this is under the assumption that the assumptions that go into the procedures hold exactly, ie, we’re dealing with a random number generator of known distributional form).

    I think the point of this is that Andrew is thinking of hypothesis tests as a bestiary of known procedures that you can look up in a book or on wikipedia or whatever, whereas logically, *anything* that incorrectly rejects a true hypothesis only N% of the time when you use it on data that meets the assumptions is in fact a valid test.
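
    (A quick simulation sketch of that duality, mine rather than anything from the thread: treat "reject H0: mu = mu0 whenever mu0 falls outside the 95% t interval" as a test, and check that it has roughly 5% type I error when the assumptions hold.)

    set.seed(1)
    mu0  <- 0
    reps <- 1e4
    rejections <- replicate(reps, {
      y  <- rnorm(20, mean = mu0, sd = 1)   # data generated with H0 true
      ci <- t.test(y)$conf.int              # the 95% confidence procedure
      mu0 < ci[1] || mu0 > ci[2]            # "reject" iff mu0 lies outside the CI
    })
    mean(rejections)                        # close to 0.05, as a valid test requires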

    • In some places above I should have said (100 - N)% rather than N%… deal with it ;-)

      This reminds me of an engineering problem. You place a fixed unit load at point X and you want to know the stress at point Y. S(X,Y) can be defined as a function of two variables. But… often we’re interested in say just maximum stresses for a given load. So you can do:

      S*(X) = S(X,Y*) where Y* is the location where S is maximized for the given X…

      or, you could say, what location can we put a given load so that we get the maximum stress at a given Y.

      L(Y) = X: S(X,Y) is maximized

      or, you could say, what is the maximum possible stress that can be at any given location caused by putting a known load at any unknown location:

      S*(Y) = S(X,Y) when X is at X* which maximizes S for the given Y…

      etc., etc. All of these things are different ways of looking at particular aspects of one multi-dimensional phenomenon. Testing and confidence intervals feel like they have this characteristic, except that instead of a fixed S function, people invent their own functions for each test / confidence interval procedure.

    • Dan:

      Yes, I’m more interested in confidence-intervals-as-they-are-used-and-interpreted-in-practice, I’m not so interested in inverting hypothesis tests. I guess that, to be precise, you could say that I’m talking about “interval estimation” and Larry is talking about “hypothesis inversion confidence intervals.” The trouble is that the phrase “confidence interval” has different meanings for different people in different contexts.

      This is one of the points I take from the Morey et al. paper: if all that people ever did with confidence intervals was to interpret them as inversions of hypothesis tests, things wouldn’t be so bad: in that case, “confidence intervals” would be a very narrow, specialized statistical technique only used in certain settings. But, in practice, confidence intervals are used as “interval estimation” in the more general sense to convey precision and uncertainty.

      And this use is encouraged in statistics texts: the idea that is sometimes presented is that if you get confidence intervals by inverting hypothesis tests, you get an interval that conveys precision and uncertainty. But that’s not the case. Sometimes these hypothesis-test-inversion confidence intervals give good interval estimates, sometimes they don’t.

      • P.S. to both Dan and Larry:

        I agree that Larry has a point that there’s an advantage to using precise terminology and going from there. Thus, “confidence interval” = “inversion of hypothesis test” and we’re done (but with lots of technical challenges with discrete data, nuisance parameters, and all the rest).

        My problem with this is that “confidence interval” also has this other usage, corresponding to “uncertainty interval” that has all these other implied properties.

        I take Larry’s point to be that these penumbral implications of the phrase “confidence intervals” are perhaps unfortunate but should be irrelevant to any technical discussion. But my point is that, ultimately, two important parts of any discussion of a statistical procedure are: (a) how the procedure is used, and (b) what people think it is doing. As Morey et al. discuss, people widely think that confidence intervals have all sorts of impossible properties.

      • “the idea that is sometimes presented is that if you get confidence intervals by inverting hypothesis tests, you get an interval that conveys precision and uncertainty”.

        I think this is true (that it is presented this way), and I think the reason is that in many cases confidence intervals look a lot like Bayesian HPD intervals, because the confidence procedure is mathematically identical to a Bayesian procedure with a flat prior. Although that often happens, it doesn’t have to be the case, and that is a pretty subtle point that was missed several times in the conversations above. One thing about the submarine example is that it uses a perfectly valid confidence procedure which nevertheless doesn’t correspond to anything like what a Bayesian procedure would look like, and hence the results are different. This point is a lot more general than just that example; I’m thinking of things like resampling/bootstrap, permutation tests, and so forth, which are advocated as frequentist non-parametric methods.

        One thing that a Bayesian procedure gives you is a guarantee that you’ll never have an interval that contains a logically impossible set of values (provided you construct your Bayesian interval with the logically impossible values excluded from your prior; sometimes it’s easier not to do that, and it doesn’t matter much in the end anyway). See the sketch at the end of this comment.

        The other thing that a Bayesian procedure gives you is flexibility in describing your data generating model, including things like serial correlations in data collection, multiple groupings, partially overlapping groupings, or whatever.
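
        A small illustration of the "no logically impossible values" point, using a binomial proportion as a stand-in (the numbers and the Wald-versus-Beta comparison are my own, not from the thread): the usual Wald interval can dip below zero, while a posterior interval under a uniform prior cannot.

          import numpy as np
          from scipy import stats

          y, n = 2, 50                          # 2 successes in 50 trials (illustrative)
          p_hat = y / n

          # Wald 95% interval: can extend below 0, an impossible value for a proportion
          se = np.sqrt(p_hat * (1 - p_hat) / n)
          wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

          # 95% central posterior interval under a uniform Beta(1, 1) prior: stays in [0, 1]
          post = stats.beta(1 + y, 1 + n - y)
          bayes = (post.ppf(0.025), post.ppf(0.975))

          print(wald)    # lower endpoint is negative
          print(bayes)   # both endpoints lie between 0 and 1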

  14. My take on the paper is that it’s vastly overreaching and tries to extract too much from a silly example. My suggestion, if the authors wish to make the point that Bayesian intervals are better than CIs, is to focus on some of the logical issues raised here, use better examples, and argue that yes, a reasonable researcher using CIs can avoid the problem by considering information outside the simple CI calculation, but that all of this information is what goes into being a Bayesian anyway, so why not be formally Bayesian?

    As it is, the paper points out two issues that could arise in relatively rare instances (and, when they are big issues, they’ll generally be pretty obvious), and then rails against the FCF, which is known to be false anyway. While the FCF may have proponents in their field, there are CI proponents who know the FCF is false, and it doesn’t dissuade them.

    I’d be much happier with the paper if they changed their use of the word “general” in many places to the word “sometimes”.

  15. Pingback: Somewhere else, part 192 | Freakonometrics

  16. I’m coming late to this discussion, and have only skimmed it, but I see I’m not alone in thinking the submarine example ludicrous as an argument for not using confidence intervals. It is worth noting that Fraser dealt with the Welch example in “Ancillaries and Conditional Inference,” Statistical Science, vol 19 no 2, 2004. The absolute distance between the two observations, call it R, provides an ancillary statistic that should be conditioned on. The b-level confidence interval conditioning on R is the midpoint of the observations plus or minus b(1-R)/2, which is large when the two observations are close together and shrinks toward zero as R approaches its maximum. This is what common sense would dictate, as would, I hope, any statistician. Fraser notes that alternative approaches to confidence intervals that seek some sort of optimum coverage have “unpleasant properties that hopefully would not easily be explained away to a client.” (A quick numerical check of the conditional interval follows the link below.)

    Link to the Fraser article: http://projecteuclid.org/download/pdfview_1/euclid.ss/1105714167
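
    A quick numerical check of the conditional interval quoted above, assuming the usual formulation of the example with two observations drawn from Uniform(theta - 1/2, theta + 1/2) (the particular numbers are illustrative): the interval "midpoint plus or minus b(1 - R)/2" covers theta with probability b, both overall and within any band of values of the ancillary R.

      import numpy as np

      rng = np.random.default_rng(2)
      theta, b = 0.0, 0.95                   # true value and confidence level (illustrative)
      n_sims = 200_000

      x1 = rng.uniform(theta - 0.5, theta + 0.5, n_sims)
      x2 = rng.uniform(theta - 0.5, theta + 0.5, n_sims)
      mid = (x1 + x2) / 2
      R = np.abs(x1 - x2)                    # the ancillary: distance between observations

      covered = np.abs(mid - theta) <= b * (1 - R) / 2

      print(covered.mean())                  # close to b = 0.95 overall...
      for lo_r, hi_r in [(0.0, 0.1), (0.45, 0.55), (0.9, 1.0)]:
          band = (R >= lo_r) & (R < hi_r)
          print((lo_r, hi_r), covered[band].mean())   # ...and within each band of R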

      • Spanos’s proposed truncation fix falls to a minor modification of the Uniform model. Suppose that the values of X are sampled from the Uniform and then observed with normally distributed error with tiny variance. Then the unconditional CI includes not impossible parameter values but very, very improbable ones (improbable in what sense? Bayesian, natch). But since the data do not strictly rule them out, Spanos’s set of possible parameter values A(x_0) is again the entire real line, and no strict truncation is justified.

    • Yes, the submarine example can be nicely dealt with (actually, when I was in Don Fraser’s class, Welch’s example was the canonical example), but there is as yet no general (satisfactory) theory that comes out of it.

      The challenge of what you should condition on in a given problem to get a unique reference set remains unanswered.

  17. Pingback: How to interpret confidence intervals? - Statistical Modeling, Causal Inference, and Social Science
