The challenge of what one should condition on in a given problem to obtain a unique reference set remains unanswered.

]]>Link to the Fraser article: http://projecteuclid.org/download/pdfview_1/euclid.ss/1105714167

]]>Not sure why so many writers seem unable to avoid giving this false sense of “Bayesians having the complete solution”.

]]>In Morey’s paper (which does a good job of making obscure technical material accessible), the important point is relevant subsets; the submarine example just makes the point dramatic. Relevant subsets are a very serious problem for frequentist methods: they stopped Fisher in his tracks, and Mike Evans, for instance, raised them as the serious unresolved problem in his rejoinder to Mayo in her recent likelihood principle paper.

Interestingly, George Casella, in his paper on relevant subsets, suggested they were not likely to be of much practical importance. I raised this with him in 2008, as I had found them very important in my practical work (as has Stephen Senn), and he assured me he had changed his mind.

Now formally, Bayes does avoid the problem – but only if one never questions the joint model, prior, or likelihood. For instance, there is Box’s and Rubin’s work on model checking and calibrating Bayes, or the Bayesian restricted-likelihood methods of John Lewis (2014, with Steve MacEachern and Yoonkyung Lee), which do not condition on all the data, for good reasons.

]]>What it comes down to is that your paper is arguing that frequentist confidence intervals are completely broken and should never be used, but the examples you use to support this don’t support it at all. If instead you were making the point that some of the recommended criteria used to choose a particular CP can sometimes ignore important information and lead to poor (or even “absurd”) CIs, or that one must be careful to consider assumptions and available information when interpreting CIs, then I don’t think most people would be complaining. (The same is true of Bayesian inference!) But nowhere _in your paper_ does it say why a frequentist statistician would necessarily come up with CP1 or CP2 but never CP3. Instead, it argues that because this particular Bayesian CredInt performs so much better than these two particular frequentist CIs, therefore CIs are all fundamentally broken and should never be used. This makes absolutely no sense.

The problems your paper ascribes to CIs could all apply to Bayesian inference methods, if those methods were restricted to use (or throw away) the same information. (For example, if you build a Bayesian credibility interval based on a non-parametric probability model and use no information at all about the submarine, wouldn’t the result have all of the same problems you ascribe to CP1 throughout the paper?)

The issues you address are real problems, obviously. But they are not problems with CIs or CPs, but with applying and interpreting statistical procedures without thinking.

]]>“We have suggested that confidence intervals do not support the inferences that their advocates believe they do. The problems with confidence intervals – particularly the fact that they can admit relevant subsets – shows a fatal flaw with their logic. They cannot be used to draw reasonable inferences. We recommend that their use be abandoned.” (p.9)

Is this what you subscribe to?

“I find this general approach to inquiry—take a generally-recommended principle and explore simple special cases where it fails miserably—to often be a helpful way to gain understanding.”

How about the following inquiry into the submarine problem:

It is highly unlikely that the lost submersible has been carried 1000 miles away from the point of initial submersion. Therefore a Gaussian prior with mean at the initial submersion point is a much better choice. Then we choose the likelihood to be a conjugate distribution (i.e., Gaussian), as is advocated in your textbook (BDA 2nd ed., chap. 3.3). We add a prior for sigma and marginalize over it. Then we obtain a Student’s t distribution for the posterior. From this posterior we obtain a 50% credible interval that is similar (up to the prior) to CP1 in Morey et al.
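For concreteness, here is a minimal sketch of that conjugate calculation, using the standard normal-inverse-chi-squared parametrization; the prior settings and the two “bubble” observations are made-up illustrative values, not anything from the paper:

```python
import numpy as np
from scipy import stats

def t_credible_interval(y, mu0, kappa0, nu0, sigma0_sq, level=0.5):
    """Central credible interval for theta under the conjugate normal
    model with a normal-inverse-chi^2 prior; sigma is marginalized out,
    giving a Student-t marginal posterior for theta."""
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    s_sq = y.var(ddof=1) if n > 1 else 0.0
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_n = nu0 + n
    # posterior scale combines prior scale, data variance, and the
    # prior-data disagreement term
    nu_sigma_n_sq = (nu0 * sigma0_sq + (n - 1) * s_sq
                     + kappa0 * n * (ybar - mu0) ** 2 / kappa_n)
    scale = np.sqrt(nu_sigma_n_sq / nu_n / kappa_n)
    half = stats.t.ppf(0.5 + level / 2, df=nu_n) * scale
    return mu_n - half, mu_n + half

# two "bubble" observations, prior centred at the submersion point 0
lo, hi = t_credible_interval([1.0, 1.5], mu0=0.0, kappa0=1.0,
                             nu0=1.0, sigma0_sq=1.0)
print(lo, hi)
```

The interval is symmetric about the posterior mean, which the prior pulls back toward the submersion point, exactly the behaviour the commenter is pointing to.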

Does this show that Bayesian CIs should be abandoned? Do you feel the need to rewrite BDA, so that it says in chapter 3.3 that a uniform likelihood should be used instead of a normal one when estimating the position of submarines, whales, and farting divers? Do you find this inquiry illuminating? I do not. I find it silly. CP1 is similarly silly, except it has a frequentist flavour.

While you point out in your post that Bayesian CIs have their own problems, Morey et al. will have none of it. They don’t find any fault with the Bayesian CIs – they certainly don’t point out any problems in the paper. Instead we learn that “by adopting Bayesian inference, [researchers] will gain a way of making principled statements about precision and plausibility.” (p.9) And you were complaining about the overconfident tone of statistics textbooks when they describe CIs…

]]>I just split what you call CI derivation into two parts: I. derivation of P(D|theta), and II. derivation of the CI from it. The former part is difficult for complex models, while the latter is trivial and exact once we have P(D|theta). I called the latter part CI derivation. My use of the term is narrower. I now realize this is confusing, since frequentists will work on both I. and II. when they try to improve the calibration of their CI procedure. However, I. is not only about long-run performance. Assumptions and background knowledge about the data-generating process enter here. I wanted to focus the discussion with RM on part I and put part II aside.

]]>Tal: Bayesianism and frequentism aren’t exhaustive. “Fiducial” or “likelihoodist” may be the third label you’re looking for (depending).

]]>This somehow reminds me of the foxhole fallacy.

]]>For simpler concepts & situations it may indeed work. But there’s no mistaking that the *real* physics is described by a precise math equation. And whatever translation into sentences you do is only an approximation and in some limiting case it always fails.

So also, for NHST, the biggest problems seem to arise when someone tries to translate a complex situation into “simple” sentences. That’s where the biggest errors & pitfalls lie.

]]>“It must be stressed, however, that having seen the value x, NP theory never permits one to conclude that the specific confidence interval formed covers the true value of θ with either (1 – alpha)100% probability or (1 – alpha)100% degree of confidence. Seidenfeld’s remark seems rooted in a (not uncommon) desire for NP confidence intervals to provide something which they cannot legitimately provide; namely, a measure of the degree of probability, belief, or support that an unknown parameter value lies in a specific interval. Following Savage (1962), the probability that a parameter lies in a specific interval may be referred to as a measure of final precision. While a measure of final precision may seem desirable, and while confidence levels are often (wrongly) interpreted as providing such a measure, no such interpretation is warranted. Admittedly, such a misinterpretation is encouraged by the word ‘confidence’.” (p 272)

Mayo here is accusing Seidenfeld of committing what we call the fundamental confidence fallacy. (In our article, we also make the point that the FCF is encouraged by the word ‘confidence’.) Mayo’s remarks about CIs being inappropriate for “final” precision can of course be extended to what we call the likelihood and precision fallacies, since those are “final” judgments as well, and CIs are just as inappropriate for them. So it seems that Mayo agrees with us on our three fallacies (unless I misunderstand her, or she has changed her mind).

In response to Seidenfeld’s counter-intuitive example that showed a similar disconnect between “confidence” and “final precision”, Mayo writes:

“To this the NP theorist could reply that he never intended for a confidence level to be interpreted as a measure of final precision; and that he never attempted to supply such a measure, believing, as he does, that such measures are illegitimate. It is not the fault of NP theory that by misinterpreting confidence levels an invalid measure of final precision results.” (p 273)

This echoes (or, rather, we echo this, since she was first…) our statement that “Confidence procedures were merely designed to allow the analyst to make certain kinds of dichotomous statements about whether an interval contains the true value, in such a way that the statements are true a fixed proportion of the time *on average* (Neyman, 1937). Expecting them to do anything else is expecting too much.” As Bayesians, we disagree with Mayo on the value of measures of final precision, of course, but we definitely agree on the problem of misinterpreting confidence intervals.

Finally, Mayo also writes:

“In my own estimation, the NP solution to the problem of inverse inference can provide an adequate inductive logic, and NP confidence intervals can be interpreted in a way which is both legitimate and useful for making inferences. But much remains to be done in setting out the logic of confidence interval estimation before this claim can be supported — a task which requires a separate paper.” (p 273)

Mayo, almost 50 years after the theory of CIs was laid out, is saying that “much remains to be done” before one can support the statement that NP confidence theory is “both legitimate and useful for making inferences.” I found this to be a fairly staggering statement from a frequentist, to be sure. We disagree with Mayo’s estimation that this can be done (and are unaware of any work in the 32 years since this paper that did it), but it would certainly be valuable work to try.

]]>If anything, I think you will probably make many people very happy, since as I understand it, you’re basically saying that anyone can be a Bayesian without ever formally integrating a prior–all one has to do is occasionally think about the plausibility of the models one is testing, and then it doesn’t really matter how those models are formalized beyond that.

Or is there some third label you think we should apply to someone who has no problem using a confidence (and not credible) interval that doesn’t have the “preferred frequentist properties” even if it’s clearly the most sensible model?

]]>I almost agree with you here. But let me emphasize that one key aspect of frequentist theory is that it includes many different principles which can contradict each other. Frequentist principles include unbiasedness, efficiency, consistency, and coverage. Careful frequentists realize that no single principle can work, and they recognize that different principles are more or less relevant in different settings, even if this is not always clear in textbook presentations.

]]>Frequentism, as a theory, cares about different things than you might expect. The implications of this may not make sense to you, but the answer is not to deny that those are, in fact, implications. This stuff has been known for a long time but has gotten little attention outside theoretical statistics. We aim to change that.

]]>I think the point you’re making could have been much more simply demonstrated in the paper without introducing Bayes at all–e.g., by comparing the naive frequentist CI to the “correct” one (which could just as easily be presented in its frequentist flavor). It strikes me as quite misleading for the authors to claim that Bayesian approaches solve this particular problem, when as others have noted above, one could have come to the right inference using a different frequentist CI (or, conversely, arrived at the wrong inference with a different Bayesian model).

]]>I can’t speak for Morey et al., but I think the key confusion here is that they are not saying that all confidence intervals are bad (certainly not that all frequentist methods are bad, given that any Bayesian procedure can be interpreted as “frequentist”, as all that this means is that various theoretical properties of a method are evaluated). What they are saying is that “confidence intervals” do not represent a general principle of interval estimation. This is a point that may well be obvious to you but is not always clear in textbooks. The issue is not that they are picking a bad frequentist method; the issue is that they are pointing out that a procedure that is sometimes recommended as a good general principle actually can have some big problems. From the perspective of the user, what’s important is to understand where these methods have such problems.

I find this general approach to inquiry—take a generally-recommended principle and explore simple special cases where it fails miserably—to often be a helpful way to gain understanding. Indeed, I apply this approach myself in criticizing the noninformative Bayesian approach (which I, to my embarrassment, recommend in my textbooks) in the second-to-last paragraph of my above post.

]]>Further, for your argument it is immaterial that statistician X derived a CI (for the submarine example) by assuming Y, because you are not arguing against X and his derivation but against the concept of the CI. If you want to pit the CI against the credible interval, you need to provide a CI derivation where your assumptions are close or identical to the assumptions made in the derivation of the credible interval. Your CP1 and CP2 fail to assume a uniform distribution for x, unlike your derivation of the Bayesian CI.

(Btw you won’t scare me by quoting dead people. I have no problem in stating that some of the work by Fisher, Neyman, Welch etc. was mistaken, wrong, confused, misguided, ignorant and naive. If they did make mistakes in derivation or did make implausible/unstated assumptions, too bad for them.)

]]>It seems you’re claiming that, as a frequentist, I’m obligated to choose the latter model because frequentists aren’t allowed to think about, or take into account, the likelihood of different data-generating processes. This hardly seems fair; as matus pointed out in his/her blog post, you’re basically pitting a dumb frequentist against a clever Bayesian. Obviously, if you stipulate that only Bayesians are allowed to take into account *any* kind of contextual information about a problem, then frequentism is going to get it wrong much of the time. But that seems like a rather odd view. Surely one doesn’t have to be a Bayesian to realize that it’s a bad idea to privilege a model that makes no logical sense over one that does, right? (Or, to put it differently, if you think that one *does* have to be a Bayesian for that, then I think you will be surprised at the number of people who are happy to be called Bayesians even though it apparently has no implications whatsoever for the way they do their analysis.)

No, you misunderstand. I’m not talking about a lack of a closed-form solution, I’m saying that there is no solution at all.

Here’s a simple example: you have data y_1,…,y_n from the logistic regression model, Pr(y_i=1) = invlogit(X_i*b), where X is an n*k matrix and X_i is its i-th row. Let’s say n=100 and k=10, just to be specific, and let’s also specify that X has no multicollinearity. Also, just to be specific, suppose we want a 95% confidence interval for b_1. There is no general procedure for defining such an interval, not in the sense that you mean, in which the interval is obtained by inverting a hypothesis test.

Nonetheless, practitioners want such intervals (in part, I’d argue, because they’ve been misled by the confident tones of statistics textbooks, but that’s another story). So procedures exist. But it’s not an exact science, it’s a bunch of rules and approximations.

That’s ok, not everything has to be an exact science. Even if something is an exact science, it depends on assumptions that we don’t in general believe (in your example, you want me to give you P_theta(D|theta) but in any real example I’ve ever seen, this probability distribution is only a convenient approximation).

So I don’t think it’s a devastating criticism on my part to say that hypothesis-test-inversion confidence intervals are not an exact science. They represent a statistical method that works in some important special cases, as well as a principle that can be applied with varying success in other situations. And that’s fine. The mistake has not been in people coming up with and using this method, the mistake is when people treat it as a universal principle for interval estimation.
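To make the “bunch of rules and approximations” concrete, here is a minimal sketch of the most common such procedure for the logistic-regression setting above: a Wald interval built from a Newton-Raphson fit. The simulated data, seed, and coefficient scale are all illustrative assumptions, not anyone’s real example:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson fit of a logistic regression (no intercept, purely
    for illustration); returns the MLE and the inverse observed
    information, the usual approximate covariance of the estimate."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))
        W = p * (1.0 - p)
        # observed information, with a tiny ridge for numerical safety
        H = X.T @ (W[:, None] * X) + 1e-9 * np.eye(X.shape[1])
        b = b + np.linalg.solve(H, X.T @ (y - p))
    return b, np.linalg.inv(H)

rng = np.random.default_rng(0)
n, k = 100, 10
X = rng.standard_normal((n, k))
b_true = 0.5 * rng.standard_normal(k)   # moderate illustrative effects
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ b_true))).astype(float)

b_hat, cov = fit_logistic(X, y)
half = 1.96 * np.sqrt(cov[0, 0])        # normal (Wald) approximation
ci_b1 = (b_hat[0] - half, b_hat[0] + half)
print(ci_b1)
```

Every step here is an approximation: the normal quantile 1.96, the curvature-based standard error, even the model itself, which is exactly the point being made about CIs in general.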

]]>I was reluctant to start pointing fingers at people making “outlandish claims”, and I debated whether to start with that now (as I don’t think it helps much for the overall discussion). But since I consider myself an observant person who likes basing her statements on some evidence, I want to combat the “straw man argument” accusation. Here is a great example of what I’m talking about: http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/

That person is clearly credible, clever, and has more than basic stats training. Still he claims things like (I just take some statements): “Frequentism [in the context of a CI – added] and Science do not mix”, or “The frequentist confidence interval […] is _usually_ [emphasis added] answering the wrong question”, or “[…] if you follow the frequentists [idea of a CI – added] in considering ‘data of this sort’, you are in danger of arriving at an answer that tells you _nothing meaningful_ [emphasis added] about the particular data you have measured.”, or “‘Given this observed data, I can put no constraint on the value of \(\theta\).’ If you’re interested in what your particular, observed data are telling you, frequentism [in the context of a CI – added] is _useless_.” Morey et al. also claim “confidence intervals do not have the properties that are often claimed on their behalf”, and the blog post here is entitled “fallacy of confidence in CI” (of both latter statements I think they were chosen for rhetorical rather than substantive reasons).

To wrap my position up. I’m convinced that

a) CIs can be useful procedures for scientific inference about plausible values of the unknown parameter

b) It is very important that we know when, why, and how they do and do not work well (just as with every other inferential procedure).

I think we (and most if not all others on this blog) will agree with a) and b). For those who do not agree with a) (like the blogger I cited here), I suggest the game above.

]]>(I wonder how much of this disagreement is due to a semantic or philosophical difference in the definition of “probability”. I assert that e.g. “in the absence of any other information, there is a 95% probability that the obtained confidence interval includes the population mean” is completely correct and consistent as a statement about frequentist probability.)

Related to this, you’re very concerned in your paper that there’s more than one “valid” CP, and each one gives a different CI; you seem to think this invalidates the statement that a given CI has a 50% chance of containing theta. But if you perform two Bayesian analyses using different priors and different probability models, obviously you’ll get different posteriors; wouldn’t that mean that you can’t interpret posteriors as likelihoods either?

Thank you for clarifying where CP1 and CP2 come from. This example still doesn’t show what you claim, though, because the biggest difference between the three methods (CP1, CP2, and CredInt) is not in whether they’re Bayesian or frequentist but in how much information each one uses.

The Bayesian credibility interval (CredInt) that you give uses the length of the submarine, the fact that the bubble distribution is uniform along that length, and the separation between the bubbles; basically all of the available information.

It’s easy to construct a frequentist confidence interval procedure that uses all this as well (and it basically follows the argument you describe in your supplement for CredInt): Let dx=|x1-x2| be the separation between the bubbles. The farthest the mean of this sample could be from theta is 5 - dx/2, so the sampling distribution of the mean is a uniform distribution centred at theta with width (10 - dx). Thus the 50% CI is x-bar +/- (10 - dx)/4. Which of course is the same as your CredInt, but arrived at using purely frequentist methods.
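A quick Monte Carlo sketch of the interval x-bar +/- (10 - dx)/4 (the central 50% of the uniform sampling distribution of width 10 - dx described above; theta and the seed are arbitrary illustrative choices) confirms the 50% long-run coverage:

```python
import numpy as np

# Submarine example: two bubbles uniform on (theta - 5, theta + 5);
# the interval is x-bar +/- (10 - dx)/4, with dx the bubble separation.
rng = np.random.default_rng(1)
theta = 42.0
trials = 200_000
y = rng.uniform(theta - 5, theta + 5, size=(trials, 2))
xbar = y.mean(axis=1)
dx = np.abs(y[:, 0] - y[:, 1])
half = (10 - dx) / 4
covered = np.abs(xbar - theta) <= half
print(covered.mean())  # long-run coverage, close to 0.5
```

Because the sample mean is uniform with width (10 - dx) conditional on dx, this procedure attains 50% coverage not just on average but within every subset defined by the separation, which is why it matches the credible interval.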

CP2 uses the submarine length and the uniform distribution information, but throws away the bubble separation, so obviously it’s going to perform more poorly. Could Bayesian methods do any better without using the bubble separation?

CP1 does use the bubble separation, but it doesn’t use any information about the submarine at all! If you don’t know whether the length of your submarine is 10mm, 10m, or 10km, it shouldn’t be a surprise that the CI you get from 2 data points is not that useful — and I don’t see how any method could give you a better idea of your measurement precision in that case! This seems to me to illustrate a serious problem with trying to use non-parametric methods with tiny sample sizes, but it doesn’t say anything about CIs in general. What would an _equivalent_ Bayesian credibility interval look like if the ONLY information it’s allowed to use is x1 and x2 (nothing about the bubble probability distribution, submarine size, etc)?

There are certainly many situations where Bayesian methods are the easiest and/or best ways to incorporate information. But I think the reason your submarine example strikes people as silly is because in this case there’s a very straightforward frequentist CP that you’re ignoring, which undermines your entire argument.

]]>You write, “Derivation of conf intervals is an exact science.” I’m not quite sure what is meant by “exact science” in this context but I don’t think your description is accurate. What, for example, is the exact science behind the derivation of confidence intervals for logistic regression coefficients?

The derivation of confidence intervals is an exact science in some simple examples but not in general.

]]>For many years, Moore and McCabe’s Introduction to the Practice of Statistics was pretty good, but it seems to have gone downhill since Moore retired and a third author was added.

]]>“In particular, [Fisher, as opposed to Neyman] would apply the fiducial argument, or rather would claim unique validity for its results, only in those cases for which the problem of estimation proper had been completely solved, i.e. either when there existed a statistic of the kind called sufficient, which in itself contained the whole of the information supplied by the data, or when, though there was no sufficient statistic, yet the whole of the information could be utilized in the form of ancillary information. Both these cases were fortunately of common occurrence, but the limitation seemed to be a necessary one, if they were to avoid drawing from the same body of data statements of fiducial probability which were in apparent contradiction.

“Dr. Neyman claimed to have generalized the argument of fiducial probability, and he had every reason to be proud of the line of argument he had developed for its perfect clarity. The generalization was a wide and very handsome one, but it had been erected at considerable expense, and it was perhaps as well to count the cost. The first item to which he would call attention was the loss of uniqueness in the result, and the consequent danger of apparently contradictory inferences.” (pp. 617-618)

Fisher also understood that there is not *one unique* way to build a confidence interval. [It is worth noting that in the submarine/uniform case, the Bayes/likelihood interval can be obtained by conditioning on the ancillary statistic, and hence Fisher would identify the objective Bayes interval as the unique fiducial interval. It is also worth noting that Fisher explicitly notes there that Neyman’s theory *does not require this*.]

]]>As it is, the paper points out two issues that could arise in rare instances (and if they’re big issues, they’ll generally be pretty obvious), and then rails against the FCF, which is known to be false anyway. While the FCF may have proponents in their field, there are CI proponents who know the FCF is false, and it doesn’t dissuade them.

I’d be much happier with the paper if they changed their use of the word “general” in many places to the word “sometimes”.

]]>Under CI theory, there is *not* one way to build a CI. What matters is coverage of the true value and exclusion of false values (in long-run terms). There are sometimes several ways to approach this, leading to the counterintuitive results we discuss.

You should read Neyman (1937) and Welch (1939) before you post again.

(I also noted with some amusement that you and Erf suggested that we didn’t use the “obviously correct” way of generating the interval, and then you two suggested two *different* intervals as obviously correct, both of which we mention in the paper.)

]]>What’s happening is the CIs are chasing after that 95% coverage. In order to get it in the long run, the normal, unexceptional CI procedure will generate absurd intervals for certain data points. The insane part of this is that, given the actual data, you know whether you’re in one of those bad cases. The Bayesian credibility intervals automatically take this into consideration, giving the optimal interval estimate for the data actually seen.

More to the point, if you’re allowed to vary the bet sizes in Momo’s bet above, a Bayesian can use this fact to bankrupt the Frequentist very quickly (unless of course the CI’s are always equal to the equivalent Bayesian Credibility Interval).
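Here is a hypothetical sketch of such a betting scheme, using the simple 50% interval [min(y), max(y)] from the submarine setup (my illustrative choice of procedure, not one from the paper): its unconditional coverage is exactly 50%, but conditional on a small bubble separation it usually misses, so an opponent who bets even money on “miss” within that relevant subset wins steadily.

```python
import numpy as np

# Submarine example: two bubbles uniform on (theta - 5, theta + 5).
# The interval [min(y), max(y)] covers theta iff the bubbles straddle it,
# which happens 50% of the time unconditionally -- but rarely when the
# separation is small. Betting against coverage on that subset profits.
rng = np.random.default_rng(2)
theta = 0.0
trials = 100_000
y = rng.uniform(theta - 5, theta + 5, size=(trials, 2))
lo, hi = y.min(axis=1), y.max(axis=1)
covered = (lo <= theta) & (theta <= hi)

bettable = (hi - lo) < 2                # relevant subset: narrow intervals
wins = (~covered[bettable]).sum()       # bettor wins when interval misses
losses = covered[bettable].sum()
profit = wins - losses                  # unit stakes, even odds
print(covered.mean(), profit)
```

The unconditional 50% coverage is intact, yet the bettor’s profit is systematically positive, which is exactly the relevant-subsets point.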

Chalk one up for Bayes.

]]>The backwards conclusion Morey et al. make at the end of their paper is, “we have shown that confidence intervals do not have the properties that are often claimed on their behalf”, and it’s not subtle at all. That’s not about the specific intervals they looked at. They’re making a sweeping generalization, using logic as bad as the FCF fallacy (in their terminology).

]]>If, across repeated uses of the procedure in a given situation, it doesn’t capture the true value 95% of the time, then by definition it isn’t a confidence procedure for that situation. So that’s a pretty specious argument. By that logic, the procedure for getting a proper credible interval can’t be misapplied either.

]]>