Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

Peter Bergman points me to this discussion from Cyrus of a presentation by Guido Imbens on design of randomized experiments.

Cyrus writes:

The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis—what Imbens referred to as “testing”—along with a (2) Neyman-type point estimate of the sample average treatment effect and confidence interval—what Imbens referred to as “estimation.” . . .

Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I [Cyrus] took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them.

I agree completely. This is something I’ve been saying for a long time—I actually became aware of the problem when working on my Ph.D. thesis, where I tried to fit a model that had been proposed in the literature, but it did not fit the data. Thus, the confidence interval that you would get by inverting the hypothesis test was empty. You might say that’s fine—the model didn’t fit, so the confidence interval was empty. But what would happen if the model just barely fit? Then you’d get a really tiny confidence interval. That can’t be right.

Here’s what was happening:

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test: for example, the z or t test or, more generally, a test for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny, precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously, has no precision at all. (With an empty interval, you’d say that the test rejects the model and thus you can say nothing based on the model. You wouldn’t just report your interval as, say, [3.184, 3.184] and conclude that your parameter is known exactly.)
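To see the discontinuity concretely, here’s a minimal sketch in Python, using a toy setup I’m making up for illustration (it’s not from Guido’s talk or from my thesis example): two independent measurements y1, y2 ~ N(theta, 1) of a common mean theta, with the test that rejects theta when (y1 - theta)^2 + (y2 - theta)^2 exceeds the 95th percentile of the chi-squared distribution with 2 degrees of freedom. As the two measurements drift apart, the inverted-test interval narrows and then abruptly disappears:

```python
# Toy example (an illustrative assumption, not from the post): two independent
# measurements y1, y2 ~ N(theta, 1) of a common mean theta. The test rejects theta
# when (y1 - theta)^2 + (y2 - theta)^2 > chi2_{2, 0.95}; inverting it gives the
# confidence set {theta : not rejected}.

import numpy as np
from scipy.stats import chi2

crit = chi2.ppf(0.95, df=2)  # about 5.99; exact 5% rejection rate when the model holds

def inverted_interval(y1, y2):
    """Set of theta not rejected, i.e. (y1-theta)^2 + (y2-theta)^2 <= crit."""
    center = (y1 + y2) / 2
    # identity: (y1-theta)^2 + (y2-theta)^2 = 2*(theta - center)^2 + (y1 - y2)^2 / 2
    slack = crit - (y1 - y2) ** 2 / 2
    if slack < 0:
        return None  # every theta is rejected: the "confidence interval" is empty
    half_width = np.sqrt(slack / 2)
    return center - half_width, center + half_width

# As the two measurements disagree more, the interval gets *narrower* and then
# vanishes entirely; there is no intermediate "wide, uncertain" stage.
for gap in [0.0, 2.0, 3.0, 3.4, 3.45, 3.5]:
    print(gap, inverted_interval(0.0, gap))
```

Running this, the half-width shrinks from about 1.7 when the measurements agree to about 0.14 at a gap of 3.45, and the interval is empty once the gap passes sqrt(2 * 5.99), roughly 3.46: precisely the jump from a spuriously sharp interval to no interval at all.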

The only thing I didn’t like about the above discussion (it’s not Cyrus’s fault; I think I have to blame it on Guido) is the emphasis on the Fisher-style permutation test. As I’ve written before (for example, see section 3.3 of this article from 2003), I like model checking but I think the so-called Fisher exact test almost never makes sense, as it’s a test of an uninteresting hypothesis of exactly zero effects (or, worse, effects that are nonzero but are identical across all units) under a replication that typically doesn’t correspond to the design of data collection. I’d rather just skip that Fisher and Neyman stuff and go straight to the modeling.

OK, I understand that Guido has to communicate with (methodologically) ultraconservative economists. Still, I’d prefer to see the modeling approach placed in the center, and then he can mention Fisher, Neyman, etc., for the old-school types who feel the need for those connections. I doubt I would disagree with anything Guido would do in a data analysis; it’s perhaps just a question of emphasis.

P.S. I realize from the comments that my above example isn’t clear enough. So here is some more detail:

The idea is that you’re fitting a family of distributions indexed by some parameter theta, and your test is a function T(theta,y) of parameter theta and data y such that, if the model is true, Pr(T(theta,y)=reject|theta) = 0.05 for all theta. The probability here comes from the distribution p(y|theta) in the model.
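In code, and sticking with the same toy test from the sketch above (again, my own illustrative assumption), the size condition can be checked by simulation: draw y from p(y|theta) and confirm that the rejection rate is 5% no matter what theta is.

```python
# Check the size condition Pr(T(theta, y) = reject | theta) = 0.05 for the toy test:
# y = (y1, y2) drawn from N(theta, 1), reject when sum((y - theta)^2) > chi2_{2, 0.95}.

import numpy as np
from scipy.stats import chi2

crit = chi2.ppf(0.95, df=2)
rng = np.random.default_rng(0)

def T_rejects(theta, y):
    # the test statistic has a chi-squared(2) distribution when the model is true
    return np.sum((y - theta) ** 2) > crit

for theta in [-3.0, 0.0, 7.5]:
    y_sims = rng.normal(theta, 1.0, size=(50_000, 2))  # draws from p(y | theta)
    rate = np.mean([T_rejects(theta, y) for y in y_sims])
    print(theta, round(rate, 3))  # close to 0.05 for every theta
```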

In addition, the test can be used to reject the entire family of distributions, given data y: if T(theta,y)=reject for all theta, then we can say that the test rejects the model.

This is all classical frequentist statistics.

Now, to get back to the graph above, the confidence interval given data y is defined as the set of values of theta for which T(theta,y) != reject. As noted above, when you can reject the model, the confidence interval is empty. That’s OK, since the model doesn’t fit the data anyway. The bad news is that when you’re close to being able to reject the model, the confidence interval is very small, hence implying precise inferences in the very situation where you’d really rather have less confidence!
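Here’s the same toy test one more time (still my illustrative assumption, not anything from Guido or Cyrus), inverted by brute force over a grid of theta values. The confidence set is {theta : T(theta,y) != reject}; rejecting the whole model corresponds exactly to that set coming out empty, and a data set that just barely escapes rejection produces an absurdly tight interval:

```python
# Invert the toy test over a grid: keep every theta the test does not reject.
# An empty set means T(theta, y) = reject for all theta, i.e. the model is rejected.

import numpy as np
from scipy.stats import chi2

crit = chi2.ppf(0.95, df=2)
theta_grid = np.linspace(-10, 10, 4001)  # grid resolution 0.005

def invert_test(y):
    kept = [th for th in theta_grid if np.sum((y - th) ** 2) <= crit]
    if not kept:
        return "model rejected: empty confidence set"
    return (round(min(kept), 3), round(max(kept), 3))

print(invert_test(np.array([0.0, 0.5])))   # comfortable fit: ordinary-looking interval
print(invert_test(np.array([0.0, 3.45])))  # barely fits: implausibly tight interval
print(invert_test(np.array([0.0, 4.0])))   # doesn't fit at all: empty set, model rejected
```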

This awkward story doesn’t always happen in classical confidence intervals, but it can happen. That’s why I say that inverting hypothesis tests is not a good general principle for obtaining interval estimates. You’re mixing up two ideas: inference within a model and checking the fit of a model.