It started with a project that Sharad Goel is doing, comparing decisions of judges in an urban court system. Sharad was talking with Avi Feller, Art Owen, and me about estimating the effect of a certain decision option that judges have, controlling for pre-treatment differences between defendants.
I’m interested in what that data shows about the relative skill of the numerous judges.
I expect some are stricter and some are more lenient. At any level of strictness it would be interesting to see whether any are especially good at their decision. This would be one way to estimate whether ‘u’ [the hypothetical unobserved pretreatment covariates] matters. Probably a lot of the judges think they’re pretty good. Based on similar ratings of surgeons, I’d expect a bunch that you can’t tell apart, a few that are quite a bit worse than most, and just maybe, a few that are quite a bit better than most.
The sanity check I would apply to models is to resample the judges.
I have not done much with propensity scoring. I’m intrigued by the thought that it is not properly Bayesian. My first thought is that there should be a way to reconcile these things.
I agree that this is an interesting question. We’ve started looking at this recently, but the complication is that judge assignment isn’t completely random, and it does appear that some judges do indeed see higher risk defendants.
I’m still trying to understand if propensity scores are ever really preferable to the straightforward outcome regression. Andrew, am I correct in thinking that you would say “no, just do the regression”? One thing I find attractive about propensity scores is that then I can look at the balance plots, which gives me some confidence that we can estimate the treatment probabilities reasonably well. And at that point, I feel like it’s natural to use the IPW estimator (or the doubly robust estimator). But perhaps I should just interpret the balance plots as evidence that the outcome regression is also ok?
Yes, if the variables that predict the decision are included in the regression model then I’d say, yes, you got ignorability so just fit the damn model. The propensity scores are relevant to robustness. And you can make the balance plots even without doing any weighting.
That makes sense. But now I’m wondering why more people don’t look at the balance plots to check robustness of the outcome regression. I feel like I only see balance plots when people are actually using the propensity scores for something, like matching or IPW. Perhaps this is a common thing to do, and I’ve just missed that literature…?
Now Avi weighs in. Of all of us, Avi’s the only expert on causal inference. Here he goes:
A few thoughts on propensity scores and all this.
First, the (now classic) Kang and Schafer paper on “de-mystifying double robustness” is here. The simulation studies from this paper sparked a robust debate (sorry for the pun) in the causal inference literature. But I think that the bottom line is that there’s no magic bullet here—re-weighting estimators do better in some cases and regression-type estimators do better in some cases (of course, you can think of regression as a type of re-weighting estimator). In practice, with large samples and so long as the estimated propensity scores aren’t “too extreme,” then regression, IPW, and AIPW (i.e., double robust) estimates should all be in the same ball park. Thus, it’s reassuring—if not surprising—that you find similar results with these three approaches.
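To make the "same ball park" point concrete, here is a minimal simulation sketch. It is not anything from the actual court analysis: the data-generating process, the true effect of 2, and the helper names `fit_logistic` and `three_estimates` are all assumptions made up for illustration. With a single observed confounder, moderate propensity scores, and a large sample, the three estimators should land close to one another.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25):
    """Logistic regression by Newton's method (X must include an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (t - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def three_estimates(x, t, y):
    """Return (regression, IPW, AIPW) estimates of the average treatment effect."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    # Estimated propensity score e(x) = Pr(T = 1 | x)
    e = 1.0 / (1.0 + np.exp(-(X @ fit_logistic(X, t))))

    # Outcome regression: y ~ 1 + t + x, effect is the coefficient on t
    D = np.column_stack([np.ones(n), t, x])
    coef = np.linalg.lstsq(D, y, rcond=None)[0]
    reg = coef[1]

    # Inverse probability weighting with normalized (Hajek) weights
    w1, w0 = t / e, (1.0 - t) / (1.0 - e)
    ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

    # Augmented IPW: regression predictions plus weighted residual corrections
    mu1 = coef[0] + coef[1] + coef[2] * x
    mu0 = coef[0] + coef[2] * x
    aipw = np.mean(mu1 - mu0 + t * (y - mu1) / e - (1.0 - t) * (y - mu0) / (1.0 - e))
    return reg, ipw, aipw

# Made-up data: one confounder x, assumed true treatment effect of 2
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-0.5 * x))).astype(float)
y = 1.0 + 2.0 * t + x + rng.normal(size=n)

reg, ipw, aipw = three_estimates(x, t, y)
print(f"regression: {reg:.3f}  IPW: {ipw:.3f}  AIPW: {aipw:.3f}")
```

Making the true propensity scores more extreme (say, replacing 0.5 with 3 in the simulation) is one way to see the estimators start to drift apart.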
For what it’s worth, Andy’s view of “just fit the damn model” is not the majority view among causal inference researchers. Personally, I prefer matching + regression (or post-stratifying on the propensity score + regression), which is generally in line with Don Rubin. The inimitable Jennifer Hill, for example, usually jumps straight to IPW (though you should confirm that with her). Guido Imbens has tried a bunch of things (he has a paper showing some good properties of double robust estimators in randomized trials, for example).
In general, I find “global” recommendations here misplaced, since it will depend a lot on your context and audience. And trust me that there are a lot of recommendations like that! Some people say you should never do matching, others say you should never do weighting; some say you should always be “doubly robust,” others say you should never be doubly robust; and so on…
As for balance checks: I agree that this is a terrific idea! You can check out the Imbens and Rubin textbook for a fairly in-depth discussion of some of the issues here. In your applied setting (and assuming that you’re still doing all three analyses), I like the idea of doing balance checks for the entire data set, for the re-weighted data set, and separately by stratum (i.e., deciles of the propensity score). You can get much fancier, but that seems like a sensible starting point.
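The three balance checks just described (whole data set, re-weighted data set, and by propensity-score decile) could be sketched as follows. This is an illustration under stated assumptions, not the procedure from the Imbens and Rubin textbook: the standardized mean difference is used as the balance measure, the overall pooled standard deviation is used as the scale everywhere, the function names are hypothetical, and the demo plugs in the true propensity score from a made-up simulation rather than an estimated one.

```python
import numpy as np

def pooled_sd(x, t):
    """Pooled standard deviation of x across the treated and control groups."""
    return np.sqrt((x[t == 1].var(ddof=1) + x[t == 0].var(ddof=1)) / 2.0)

def smd(x, t, w=None, scale=1.0):
    """(Weighted) treated-minus-control difference in means of x, divided by scale."""
    if w is None:
        w = np.ones_like(x)
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    return (m1 - m0) / scale

def balance_report(x, t, e, n_strata=10):
    """Balance of covariate x: raw, inverse-probability weighted, and by stratum of e."""
    s = pooled_sd(x, t)
    raw = smd(x, t, scale=s)

    # Inverse probability weights: 1/e for treated, 1/(1-e) for controls
    w = np.where(t == 1, 1.0 / e, 1.0 / (1.0 - e))
    weighted = smd(x, t, w=w, scale=s)

    # Strata defined by quantiles (deciles by default) of the propensity score
    edges = np.quantile(e, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.digitize(e, edges)
    by_stratum = []
    for k in range(n_strata):
        in_k = stratum == k
        if t[in_k].sum() == 0 or (1 - t[in_k]).sum() == 0:
            by_stratum.append(np.nan)  # no overlap in this stratum
        else:
            by_stratum.append(smd(x[in_k], t[in_k], scale=s))
    return raw, weighted, np.array(by_stratum)

# Made-up data; e is the TRUE propensity score here, purely for illustration
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-0.8 * x))
t = (rng.random(n) < e).astype(int)

raw, weighted, by_stratum = balance_report(x, t, e)
print(f"raw SMD: {raw:.3f}  weighted SMD: {weighted:.3f}")
print("SMD by decile:", np.round(by_stratum, 3))
```

In a real analysis the same report would be run for every pretreatment covariate, and a stratum returning `nan` is itself informative: it flags a region without overlap.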
“Double robustness” has never been mystifying to me, perhaps because it came up in our 1990 paper on estimating the incumbency advantage, where Gary and I thought hard about the different assumptions of the model, and about what assumptions were required for our method to work.
And now to get back to the discussion:
I’m curious what you guys think of entropy balancing. Reweight the data in order to attain balance of the covariates: http://arxiv.org/abs/1501.03571 by Zhao and Percival, following up on Hainmueller.
They use entropy. I’d have probably used an empirical likelihood or worked out a variance favorable criterion (possibly allowing negative weights).
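For readers who have not seen the idea, here is a minimal sketch of entropy-style balancing on covariate means only. It is an illustration of the general idea and not the procedure from the linked paper: the function names, the Newton solver, and the simulated data are all assumptions. The weights on the control units are positive, sum to one, reproduce the treated-group covariate means, and are otherwise as close to uniform as possible in the Kullback-Leibler sense.

```python
import numpy as np

def tilt(Z, lam):
    """Exponentially tilted weights w_i proportional to exp(lam . Z_i), normalized to sum to 1."""
    a = Z @ lam
    w = np.exp(a - a.max())  # subtract the max for numerical stability
    return w / w.sum()

def entropy_balance(X0, target, n_iter=50, tol=1e-10):
    """Weights on the rows of X0 whose weighted mean equals target.

    Solves the convex dual problem, minimizing log(sum(exp(Z @ lam))) with
    Z = X0 - target, by Newton's method. Assumes target lies inside the
    convex hull of the rows of X0; otherwise no such weights exist.
    """
    Z = X0 - target
    lam = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        w = tilt(Z, lam)
        grad = w @ Z  # weighted mean of Z; zero exactly when balance holds
        if np.max(np.abs(grad)) < tol:
            break
        hess = (Z * w[:, None]).T @ Z - np.outer(grad, grad)  # weighted covariance of Z
        lam = lam - np.linalg.solve(hess, grad)
    return tilt(Z, lam)

# Made-up data: treated units are shifted relative to controls
rng = np.random.default_rng(2)
X1 = rng.normal(loc=0.5, size=(500, 2))   # treated covariates
X0 = rng.normal(loc=0.0, size=(2000, 2))  # control covariates

w_ctrl = entropy_balance(X0, X1.mean(axis=0))
print("treated means:         ", np.round(X1.mean(axis=0), 4))
print("weighted control means:", np.round(w_ctrl @ X0, 4))
```

Because the weights are an exponential tilt they are always positive; a criterion that allows negative weights, as suggested above, would need a different objective.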
To me it seems like a bunch of hype. I’m fine with matching for the purpose of discarding data points that are not in the zone of overlap (as discussed in chapter 10 of my book with Jennifer) and I understand the rationale for poststratifying on propensity score (even though I’m a bit skeptical of that approach), but these fancy weighting schemes just seem bogus to me; I don’t see them doing anything for the real goal of estimating treatment effects.
That seems pretty harsh. Can you parse ‘hype’ and ‘bogus’ for me?
Hype might mean that their method is essentially the same as something older, and you think they’re just stepping in front of somebody else’s parade.
But bogus seems to indicate that the method will lead people to wrong conclusions, either wrong math (unlikely) or wrong connection to reality.
I will defer to Avi on the details but my general impression of these methods is that they are designed to get really good matching on the weighted means of the pretreatment variables. I really don’t see the point, though, as I see matching as a way to remove some data points that are outside the region of overlap.
To put it another way, I think of these weighting methods as optimizing the wrong thing.
The “hype” comes because I feel like “genetic matching,” “entropy balancing,” etc, are high-tech ways of solving a problem that doesn’t need to be solved. It seems like hype to me not because they’re ripping anyone off, but because it seems unnecessary. Kinda like sneakers with microchips that are supposed to tune the rubber to make you jump higher.
But, sure, that’s too strong a judgment. These methods aren’t useful to _me_, but they can be useful to many people. In particular, if for some reason a researcher has decided ahead of time to simply compare the two groups via weighted averages—that is, he or she has decided _not_ to run a regression controlling for the variables that went into the propensity score—then, sure, it makes sense to weight to get the best possible balance.
Since I’d run the regression anyway, I can’t do anything with the weights. Running weighted regression will just increase my standard errors and add noise to the system. Yes the weights can give some robustness but most of that robustness is coming from excluding points outside the region of overlap.
That said, regression can be a lot of work. Jennifer Hill has that paper where they used BART, and it was a lot of work. I’d typically just do the usual linear or logistic regression with main effects and interactions. So in practice I’m sure there are problems where weighting would outperform whatever I’d do. I’m just skeptical about big machinery going into weighting because, as Avi said, the big thing is the ignorability assumption anyway.
One quick plug for the upcoming Atlantic Causal Inference Conference in NYC: we’ll be hosting a short “debate” between Tyler VanderWeele and Mark VanderLaan (the “vanderBate”). Mark argues that we should really focus on the estimation method we use in these settings—double-robustness and machine learning-based approaches (like TMLE), he believes, are strongly preferable to parametric regression models. Tyler, by contrast, argues that what really matters in all of this is the ignorability assumption and that we should be focused much more on questions of sensitivity. As you might imagine, I’m very much on Tyler’s side here.
My takeaway is that in practice it makes sense to just try both approaches (“fit the damn model” + your favorite weighting scheme), and check that the answer doesn’t depend too much on the method. If it does, then I guess you’d have to think a bit more carefully about which method is preferred, but if it doesn’t then it’s just one less thing to worry about…
Is there any way to check if the ignorability assumption is reasonable? For the bail problem, do we just have to assert that it’s unlikely a judge can glean much useful information by staring into a defendant’s eyes, or is there a more compelling argument to make?
Ignorability is an assumption but it can be possible to quantify departures from ignorability. The idea is to make predictions of distributions under the model and have some continuous nonignorability parameter (that’s 0 if ignorable, + if selection bias in one way, – if selection bias in another way). Obv this 1-parameter model can’t capture all aspects of nonignorability but you might be able to have it capture the departures of particular concern. Anyway, once you have this, you can make inferences under different values of this parameter and you can assess whether the inferences make sense. In your example below, the idea would be to model how much information the judge could plausibly learn from the defendant’s eyes, over and above any info in the public record.
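One very simple way to sketch this kind of one-parameter analysis, under assumptions that are not from the discussion itself: let `delta` be the difference in mean untreated outcomes between treated and control defendants that remains after adjusting for the recorded covariates. Then `delta = 0` is ignorability and its sign gives the direction of the selection bias. Under this illustrative model the adjusted treated-versus-control contrast equals the effect on the treated plus `delta`, so one reports the implied effect over a grid of plausible values. The function name and all numbers below are hypothetical.

```python
def sensitivity_curve(adjusted_estimate, deltas):
    """Effect estimates implied by each assumed departure from ignorability.

    Assumed (illustrative) model:
        E[Y(0) | T=1, x] - E[Y(0) | T=0, x] = delta   for all x,
    so the adjusted contrast overstates the effect on the treated by delta.
    """
    return [(d, adjusted_estimate - d) for d in deltas]

# Hypothetical adjusted estimate and a grid of departures from ignorability
estimate = 0.08
deltas = [-0.10, -0.05, 0.0, 0.05, 0.10]

curve = sensitivity_curve(estimate, deltas)
for d, effect in curve:
    print(f"delta = {d:+.2f} -> implied effect = {effect:+.2f}")
```

The substantive work is in arguing which values of `delta` are plausible, for example by asking how much a judge could learn in the courtroom beyond the public record. The conclusion is sensitive if the implied effect changes sign within that range.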
There’s a massive literature on this sort of thing. Some immediate suggestions:
Seminal paper from Rosenbaum and Rubin here
Guido Imbens’ version here
Paul Rosenbaum’s textbooks (though these are all randomization based), here
Recent work from my long-time collaborator Peng Ding, here
Happy to suggest more. I’m not really doing justice to the biostats side.
Lots of good stuff here. Let me emphasize that the fact that I consider some methods to be “hype” should not be taken to imply that I think they should never be used. I say this for two reasons. First, a method can be hyped and still be good. Bayesian methods could be said to be hyped, after all! Second, I have a lot of respect for methods that people use on real problems. Even if I think a method is not optimal, it might be the best thing standing for various problems.