It started with a project that Sharad Goel is doing, comparing decisions of judges in an urban court system. Sharad was talking with Avi Feller, Art Owen, and me about estimating the effect of a certain decision option that judges have, controlling for pre-treatment differences between defendants.

Art:

I’m interested in what that data shows about the relative skill of the numerous judges.

I expect some are stricter and some are more lenient. At any level of strictness it would be interesting to see whether any are especially good at their decision. This would be one way to estimate whether ‘u’ [the hypothetical unobserved pretreatment covariates] matters. Probably a lot of the judges think they’re pretty good. Based on similar ratings of surgeons, I’d expect a bunch that you can’t tell apart, a few that are quite a bit worse than most, and just maybe, a few that are quite a bit better than most.

The sanity check I would apply to models is to resample the judges.
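Resampling the judges can be read as a cluster bootstrap: draw whole judges with replacement, keep every case belonging to each drawn judge, and recompute the summary. Here is a minimal sketch under that reading. The data layout (a dict from judge id to a list of 0/1 outcomes), the function names, and the toy data are all hypothetical, and the statistic (spread in release rates across judges) is only a stand-in for whatever the real model reports.

```python
import random
import statistics

def spread_of_judge_rates(case_lists):
    """Standard deviation of per-judge mean outcomes (a stand-in statistic)."""
    rates = [sum(cases) / len(cases) for cases in case_lists]
    return statistics.pstdev(rates)

def bootstrap_over_judges(cases_by_judge, stat, n_boot=1000, seed=0):
    """Resample whole judges with replacement and recompute `stat` each time."""
    rng = random.Random(seed)
    judges = list(cases_by_judge.values())
    draws = []
    for _ in range(n_boot):
        resampled = rng.choices(judges, k=len(judges))
        draws.append(stat(resampled))
    draws.sort()
    # Rough percentile interval from the sorted bootstrap draws.
    lo = draws[int(0.025 * n_boot)]
    hi = draws[int(0.975 * n_boot) - 1]
    return lo, hi

# Hypothetical toy data: 20 judges, 200 cases each, 1 = defendant released.
rng = random.Random(1)
toy = {}
for j in range(20):
    p_release = rng.uniform(0.4, 0.8)  # assumed judge-specific leniency
    toy[j] = [1 if rng.random() < p_release else 0 for _ in range(200)]

low, high = bootstrap_over_judges(toy, spread_of_judge_rates)
print(f"Bootstrap interval for spread in release rates: ({low:.3f}, {high:.3f})")
```

If the conclusions move around a lot from one resampled set of judges to the next, that suggests they rest on a handful of judges.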

I have not done much with propensity scoring. I’m intrigued by the thought that it is not properly Bayesian. My first thought is that there should be a way to reconcile these things.

Sharad:

I agree that this is an interesting question. We’ve started looking at this recently, but the complication is that judge assignment isn’t completely random, and it does appear that some judges do indeed see higher risk defendants.

I’m still trying to understand if propensity scores are ever really preferable to the straightforward outcome regression. Andrew, am I correct in thinking that you would say “no, just do the regression”? One thing I find attractive about propensity scores is that then I can look at the balance plots, which gives me some confidence that we can estimate the treatment probabilities reasonably well. And at that point, I feel like it’s natural to use the IPW estimator (or the doubly robust estimator). But perhaps I should just interpret the balance plots as evidence that the outcome regression is also ok?

Andrew:

Yes, if the variables that predict the decision are included in the regression model then I’d say, yes, you got ignorability so just fit the damn model. The propensity scores are relevant to robustness. And you can make the balance plots even without doing any weighting.

Sharad:

That makes sense. But now I’m wondering why more people don’t look at the balance plots to check robustness of the outcome regression. I feel like I only see balance plots when people are actually using the propensity scores for something, like matching or IPW. Perhaps this is a common thing to do, and I’ve just missed that literature…?

Now Avi weighs in. Of all of us, Avi’s the only expert on causal inference. Here he goes:

A few thoughts on propensity scores and all this.

First, the (now classic) Kang and Schafer paper on “de-mystifying double robustness” is here. The simulation studies from this paper sparked a robust debate (sorry for the pun) in the causal inference literature. But I think that the bottom line is that there’s no magic bullet here—re-weighting estimators do better in some cases and regression-type estimators do better in some cases (of course, you can think of regression as a type of re-weighting estimator). In practice, with large samples and so long as the estimated propensity scores aren’t “too extreme,” then regression, IPW, and AIPW (i.e., double robust) estimates should all be in the same ball park. Thus, it’s reassuring—if not surprising—that you find similar results with these three approaches.
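To make the “same ball park” point concrete, here is a small simulated sketch comparing the three estimators on one data set. Everything in it is an assumption made for illustration: one covariate, a correctly specified linear outcome model, a logistic propensity model, a true effect of 2, and the helper names `fit_line` and `fit_logit`. It says nothing about how the methods behave under misspecification, which is where the Kang and Schafer debate lives.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_logit(xs, ts, n_iter=25):
    """Newton-Raphson for logit P(t=1|x) = a + b*x (one covariate)."""
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, t in zip(xs, ts):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += t - p
            g1 += (t - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a, b = a + (h11 * g0 - h01 * g1) / det, b + (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated data (hypothetical): treatment depends on x, true effect is 2.
rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
t = [1 if rng.random() < sigmoid(0.8 * xi) else 0 for xi in x]
y = [1.0 + 2.0 * ti + xi + rng.gauss(0, 1) for xi, ti in zip(x, t)]

# Outcome regression: fit y ~ x in each arm, average the predicted difference.
a1, b1 = fit_line([xi for xi, ti in zip(x, t) if ti], [yi for yi, ti in zip(y, t) if ti])
a0, b0 = fit_line([xi for xi, ti in zip(x, t) if not ti], [yi for yi, ti in zip(y, t) if not ti])
mu1 = [a1 + b1 * xi for xi in x]
mu0 = [a0 + b0 * xi for xi in x]
reg = sum(m1 - m0 for m1, m0 in zip(mu1, mu0)) / n

# Estimated propensity scores from the logistic fit.
a, b = fit_logit(x, t)
e = [sigmoid(a + b * xi) for xi in x]

# IPW with normalized weights.
w1 = [ti / ei for ti, ei in zip(t, e)]
w0 = [(1 - ti) / (1 - ei) for ti, ei in zip(t, e)]
ipw = (sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
       - sum(w * yi for w, yi in zip(w0, y)) / sum(w0))

# AIPW: regression prediction plus inverse-weighted residual corrections.
aipw = sum(
    m1 - m0 + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
    for m1, m0, ti, yi, ei in zip(mu1, mu0, t, y, e)
) / n

print(f"regression {reg:.2f}   IPW {ipw:.2f}   AIPW {aipw:.2f}")
```

With both models correctly specified and moderate propensity scores, all three should land near the true effect; the more informative comparison is rerunning this with a deliberately wrong outcome or propensity model.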

For what it’s worth, Andy’s view of “just fit the damn model” is not the majority view among causal inference researchers. Personally, I prefer matching + regression (or post-stratifying on the propensity score + regression), which is generally in line with Don Rubin. The inimitable Jennifer Hill, for example, usually jumps straight to IPW (though you should confirm that with her). Guido Imbens has tried a bunch of things (he has a paper showing some good properties of double robust estimators in randomized trials, for example).

In general, I find “global” recommendations here misplaced, since it will depend a lot on your context and audience. And trust me that there are a lot of recommendations like that! Some people say you should never do matching, others say you should never do weighting; some say you should always be “doubly robust,” others say you should never be doubly robust; and so on…

As for balance checks: I agree that this is a terrific idea! You can check out the Imbens and Rubin textbook for a fairly in-depth discussion of some of the issues here. In your applied setting (and assuming that you’re still doing all three analyses), I like the idea of doing balance checks for the entire data set, for the re-weighted data set, and separately by stratum (i.e., deciles of the propensity score). You can get much fancier, but that seems like a sensible starting point.
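The three balance checks described above (whole data set, re-weighted data set, and by propensity-score decile) might look something like the following sketch. It is a toy with a single simulated covariate, and it uses the true propensity score rather than an estimated one to keep the code short; both choices, and the helper name `mean_diff`, are assumptions made for illustration. Differences are scaled by the unweighted pooled standard deviation so the three checks are on the same scale.

```python
import math
import random
import statistics

def mean_diff(x, t, w):
    """Weighted treated-minus-control difference in means of x."""
    num1 = sum(xi * wi for xi, ti, wi in zip(x, t, w) if ti)
    den1 = sum(wi for ti, wi in zip(t, w) if ti)
    num0 = sum(xi * wi for xi, ti, wi in zip(x, t, w) if not ti)
    den0 = sum(wi for ti, wi in zip(t, w) if not ti)
    return num1 / den1 - num0 / den0

# Hypothetical data: one covariate that drives treatment assignment.
rng = random.Random(0)
n = 4000
x = [rng.gauss(0, 1) for _ in range(n)]
e = [1 / (1 + math.exp(-xi)) for xi in x]  # true propensity, assumed known here
t = [1 if rng.random() < ei else 0 for ei in e]

# Pooled (unweighted) standard deviation used to standardize every difference.
sd = math.sqrt((statistics.variance([xi for xi, ti in zip(x, t) if ti])
                + statistics.variance([xi for xi, ti in zip(x, t) if not ti])) / 2)

# 1. Entire data set.
raw = mean_diff(x, t, [1.0] * n) / sd

# 2. Re-weighted data set (inverse propensity weights).
ipw_w = [1 / ei if ti else 1 / (1 - ei) for ei, ti in zip(e, t)]
weighted = mean_diff(x, t, ipw_w) / sd

# 3. Separately by decile of the propensity score.
order = sorted(range(n), key=lambda i: e[i])
k = 10
by_stratum = []
for s in range(k):
    idx = order[s * n // k:(s + 1) * n // k]
    xs, ts = [x[i] for i in idx], [t[i] for i in idx]
    by_stratum.append(mean_diff(xs, ts, [1.0] * len(idx)) / sd)

print(f"raw: {raw:.2f}   weighted: {weighted:.2f}")
print("by decile:", " ".join(f"{d:.2f}" for d in by_stratum))
```

In real data you would repeat this for every covariate and plot the results, which is all a balance plot is; none of it requires committing to a weighted estimator.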

“Double robustness” has never been mystifying to me, perhaps because it came up in our 1990 paper on estimating the incumbency advantage, where Gary and I thought hard about the different assumptions of the model, and about what assumptions were required for our method to work.

And now to get back to the discussion:

Art:

I’m curious what you guys think of entropy balancing. Reweight the data in order to attain balance of the covariates: http://arxiv.org/abs/1501.03571 by Zhao and Percival, following up on Hainmueller.

They use entropy. I’d have probably used an empirical likelihood or worked out a variance favorable criterion (possibly allowing negative weights).

Me:

To me it seems like a bunch of hype. I’m fine with matching for the purpose of discarding data points that are not in the zone of overlap (as discussed in chapter 10 of my book with Jennifer) and I understand the rationale for poststratifying on propensity score (even though I’m a bit skeptical of that approach), but these fancy weighting schemes just seem bogus to me, I don’t see them doing anything for the real goal of estimating treatment effects.

Art:

That seems pretty harsh. Can you parse ‘hype’ and ‘bogus’ for me?

Hype might mean that their method is essentially the same as something older, and you think they’re just stepping in front of somebody else’s parade.

But bogus seems to indicate that the method will lead people to wrong conclusions, either wrong math (unlikely) or wrong connection to reality.

Me:

I will defer to Avi on the details but my general impression of these methods is that they are designed to get really good matching on the weighted means of the pretreatment variables. I really don’t see the point, though, as I see matching as a way to remove some data points that are outside the region of overlap.

To put it another way, I think of these weighting methods as optimizing the wrong thing.

The “hype” comes because I feel like “genetic matching,” “entropy balancing,” etc, are high-tech ways of solving a problem that doesn’t need to be solved. It seems like hype to me not because they’re ripping anyone off, but because it seems unnecessary. Kinda like sneakers with microchips that are supposed to tune the rubber to make you jump higher.

But, sure, that’s too strong a judgment. These methods aren’t useful to _me_, but they can be useful to many people. In particular, if for some reason a researcher has decided ahead of time to simply compare the two groups via weighted averages—that is, he or she has decided _not_ to run a regression controlling for the variables that went into the propensity score—then, sure, it makes sense to weight to get the best possible balance.

Since I’d run the regression anyway, I can’t do anything with the weights. Running weighted regression will just increase my standard errors and add noise to the system. Yes the weights can give some robustness but most of that robustness is coming from excluding points outside the region of overlap.
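One rough way to see how much information unequal weights can cost is Kish's effective sample size, (sum of weights)^2 / (sum of squared weights). This is only a back-of-the-envelope sketch: the approximation is for weighted means, it assumes equal-variance outcomes and weights unrelated to the outcome, and it is not a statement about weighted regression coefficients in general. The function name and the toy weights are hypothetical.

```python
def kish_effective_n(weights):
    """Kish's approximate effective sample size for a weighted mean."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

equal = [1.0] * 100                 # no weighting: all 100 points count fully
uneven = [1.0] * 90 + [10.0] * 10   # a few points carry large weights

print(kish_effective_n(equal))
print(round(kish_effective_n(uneven), 1))
```

In the second case ten heavily weighted points leave the 100 observations worth roughly 33 equally weighted ones, by this approximation.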

That said, regression can be a lot of work. Jennifer Hill has that paper where they used BART, and it was a lot of work. I’d typically just do the usual linear or logistic regression with main effects and interactions. So in practice I’m sure there are problems where weighting would outperform whatever I’d do. I’m just skeptical about big machinery going into weighting because, as Avi said, the big thing is the ignorability assumption anyway.

Avi:

One quick plug for the upcoming Atlantic Causal Inference Conference in NYC: we’ll be hosting a short “debate” between Tyler VanderWeele and Mark VanderLaan (the “vanderBate”). Mark argues that we should really focus on the estimation method we use in these settings—double-robustness and machine learning-based approaches (like TMLE), he believes, are strongly preferable to parametric regression models. Tyler, by contrast, argues that what really matters in all of this is the ignorability assumption and that we should be focused much more on questions of sensitivity. As you might imagine, I’m very much on Tyler’s side here.

Sharad:

My take away is that in practice it makes sense to just try both approaches (“fit the damn model” + your favorite weighting scheme), and check that the answer doesn’t depend too much on the method. If it does, then I guess you’d have to think a bit more carefully about which method is preferred, but if it doesn’t then it’s just one less thing to worry about….

Is there any way to check if the ignorability assumption is reasonable? For the bail problem, do we just have to assert that it’s unlikely a judge can glean much useful information by staring into a defendant’s eyes, or is there a more compelling argument to make?

Me:

Ignorability is an assumption but it can be possible to quantify departures from ignorability. The idea is to make predictions of distributions under the model and have some continuous nonignorability parameter (that’s 0 if ignorable, + if selection bias in one way, – if selection bias in another way). Obv this 1-parameter model can’t capture all aspects of nonignorability but you might be able to have it capture the departures of particular concern. Anyway, once you have this, you can make inferences under different values of this parameter and you can assess whether the inferences make sense. In your example below, the idea would be to model how much information the judge could plausibly learn from the defendant’s eyes, over and above any info in the public record.
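A very simple version of this idea, sketched under loudly stated assumptions: suppose that, given the covariates, treated units would have had outcomes shifted by a constant gamma relative to comparable controls had they not been treated, that is, E[Y(0) | T=1, x] = E[Y(0) | T=0, x] + gamma. Then gamma = 0 is ignorability, and the sign of gamma gives the direction of the selection bias. This particular parameterization, the function names, and the simulated data are all hypothetical choices made for illustration; they are not the specific models in the papers mentioned in this thread.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def att_under_gamma(x, t, y, gamma):
    """Effect on the treated, assuming E[Y(0)|T=1,x] = E[Y(0)|T=0,x] + gamma."""
    a0, b0 = fit_line([xi for xi, ti in zip(x, t) if not ti],
                      [yi for yi, ti in zip(y, t) if not ti])
    # Predicted untreated outcome for each treated unit, shifted by gamma.
    diffs = [yi - (a0 + b0 * xi + gamma) for xi, ti, yi in zip(x, t, y) if ti]
    return sum(diffs) / len(diffs)

# Hypothetical toy data with a true effect of 1 and ignorable assignment.
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
t = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
y = [xi + 1.0 * ti + rng.gauss(0, 1) for xi, ti in zip(x, t)]

for gamma in (-1.0, -0.5, 0.0, 0.5, 1.0):
    est = att_under_gamma(x, t, y, gamma)
    print(f"gamma = {gamma:+.1f}   estimated effect on treated = {est:.2f}")
```

In this parameterization the estimate simply shifts by gamma, so the substantive question becomes how large a gamma is plausible: how much could a judge learn from the defendant's eyes, in units of the outcome? Richer versions let the departure vary with covariates or enter through the assignment model.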

Avi:

There’s a massive literature on this sort of thing. Some immediate suggestions:

Seminal paper from Rosenbaum and Rubin here

Guido Imbens’ version here

Paul Rosenbaum’s textbooks (though these are all randomization based), here

Recent work from my long-time collaborator Peng Ding, here

Happy to suggest more. I’m not really doing justice to the biostats side.

Lots of good stuff here. Let me emphasize that the fact that I consider some methods to be “hype” should *not* be taken to imply that I think they should never be used. I say this for two reasons. First, a method can be hyped and still be good. Bayesian methods could be said to be hyped, after all! Second, I have a lot of respect for methods that people use on real problems. Even if I think a method is not optimal, it might be the best thing standing for various problems.

Isn’t a standard justification for using propensity scores that it is a straightforward way to reach the semiparametric efficiency bound in estimating the ATE or ATT (as in http://onlinelibrary.wiley.com/doi/10.1111/1468-0262.00442/abstract)? And the treatment is conveniently binary with p not close to 0 or 1.

Dean: I think that’s a common justification, but it’s unclear how often those who make such an appeal actually use nonparametrically-estimated inverse weights. My guess is most people just use a plain vanilla logit.

“We show that weighting by the inverse of a nonparametric estimate of the propensity score, rather than the true propensity score, leads to an efficient estimate of the average treatment effect,” Hirano, Imbens, & Ridder (2003).

Andrew: “Running weighted regression will just increase my standard errors and add noise to the system.”

This is a common intuition, but I don’t think this is true in general.

It’s true unless you have heteroscedasticity and a constant treatment effects model. Otherwise weighted regression can simply increase your standard errors or increase bias. There’s an interesting Hausman test for OLS versus GLS which is a good example to think about weighted regression.

You’re talking about the case where the regression’s estimand is the quantity of interest, no? This won’t be the case when that model is misspecified (such as because of heterogeneous effects).

I don’t even begin to understand how weighting and propensity come into questions relevant to *individual judges’* skill at, say, estimating whether a given defendant is going to behave (and hence what the bail should be).

Weighting and propensity etc seem to me to be relevant to questions like medical drug treatments. For example:

People come to a dermatology clinic for Eczema. The doctor gives them a list of possible creams and suggests they choose one and try it, and then come back later to assess their skin.

We have a before assessment, a cream choice, and an after assessment.

However, the cream choice is not random. Looking at the distribution of (after-before) under the choice of cream doesn’t tell us about the causal effect of the cream, it tells us about the causal effect of the *cream choice* in the context of that individual’s situation (income, cause of eczema, sex, preference for perfumes or dyes, preference for certain kinds of ingredients, diet, etc).

We aren’t interested in the individual effect on the person (which is measured, with error) we’re interested in the future effect of recommending cream X on the next patient to come to the clinic (considered as a random sample of eczema patients living in the local region).

Weighting and propensity matching etc can help us say “for people like Foo, cream Bar seems to be best” by grouping people together who are similar, or “for our full population of patients in this area, cream Baz seems to be best” by weighting the observed results according to the local population characteristics.

But, how is any of that relevant to “for this individual judge, how well do they do at guessing how defendants will behave once on bail?”

I can imagine you might want to do inference along the lines of: “For all the people who are qualified to become judges in an area, how well would a random one perform once assigned to the bench?” But this is problematic because we have to assume that current judges have learned through time, so the performance of a new judge wouldn’t be necessarily similar.

It seems like if we want to know about the known fixed relatively small population of sitting judges, we should first do inference on each one, and then see what the population of inferences looks like.

Yeah – I feel like this Judges example is a classic case of us being interested in Unobservable characteristics of judges (their intuition, their compassion, their underlying willingness/desire to lock people in cages for much of their lives), but pretending we can capture those using Observable characteristics (such as age, education, race, tenure; summarized as a propensity score). Then again, it is a bit hard for me to understand the context, since I don’t understand how some judges could fall outside the “support” set of observable characteristics… do we hire judges without a high school degree?

Also: It is not at all clear to me that the idea of “quality of judgement on probability of re-offending conditional on how much you like to lock people up” makes a lot of sense. For starters, if a judge is balancing public safety and individual liberty, then judges are likely trading off severity of punishment and probability of re-offending. But it isn’t like the “strictness” is a fixed attribute, it is determined jointly with “ability to discriminate and effort to determine likelihood of re-offending”. And it also isn’t clear to me why the question “which of the meanest judges out there are the best at identifying re-offenders” is all that interesting. Wouldn’t it be more interesting to know if judges with higher/lower propensity for locking people up for long periods of time were better/worse at predicting recidivism?

There are also two-parameter (confounder relationship to treatment; confounder relationship to outcome) extensions to the Rosenbaum-type sensitivity analyses that can be sometimes useful, notably Gastwirth, Krieger, and Rosenbaum (1998) and Small, Gastwirth, Krieger, and Rosenbaum (2009)

http://dx.doi.org/10.1093/biomet/85.4.907

http://dx.doi.org/10.4310/SII.2009.v2.n2.a10

I am taking the challenge here to be getting equal cases/defendants on which to compare judges who are taken to be different from each other – how does one discern those differences on possibly different cases/defendants?

But in general, when I was in Oxford (2002/3) I invited Paul Rosenbaum to dinner at my college and asked him what his original motivation was for developing propensity scores. His answer was something like “I wanted something straightforward and direct that anyone would easily understand”.

Perhaps making the same mistake, I tried to write an expository note on the various strategies initially covered by Don Rubin in the 1970’s – matching, regression and covariate balancing (pre propensity scores) and mixtures of these.

Never finished that as I left the group that had asked for it, but this was my non-tough guy take on propensity scores “The propensity score technology was a useful ladder to have climbed up even though now it can be safely kicked aside because it both made it transparent just how unbalanced covariate distributions were between comparison groups to start with and allowed a subsequent check on whether the matching (or stratification) made them balanced enough to validly proceed with making any comparisons. This soon became an iterative search for matching that leads to better balance of covariate distributions and perhaps when possible better matching of covariate values themselves.”

What I find more interesting here is the “vanderBate” involving the two solitudes of statistics – Tyler being concerned with critically assessing and getting representations of reality less wrong (ignorability assumptions) and Mark being concerned with what good properties techniques should have in this setting and trying to select techniques that optimize those properties (focus on the estimation method we use). Or maybe I am just seeing that everywhere these days.

Presumably we make our assessments conditional on the information available to the judge, that is, the ExternalFactors in my above post.

I think your point is that some judges may see mostly one little corner of the ExternalFactors spectrum, whereas other judges see a different corner. To me, that’s an argument for a realistic model of “extrapolation error” in the f(a,b,c,d,E) function. In other words, make the uncertainty in f be related continuously to the different regions of E space. If that makes it so we can’t compare judge A who sees lots of traffic court cases to judge B who sees largely violent drug offenders… then so be it.

This is an excellent post. I’m always glad that you share exchanges like these in the blog!

Regarding covariate balance, this summarizes a lot of points I’ve been thinking about recently. It sounds like if you’re going to be Bayesian (or more generally, if you follow the idea of positing a single model and then performing inference), you should do everything jointly. Sure, covariate balance is an essential property to have, but it sounds more like these techniques should go into how you posit your model. (For example, you might have a latent variable which represents a data point’s covariate as part of a “balanced” latent space; the potential outcomes then depend on this latent variable rather than the covariates directly.)