“The Internal and External Validity of the Regression Discontinuity Design: A Meta-Analysis of 15 Within-Study-Comparisons”

Jag Bhalla points to this post by Alex Tabarrok pointing to this paper, “The Internal and External Validity of the Regression Discontinuity Design: A Meta-Analysis of 15 Within-Study-Comparisons,” by Duncan Chaplin, Thomas Cook, Jelena Zurovac, Jared Coopersmith, Mariel Finucane, Lauren Vollmer, and Rebecca Morris, which reports that regression discontinuity (RD) estimation performed well in these 15 examples:

The RD bias is below 0.01 standard deviations on average, indicating RD’s high internal validity. When the study‐specific estimates are shrunken to capitalize on the information the other studies provide, all the RD causal estimates fall within 0.07 standard deviations of their RCT counterparts, now indicating high external validity. With unshrunken estimates, the mean RD bias is still essentially zero, but the distribution of RD bias estimates is less tight, especially with smaller samples and when parametric RD analyses are used.

Chaplin et al. are making two points:

1. The regression discontinuity estimates performed well, and this good performance could be checked by comparing them to the estimates from the randomized controlled trial in each case.

2. Bayesian multilevel modeling with partial pooling made things even better.
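
To give a sense of what that second point does, here’s a minimal sketch of the partial-pooling idea, using a simple normal-normal shrinkage model and made-up numbers (the authors’ actual Bayesian analysis is more involved):

# A sketch (not the authors' model) of partial pooling: each study's RD-minus-RCT
# difference is shrunken toward the overall mean, with noisier studies shrunken more.
# All numbers below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_theta = rng.normal(0.0, 0.05, size=15)   # hypothetical true per-study RD biases (sd units)
se = rng.uniform(0.03, 0.15, size=15)         # hypothetical standard errors
y = true_theta + rng.normal(0.0, se)          # observed RD-minus-RCT differences

# crude method-of-moments estimate of the between-study variance (tau^2);
# a full Bayesian fit (e.g., in Stan) would put a prior on tau instead
mu_hat = np.average(y, weights=1 / se**2)
tau2_hat = max(np.var(y, ddof=1) - np.mean(se**2), 0.0)

# normal-normal shrinkage: each estimate is pulled toward mu_hat in proportion
# to how noisy it is relative to the between-study variation
shrink = se**2 / (se**2 + tau2_hat)
y_shrunk = shrink * mu_hat + (1 - shrink) * y

print("raw estimates span:      %.3f to %.3f" % (y.min(), y.max()))
print("shrunken estimates span: %.3f to %.3f" % (y_shrunk.min(), y_shrunk.max()))

The shrunken estimates are less dispersed than the raw ones, which is the sense in which the shrunken RD-vs-RCT differences in the paper end up within a tighter band.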

I think of this paper as being similar to the classic Dehejia and Wahba paper on matching for observational studies. Dehejia and Wahba found that matching worked well, if it was done well, and they provided practical guidelines.

Similarly, in this new paper, Chaplin et al. found that regression discontinuity analysis performed well, in a set of examples where regression discontinuity analysis made sense.

I would’ve liked to have seen a scatterplot with 15 points, one for each study.
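
Something along these lines, with invented numbers standing in for the paper’s actual estimates:

# A sketch of the plot I have in mind: RD estimate vs. RCT estimate for the 15
# within-study comparisons, with a 45-degree line. The numbers are made up.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
rct = rng.normal(0.1, 0.2, size=15)          # hypothetical RCT effect estimates (sd units)
rd = rct + rng.normal(0.0, 0.07, size=15)    # hypothetical RD estimates scattered around them

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(rct, rd)
lims = [min(rct.min(), rd.min()) - 0.05, max(rct.max(), rd.max()) + 0.05]
ax.plot(lims, lims, linestyle="--", color="gray")   # points on the line: RD agrees with RCT
ax.set_xlabel("RCT estimate")
ax.set_ylabel("RD estimate")
ax.set_title("One point per study; distance from the line is the RD bias")
plt.show()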

Causal magic?

When pointing me to this paper, Bhalla expressed concern that “Economists think regression discontinuity can evade statistical limits and perform causal magic.”

I don’t know what “economists” think in general, but I agree with Bhalla that at times practitioners seem to treat regression discontinuity and other identification strategies as a way of extracting causal inferences without thinking seriously about the required assumptions.

I don’t think the Chaplin et al. paper promotes that kind of magical thinking, but I see how it could be naively misinterpreted as “Regression Discontinuity Works” (the title of Tabarrok’s post). I’d rephrase this as, “Regression discontinuity can work well” or “Regression discontinuity works well when used appropriately.”

Here’s another example. In the comment thread of Tabarrok’s post:

wiki April 2, 2018 at 3:07 pm: I’m a bit more skeptical of RD based on historical data. We can’t do time travel RCT and a lot depends on being able to identify all the possible confounds and correcting for selection bias. Lots of current work doesn’t even adjust for human capital, biology, personality, cultural predisposition, or genes and just waves its hands about this. But this is the persistence RD that is hot in the development literature.

Sam April 2, 2018 at 3:37 pm: The whole point of RD is that you don’t need to identify confounds.

The first commenter is, broadly, correct; the second commenter is too confident. Or, to put it another way, the second commenter is correct in the settings where all the assumptions hold, but regression discontinuity is often applied in settings where the assumptions are off, in which case RD is little more than a crude, thoughtless regression adjustment for observational data.
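
To make concrete what those assumptions buy you, here’s a minimal simulated sketch (nobody’s real analysis) of the textbook RD computation: fit the outcome locally on either side of the cutoff and take the jump at the cutoff. When the outcome really is a smooth function of the running variable apart from that jump, this recovers the effect; otherwise the same arithmetic is just another regression adjustment.

# Simulated example where the RD assumption holds: the outcome is smooth in the
# running variable, with a jump of 0.3 only at the cutoff. All numbers are made up.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(-1, 1, n)                  # running variable, cutoff at 0
treated = (x >= 0).astype(float)
y = 0.5 * x + 0.3 * treated + rng.normal(0, 0.2, n)   # true jump at the cutoff = 0.3

# local linear fits within a bandwidth on each side of the cutoff;
# the difference in intercepts at x = 0 is the RD estimate
h = 0.25
left = (x < 0) & (x > -h)
right = (x >= 0) & (x < h)
fit_left = np.polyfit(x[left], y[left], 1)
fit_right = np.polyfit(x[right], y[right], 1)
rd_estimate = np.polyval(fit_right, 0.0) - np.polyval(fit_left, 0.0)
print("estimated jump at the cutoff: %.3f" % rd_estimate)

# If units can sort themselves around the cutoff, or the regression function is
# misspecified, this same calculation still returns a number -- it just isn't
# the causal effect anymore.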

Just to be clear: the paper does represent good news for regression discontinuity analysis and hierarchical Bayesian modeling. I’m not trying to imply otherwise. I’m just clarifying that, like Dehejia and Wahba, Chaplin et al. are finding that their method works well in well-chosen settings. That’s not nothing; it’s good to know; it shouldn’t be taken to imply much in settings where the regression discontinuity assumptions or the fitted model don’t make sense. Of course. But I’d better say it here just so people don’t overinterpret.

Other issues

I don’t really agree, though, with this statement by the first author: “RD is generally acknowledged as the most rigorous non-experimental method for obtaining internally valid impact estimates.” The rigor of any statistical inference depends on some set of assumptions. No method is inherently more rigorous than another; it all depends on where and how the method’s applied. That said, I have little doubt that the regression discontinuity analyses in the above-linked paper are in settings where the assumptions are reasonable.

I also like it when he refers to “less than 1,100 observations” as a “small sample size.” Remember that regression discontinuity analysis we looked at with N=27?

4 thoughts on ““The Internal and External Validity of the Regression Discontinuity Design: A Meta-Analysis of 15 Within-Study-Comparisons””

1. > Or, to put it another way, the second commenter is correct in the settings where all the assumptions hold, but regression discontinuity is often applied in settings where the assumptions are off, in which case RD is little more than a crude, thoughtless regression adjustment for observational data.

    There’s an ongoing scandal in finance and accounting involving a paper that

    1) used RD in a setting where the assumptions didn’t hold

    2) then changed the description of the methodology in the paper so the assumptions seemingly hold

    3) while not bothering to change any of the estimates reported in the tables

    http://econjwatch.org/articles/will-the-real-specification-please-stand-up-a-comment-on-andrew-bird-and-stephen-karolyi

    The paper’s authors responded by vaguely blaming the copy editor (gremlins?) for changing their description of the methodology.

• Following this story, it certainly seems like they are guilty of something sketchy. They posted a 6-page response to SSRN that included incorrect code showing they didn’t know how to properly use the sort function in Stata, and they subsequently revised it to just 2 pages with the incorrect code removed. Also, it’s not strong evidence, but the sheer number of papers the two authors are currently working on makes it appear that they are, at the very least, fishing for results and writing them up as if they were hypothesis tests.

  2. The author you quoted isn’t saying RD is the most rigorous, but that it is generally regarded as the most rigorous.

That seems undoubtedly true. IES now treats it on par with randomized experiments as its “gold standard” for evaluation.

    • Demosthenes:

      The quote is not that RD is “generally considered as the most rigorous”; it’s that RD is “generally acknowledged as the most rigorous.”

“Acknowledged” implies that the belief is true, not merely that it is widely held.

      I think RD can be great but I don’t think it makes sense to talk about it as more or less rigorous than other methods out there. All these methods are rigorous if their assumptions are satisfied.
