A colleague pointed me to a recent paper, “Does Regression Produce Representative Estimates of Causal Effects?” by Peter Aronow and Cyrus Samii, which begins:

With an unrepresentative sample, the estimate of a causal effect may fail to characterize how effects operate in the population of interest. What is less well understood is that conventional estimation practices for observational studies may produce the same problem even with a representative sample.

Linear regression, controlling for pre-treatment variables, is the standard method for causal inference in experiments and observational studies. The idea is that the regression on background variables serves to adjust for differences between the treatment and control group so that comparable groups are effectively being compared in the causal analysis.

Aronow and Samii’s point is that, when the treatment effect varies (that is, when the treatment is more effective for some people than for others, which of course is the case in general), the estimate from a regression, even if it controls for the right variables, will not in general give an easily interpretable estimate of an average treatment effect.

They’re right; once the treatment effect can vary, the linear model is no longer correct, and so estimates from linear regression will not generally have any clean interpretation.

This is related to my 2004 article on varying treatment effects. (That paper appeared in an obscure volume so I’m not criticizing Aronow and Samii for not knowing about it.)

In short, yes, when you have varying treatment effects the additivity assumption of regression no longer holds. I’m not quite sure why they keep referring to “multiple regression” but maybe that is just their way of emphasizing that the problem arises in general, not just when there’s only one pre-treatment predictor.

I don’t know that I really see the point of the weighting scheme of Aronow and Samii: ultimately I think the right way to go is to just fit the more general model that allows treatment effect variation, as we do for example in this paper from 2008. Now that we have Stan, it’s much easier to fit this sort of model that has multiple variance components.

That said, the statistical problem is not easy, as the variance of the treatment effects is essentially nonidentified without strong assumptions about the structure of the problem. (In my 2004 paper I talk about additive, subtractive, and replacement models for treatment effects, but these are just three special cases in a continuous space of possible structures.) So: prior information is needed.

One thing that I think would help—it comes up in our 2004 and 2008 papers—is a model that explicitly allows treatment to have a different structure than control, that is, does *not* consider the two treatments symmetrically.

In any case, let me emphasize that (a) I agree with the authors that varying treatment effects are important, and (b) I am not particularly interested in the various definitions of “average treatment effect” that float around in the causal inference literature. I understand why Rubin and others like these expressions, as they make precise what people are estimating (or trying to estimate) with these models. But, ultimately, I’d like to model treatment effect variation directly.

Also, I hope they clean up the figures before publication. Their world map features a massive Greenland and Antarctica. Whassup with that? There are better map projections out there than a straight lat-long grid. And the tables: average age “47.58”—c’mon!

And one other, minor thing: on p.2, they write that people “praise matching methods as an alternative to regression.” I know that some people think this, but it would be a good idea to make clear that matching is not, in general, an *alternative* to regression (see here and here). You do matching and then you can run regression on your matched set. Matching deals with lack of overlap, regression deals with imbalance.

Anyway, I think this is a valuable paper in that they’re drawing people’s attention to a real problem. There are various ways to attack the problem and I can’t say I’ve come up with any great solutions myself, so I’m glad to see other people taking a crack at it.

**Experiments and observational studies**

I showed the above to my colleague, who added:

Keep in mind the point of their paper is, no longer can we prioritize obs/regression over experiment for reasons of external validity, which they see as devastating given that they allow for almost no other benefits to obs work, if any.

I’ve written about experiments and observational studies before:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data. . . .

So, in response to my colleague, let me just say that the main advantage of observational studies is obvious: it’s that observational data are already available, or are relatively cheap to gather, whereas experiments take effort and in many settings can’t easily be done at all.

“Matching deals with lack of overlap, regression deals with imbalance.”

That statement seems to imply that you believe matching does not deal with imbalance. Rosenbaum talks at great length about how to achieve balance through matching; in fact, he views balance as the primary goal of matching. Do you have a different definition of “balance” than Rosenbaum?

I cannot overstate the virtue of approaching, even at great distance and by indirect means, the impossible.

I think this paper is mostly in reply to Deaton and others who claim large panel regressions help deal with external validity. I sympathize a lot with the pushback in this manuscript.

What confuses me about this paper, however, is that, in my understanding, it is not about varying treatment effects, or external validity, but about a mismatch between estimator and estimand.

The complaint — as I understand it — is that the OLS estimator does not estimate the SATE estimand whenever treatment effects vary (at which point the representativeness of the sample is moot). However, to me that is not a problem with regression. It is a problem of choosing the wrong estimator for the estimand of interest. An estimator that uses one weighting scheme is unlikely to converge to an estimand that uses a different weighting scheme.

I can see how there is much value in highlighting that these two sets of weights are different. But you can’t blame the mousetrap for failing to catch butterflies.

Sure. But no excuse for that distorted world map!

Which distortion do you prefer, Andrew?

For a somewhat lighthearted take on the issue see:

“Mercator versus Peters projection on West Wing – Cartographers for Social Equality”

https://www.youtube.com/watch?v=LA0BLrLW0PE

Daniel:

Every map projection is distorted but some projections, like the one shown in the above post, are essentially dominated by others. For the purpose of the article in question, there are lots of options that dominate the massive-Antarctica version shown above. Mercator of course is not a serious candidate here, as the purpose of the map is to display countries, not to plot the trajectories of a ship.

Andrew:

Can you name a projection that you’d use for this application? Is equal-area the property that you want?

Rahul:

It doesn’t have to be equal-area; it just shouldn’t be ridiculous. Plotting lat vs. long is ridiculous: it distorts areas, shapes, just about everything. Just removing Antarctica from the map would help a lot.

There are lots of better alternatives. Just for example I googled *National Geographic world map* and found this.

Or something really fancy ;-)

http://en.wikipedia.org/wiki/Peirce_quincuncial_projection

Aiken and West’s 1991 book “Multiple Regression” addresses questions related to setting up and interpreting complex, curvilinear, scale-type mixtures of hypotheses regarding interactions, and so on. In particular, for models with interaction(s), the model origin and intercept are at zero, not at the average, as is the case for models without an interaction. For continuous variables, their recommendation is to center the variables before taking the cross-product. Centering also has the nice property of removing “ill conditioning” in models with polynomials that would otherwise be hugely collinear.

They make a strong, well-documented case. I can’t tell from this discussion if they’ve been contradicted. Is this advice in error?

Centering doesn’t change anything substantially about the model. The coefficients are more easily interpretable but they are still just particular effects (with uncertainty) conditional on the other covariates being close to their average value. You can always transform and calculate the uncertainty for the whole range and just display it as a graph. In a model with an interaction *any* inference based just on the marginal effect and uncertainty for one particular specification is misleading.

The problems discussed here are more general, though.

See Kam and Franzese (2007), e.g. in the introduction: “[centering] alters nothing important statistically and nothing at all substantively” (p. 3), or just the discussion of multicollinearity in Brambor et al. (2006). There are probably far more exhaustive discussions, but, as mentioned, the point is that you’re seldom interested in just the marginal effect of X and its uncertainty at one particular value of the other interacting variable Z (e.g., 0 or mean(Z)) but over the whole range of Z.

Brambor, Thomas, William Roberts Clark, and Matt Golder. “Understanding Interaction Models: Improving Empirical Analyses.” Political Analysis 14, no. 1 (2006): 63–82.

Kam, Cindy, and Robert Franzese. Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor: University of Michigan Press, 2007.
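To make the “whole range of Z” point concrete, here is a minimal simulated sketch (data and coefficient values invented for illustration) that fits y ~ x + z + x·z by OLS and reports the marginal effect of x, with its standard error, at several values of z:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1 + 0.5 * x + 0.3 * z + 0.8 * x * z + rng.normal(size=n)

# OLS fit of y on 1, x, z, and the x:z interaction.
X = np.column_stack([np.ones(n), x, z, x * z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 4)
V = sigma2 * np.linalg.inv(X.T @ X)  # coefficient covariance matrix

# Marginal effect of x at a given z is beta_x + beta_xz * z;
# Var(beta_x + beta_xz * z) = V_xx + z^2 * V_xz,xz + 2z * Cov(beta_x, beta_xz).
z_grid = np.linspace(z.min(), z.max(), 5)
me = beta[1] + beta[3] * z_grid
se = np.sqrt(V[1, 1] + z_grid**2 * V[3, 3] + 2 * z_grid * V[1, 3])

for zv, m, s in zip(z_grid, me, se):
    print(f"z = {zv:+.2f}: marginal effect of x = {m:.2f} (se {s:.2f})")
```

The single coefficient on x only tells you the effect at z = 0 (or at mean(Z), if centered); the table printed here is the kind of whole-range summary Brambor et al. recommend, typically displayed as a graph.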

Brambor et al. make a compelling case for including all “constitutive” terms in a regression to minimize the bias from their omission. By this they refer, e.g., to including all of the one-way and two-way components leading up to a three-way interaction model. While they cite the more broadly methodological Aiken and West text approvingly, they unambiguously contradict A&W’s prescriptions regarding the impact of centering on multicollinearity. Brambor et al.’s position is that centering has no effect on multicollinearity and, by extension, that the importance of collinearity has been exaggerated in the literature: “Even if there really is high multicollinearity and this leads to large standard errors on the model parameters, it is important to remember that these standard errors are never in any sense ‘too’ large—they are always the ‘correct’ standard errors. High multicollinearity simply means that there is not enough information in the data to estimate the model parameters accurately and the standard errors rightfully reflect this” (p. 8). They dismiss the standard errors of the parameters as secondary to the overall standard error of the model.

This is a big, interesting methodological shift in focus. The question is, does their prescription make sense? At least since Belsley, Kuh, and Welsch’s book, the field of regression diagnostics has used metrics such as the VIF or eigenvalue decompositions to detect collinearity among model parameters. As a practitioner, I am less interested in theoretical nostrums — e.g., “high multicollinearity simply means (sparse) data” — than in the impact that overfitting can have in degrading model predictions. Experience tells me to take Brambor et al.’s advice with more than a grain of salt.

Daniel:

The statement, “centering alters nothing important statistically and nothing at all substantively,” is correct for maximum likelihood estimation but not for Bayesian inference or for just about any other regularization procedure, including lasso or even stepwise regression. It is a common error to evaluate the properties of a particular statistical model without recognizing that any one model is nested within a larger procedure of model choice.
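For plain least squares, the invariance under centering is easy to verify numerically. A small sketch with simulated data (all values made up): corr(x, x²) is nearly 1 for a positive-valued predictor, drops after centering, yet the fitted values are unchanged because {1, x, x²} and {1, x − x̄, (x − x̄)²} span the same column space.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=500)                  # positive-valued predictor
y = 1.0 + 0.5 * x + 0.1 * x**2 + rng.normal(size=500)

def quad_fit(pred):
    """OLS of y on 1, pred, pred^2; return fitted values and corr(pred, pred^2)."""
    X = np.column_stack([np.ones_like(pred), pred, pred**2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta, np.corrcoef(pred, pred**2)[0, 1]

yhat_raw, corr_raw = quad_fit(x)
yhat_ctr, corr_ctr = quad_fit(x - x.mean())

print(abs(corr_raw) > 0.99)                # True: raw x and x^2 nearly collinear
print(abs(corr_ctr) < abs(corr_raw))       # True: centering removes the collinearity
print(np.allclose(yhat_raw, yhat_ctr))     # True: but the fitted model is unchanged
```

Under regularization (a lasso penalty, an informative prior), the two parameterizations would no longer give identical fits, which is the point above about the model being nested in a larger procedure.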

Any comments about the solution suggested here:

[Estimating Heterogeneous Treatment Effects with Observational Data](https://db.tt/fzKBdWLm)

The simplest version of this, in my opinion, is comparing estimates of delta = E[Y|D=1] – E[Y|D=0] from post-stratification on a discrete covariate X versus from OLS with fixed effects for the levels of X.

Post-strat weights by the N in each cell, N_x.

With fixed effects, OLS is doing precision weighting in estimating delta, such that each cell gets weight N_x Var(D|X=x).

This is discussed in Angrist and Pischke’s Mostly Harmless. Though if I recall correctly, A&P just point out the weighting function for fixed effects, and don’t really suggest using post-strat.
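A toy numerical version of this comparison (cell sizes, treatment shares, and cell-level effects all made up) shows the two weighting schemes diverging whenever effects vary across cells:

```python
import numpy as np

# Two strata of X: cell 0 has N=800, P(D=1)=0.5, effect 1.0;
# cell 1 has N=200, P(D=1)=0.1, effect 5.0. All numbers invented.
sizes   = np.array([800.0, 200.0])
p_treat = np.array([0.5, 0.1])
effects = np.array([1.0, 5.0])

# Post-stratification: weight each cell by N_x / N -> the SATE.
w_post = sizes / sizes.sum()
sate = (w_post * effects).sum()

# OLS with fixed effects for X: each cell gets weight N_x * Var(D | X = x),
# i.e. N_x * p_x * (1 - p_x) for binary D, normalized to sum to one.
w_ols = sizes * p_treat * (1 - p_treat)
w_ols /= w_ols.sum()
ols_estimand = (w_ols * effects).sum()

print(round(sate, 2))          # 1.8
print(round(ols_estimand, 2))  # 1.33: cell 0 is up-weighted because D varies more there
```

The fixed-effects estimand leans toward cells where treatment status varies most, so it understates the contribution of the small, rarely treated, high-effect cell.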

Andrew, do you think that part of the reason we see fewer experiments is also the institutional structure of the top journals? I can only speak to my experience (still relatively new to the field) in quantitative political economy and finance, but there often seems to be a greater allure to a fancy or beautiful statistical/mathematical model structure, created to deal with the imperfections of The Standard Data Set, whereas the effort and reality of trying to run a real experiment is often viewed, at least in my experience, as dirty.