Dan Gibbons writes:

I have been looking at using synthetic control estimates for estimating the effects of healthcare policies, particularly because, for county-level data, say, the nontreated comparison units one would use in a difference-in-differences estimator or quantile DID estimator (if one didn’t want to use the mean) are not especially clear. However, given that most of the data on alcohol or drug use come from surveys, this would involve modifying these procedures to use either weighting or Bayesian hierarchical models to deal with the survey design. The current approach is just to assume that the sample means are THE aggregate population-level means, and to ignore the sampling error. I don’t like this much, especially because with the synthetic control approach specifically the constructed “artificial” region may be built out of the wrong donor units if the sampling errors of the donor units are ignored.

My main subsequent worry here is that using a model-based weighting approach followed by the synthetic control estimator would basically be too much complexity to estimate from surveys in practice, in that sample sizes needed would be inflated and the estimates could become deeply unreliable (my fellow econometricians tend to rely on asymptotic results here, but I don’t trust them much in practice).

I am wondering from your perspective, whether it is best with cases that are more complicated than simple regression to either ignore the survey design, use the given sample survey weights or use a model-based approach? I know there is no “answer” to this question, I just worry that the current approach in economics which largely involves shrugging is not acceptable and thought you might be able to give some pointers.

My reply:

1. I generally think it’s best to include in your regression any relevant pre-treatment variables that are predictive of inclusion in the survey, and also to consider interactions of the treatment effect with these predictors. This points toward an MRP (multilevel regression and poststratification) approach, in which you use multilevel modeling to get stable estimates of your coefficients, and you poststratify to get average treatment effects of interest.

I’d like a good worked example of this sort of analysis in the causal inference context. Once we have something specific to point to, it should be easier for people to follow along and apply this method to their problems.

2. Regarding “shrugging is not acceptable”: one way to demonstrate or to check this is through a fake-data simulation.
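Here is a minimal sketch of what such a fake-data simulation could look like. Everything in it is made up for illustration: the two-donor setup, the series, the function names, and the shortcut of drawing each survey mean directly with standard error sigma/sqrt(n) rather than simulating individual respondents. The treated unit is truly a 50/50 mix of two donors in the pre-period, and with only two donors the simplex-constrained least-squares weight has a closed form, so you can see directly how the fitted weight scatters as the survey sample size shrinks.

```python
import random
import statistics

def fit_weight(treated, donor_a, donor_b):
    """Least-squares weight on donor_a (donor_b gets 1 - w), clipped to [0, 1]."""
    num = sum((y - b) * (a - b) for y, a, b in zip(treated, donor_a, donor_b))
    den = sum((a - b) ** 2 for a, b in zip(donor_a, donor_b))
    return min(1.0, max(0.0, num / den))

def noisy(series, n, sigma, rng):
    """Survey estimate of each period mean: truth plus error with sd sigma/sqrt(n)."""
    return [m + rng.gauss(0.0, sigma / n ** 0.5) for m in series]

def simulate(n, reps=500, sigma=1.0, seed=1):
    """Return the mean and sd of the fitted weight across fake surveys of size n."""
    rng = random.Random(seed)
    # Hypothetical true pre-period means, already rescaled to be O(1).
    donor_a = [1.0, 1.1, 1.3, 1.2, 1.5, 1.6, 1.8, 1.7]
    donor_b = [2.0, 1.8, 1.9, 1.6, 1.5, 1.3, 1.2, 1.0]
    treated = [0.5 * a + 0.5 * b for a, b in zip(donor_a, donor_b)]
    weights = []
    for _ in range(reps):
        weights.append(fit_weight(noisy(treated, n, sigma, rng),
                                  noisy(donor_a, n, sigma, rng),
                                  noisy(donor_b, n, sigma, rng)))
    return statistics.mean(weights), statistics.stdev(weights)

for n in (50, 500, 5000):
    mean_w, sd_w = simulate(n)
    print(f"n per unit-period = {n:5d}: mean weight {mean_w:.2f}, sd {sd_w:.2f}")
```

If the spread of the fitted weight is large at the sample sizes you actually have, that is evidence that ignoring the sampling error matters for your problem. A more realistic version would use the real survey design and more than two donors.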

So you are saying there are two sources of variation in the synthetic control estimate: 1) the kind Abadie and co. first describe, which comes from “natural” drift between the SC group and the treatment group in the post period, and which is captured by their placebo test; and 2) the fact that the aggregate group-by-period observations we use are themselves estimates.

Can you bootstrap the group-by-period level observations every time you run a different randomization permutation (giving you very many new randomization inference draws)? Or, I guess I should ask this first: what is the thing you want uncertainty about? The effect size? Or the synthetic control weights themselves? And how would you think to represent that uncertainty (in a graph of placebos? as a pseudo-confidence interval? As a kind of p-value?)?

I also am not sure there is much of any role for asymptotic theory in SC, in the sense that it has never been clear to me what is “random” there. Not the treatment assignment, not the timing of treatment, not the donor units, only the period-to-period noise within a group. You are right that part of that noise comes from sampling variation in the observations themselves and that that part of the error seems different than the kind that is just unexplained by X. And we should try to account for that. But I prefer to think of it in terms of re-sampling, because that is the only thing that makes sense to me in the SC context.
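To make the re-sampling idea a bit more concrete, here is a rough sketch, assuming you have respondent-level survey data for each unit and period. It resamples respondents within each unit-period, recomputes the means, refits the weight, and recomputes the post-period gap on every draw. The data are fake, there are only two donors, and the plain with-replacement bootstrap ignores survey weights and clustering; all names and numbers are hypothetical. It addresses only the sampling-variation part, so the placebo permutations would have to be wrapped around this loop.

```python
import random

def fit_weight(treated, donor_a, donor_b):
    """Least-squares weight on donor_a (donor_b gets 1 - w), clipped to [0, 1]."""
    num = sum((y - b) * (a - b) for y, a, b in zip(treated, donor_a, donor_b))
    den = sum((a - b) ** 2 for a, b in zip(donor_a, donor_b))
    return min(1.0, max(0.0, num / den))

def resampled_means(cells, rng):
    """cells holds one list of respondent outcomes per period; resample each one."""
    draws = (rng.choices(cell, k=len(cell)) for cell in cells)
    return [sum(draw) / len(draw) for draw in draws]

def bootstrap_gap(treated, donor_a, donor_b, n_pre, reps=300, seed=2):
    """Percentile interval for the mean post-period gap, refitting the weight each draw."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(reps):
        t, a, b = (resampled_means(u, rng) for u in (treated, donor_a, donor_b))
        w = fit_weight(t[:n_pre], a[:n_pre], b[:n_pre])
        post = [ti - (w * ai + (1 - w) * bi)
                for ti, ai, bi in zip(t[n_pre:], a[n_pre:], b[n_pre:])]
        gaps.append(sum(post) / len(post))
    gaps.sort()
    return gaps[int(0.025 * reps)], gaps[int(0.975 * reps)]

def fake_unit(means, n, rng):
    """n fake respondents per period, outcome sd of 1 around each period mean."""
    return [[rng.gauss(m, 1.0) for _ in range(n)] for m in means]

rng = random.Random(0)
a_means = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
b_means = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0]
# Treated unit: a 70/30 mix of the donors, with a made-up effect of -0.5
# in the last two (post-treatment) periods.
t_means = [0.7 * a + 0.3 * b for a, b in zip(a_means, b_means)]
t_means[4] -= 0.5
t_means[5] -= 0.5

treated = fake_unit(t_means, 400, rng)
donor_a = fake_unit(a_means, 400, rng)
donor_b = fake_unit(b_means, 400, rng)
lo, hi = bootstrap_gap(treated, donor_a, donor_b, n_pre=4)
print(f"95% bootstrap interval for the post-period gap: ({lo:.2f}, {hi:.2f})")
```

Whether you summarize the draws as an interval for the effect, as a distribution of weights, or as a cloud of placebo paths depends on which uncertainty you care about, which is the question raised above.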

Also, philosophical nitpicking: the SC weights WILL be constructed from the “wrong” donor units, because there is no such thing as the “right” donor units. I mean, what in the world would that even mean? There are no “right” units, because the world doesn’t work that way (Maine is not 1/3 Texas and 2/3 Idaho). Sorry, sometimes I get all metaphysical about synthetic controls. But despite the way many people interpret their SC models, I tend to think there is very little “structural” matching going on (based on substantive similarities in predictors across groups) and very much “level” matching (in the sense of the pre-period Y matching doing most of the work). And once you think that SC is just matching on noise, you start to worry that it is like the Garden of Forking Matches…. instead of researchers looking for all the stars and coming up with an interpretation they find meaningful, the algorithm just finds for you the weighting scheme that gets the most stars (best pre-period fit) and so you believe it must be meaningful.

Now watch one of the smart people who invented this come and school me.

I’m not particularly familiar with synthetic control literature, but reading the wiki on the technique makes me think they have a Bayesian interpretation in terms of a mixture model. The weights are mixture weights, and live on a simplex. The idea is that:

CounterfactualOutcome(G1) = Outcome(G2)w(G2) + Outcome(G3)w(G3) + … + Outcome(Gn)w(Gn) + Error

so that you can estimate what would have happened if G1 hadn’t had some treatment using what did happen in G2..Gn groups that were in fact untreated

Is that about the sum of it (pun!)?
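For concreteness, here is one hedged way to compute such weights: least squares on the pre-period outcomes, with the weights constrained to the simplex, solved by projected gradient descent. This is a sketch of the general idea and not the estimator from any particular paper or package; the function names and the toy series are invented, and the treated series is built as exactly 0.3 of the first donor plus 0.7 of the second so that the answer is known.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        cumulative += u_i
        candidate = (cumulative - 1.0) / i
        if u_i - candidate > 0:
            theta = candidate  # keep the threshold from the last index that passes
    return [max(x - theta, 0.0) for x in v]

def synthetic_weights(treated, donors, iters=2000):
    """Minimise 0.5 * sum_t (treated_t - sum_j w_j * donors[j][t])^2 over the simplex."""
    n = len(donors)
    w = [1.0 / n] * n
    # Conservative step size: the sum of squared donor values bounds the curvature.
    step = 1.0 / sum(x * x for d in donors for x in d)
    for _ in range(iters):
        fitted = [sum(w[j] * donors[j][t] for j in range(n))
                  for t in range(len(treated))]
        resid = [f - y for f, y in zip(fitted, treated)]
        grad = [sum(r * x for r, x in zip(resid, donors[j])) for j in range(n)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(n)])
    return w

# Toy pre-period outcomes for three untreated groups (made-up numbers).
donors = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 1.0, 3.0, 1.0, 3.0],
]
treated = [0.3 * a + 0.7 * b for a, b in zip(donors[0], donors[1])]

weights = synthetic_weights(treated, donors)
print([round(x, 3) for x in weights])
```

The counterfactual for later periods is then the same weighted sum applied to the donors' post-period outcomes.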

If so, the next thing to realize is that the weights are designed to make the linear combination of other groups “a lot like” group G1, and so if there are measurements which were plausibly unaffected by your treatment (say demographic makeup, income, consumption of durable goods, whatever), then “good” values for the weights will make all of these other measurements track the G1 group’s measurements to within what should be considered “reasonable” error. Here “reasonable” error means “a prior for Error_i” for each i, where i indexes the control measurements you have.

So, good Bayesian uncertainty on the weights comes from treating your Counterfactual estimate as one of many estimates coming from the mixture model, finding w values that simultaneously make the synthetic control do a good job of tracking all the estimates, and using the full posterior to estimate the counterfactual of interest.

I should say that it makes a LOT of sense to take all your various measurements and convert them to dimensionless ratios of O(1). For example incomes as fraction of GDP/capita, durable good consumption as fraction of national per capita consumption, rate of smoking as fraction of national rate, percent of black population as fraction of national percentage, etc. so that you’re not trying to match income in dollars, rate of smoking in fractions of the population, percentage of black population in percentage points… etc

I also see that I’ve invented a more generalized model here than what is done. Typical synthetic control stuff seems to match only the outcome variable of interest, and only in the pre-treatment period (and, as is typically done in non-Bayesian analyses, it would typically use a point estimate). That seems like a recipe for overfitting. Instead, matching lots of outcomes, including some that are not plausibly affected by the treatment, and using both the pre- and post-treatment periods should produce a less over-fit model.

What I worry about with synthetic control groups is this: What if the very process of finding the weighted matches that most closely fit the pre-treatment trend simply IS overfitting? I.e., what if the size of the treatment effect you find is actually a measure of how much you overfit the pre-treatment period? Maybe the placebo tests handle this concern sufficiently, but how do we know?

Stuart:

For that matter, placebo tests are great in principle but they are regularly overinterpreted. Lots of cases where the placebo result is not statistically significant and then the researcher concludes that the main model is fine. If the researcher is not careful, this process runs into the “difference between significant and non-significant is not itself statistically significant” issue.

…that one took me a minute, but it is clever. Like when the placebo estimate is almost the exact same size as the real estimate, but it has a slightly bigger standard error and that makes the p-value 0.11, and thus the placebo estimate is 0 and the real estimate is the point estimate. I should probably watch more carefully for that.
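A tiny numerical version of that scenario, with made-up estimates and standard errors, a normal approximation, and the assumption that the two estimates are independent:

```python
import math

def two_sided_p(estimate, se):
    """Two-sided p-value under a normal approximation."""
    z = abs(estimate) / se
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical numbers: identical point estimates, slightly different standard errors.
real_est, real_se = 1.0, 0.5
placebo_est, placebo_se = 1.0, 0.625

p_real = two_sided_p(real_est, real_se)           # z = 2.0, below 0.05
p_placebo = two_sided_p(placebo_est, placebo_se)  # z = 1.6, about 0.11

# The comparison that matters: real minus placebo.
diff = real_est - placebo_est
diff_se = math.sqrt(real_se ** 2 + placebo_se ** 2)  # assumes independence
p_diff = two_sided_p(diff, diff_se)

print(f"real p = {p_real:.3f}, placebo p = {p_placebo:.3f}, difference p = {p_diff:.3f}")
```

The real estimate clears the 0.05 threshold and the placebo does not, yet the difference between them is exactly zero, so the placebo result gives no support to the main model.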

In my reformulation above in terms of synthesizing the control to be “similar” on many dimensions, a zero-avoiding prior on the Error term is one way to prevent over-fitting. You KNOW that the control shouldn’t match very precisely in every measurement, so tell the model. Once you have renormalized everything to be O(1) as recommended above, something like

sdErr ~ gamma(5,4/0.2);

which says that the sd of the error terms should be somewhere in the range 0.05 to 0.5, is a useful device to prevent overfitting. You might even go tighter: gamma(10,9/.2), which has a high-probability range of 0.1 to 0.4 or so.
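A quick check of those ranges, assuming the shape-rate parametrization of the gamma (the one Stan uses). Under that assumption the mode is (shape - 1)/rate, so writing the rate as 4/0.2 or 9/0.2 puts the mode at 0.2 in both cases. For integer shape the CDF has a closed form, so no sampling is needed; the helper names are mine.

```python
import math

def gamma_cdf_integer_shape(x, shape, rate):
    """CDF of Gamma(shape, rate) for integer shape (the Erlang closed form)."""
    lam = rate * x
    tail = sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(shape))
    return 1.0 - tail

def prior_mass(lo, hi, shape, rate):
    """Prior probability that the error sd falls between lo and hi."""
    return (gamma_cdf_integer_shape(hi, shape, rate)
            - gamma_cdf_integer_shape(lo, shape, rate))

for shape, rate, lo, hi in [(5, 4 / 0.2, 0.05, 0.5), (10, 9 / 0.2, 0.1, 0.4)]:
    mode = (shape - 1) / rate
    mass = prior_mass(lo, hi, shape, rate)
    print(f"gamma({shape}, {rate:g}): mode {mode:.2f}, "
          f"P({lo} < sd < {hi}) = {mass:.3f}")
```

Both priors put roughly 97 percent of their mass in the stated range, and almost none near zero, which is what makes them zero-avoiding.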

Plus, the Bayesian posterior considers all the possibilities in the high probability range, and so does much less over-fitting than a point estimate.
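As a toy illustration of using the whole posterior rather than a point estimate, here is a grid approximation for the simplest case: two donors, so a single weight w on [0, 1], five rescaled measurements, independent normal errors, a flat prior on w, and the gamma(5, 4/0.2) prior on the error sd from above. The measurements are invented and the normal error model is my assumption, not something the synthetic control literature prescribes.

```python
import math

def log_gamma_prior(sd, shape=5.0, rate=4 / 0.2):
    """Unnormalised log density of the gamma(shape, rate) prior on the error sd."""
    return (shape - 1.0) * math.log(sd) - rate * sd

def log_lik(w, sd, target, donor_a, donor_b):
    """Independent normal errors around the w-mixture of the two donors."""
    total = 0.0
    for y, a, b in zip(target, donor_a, donor_b):
        resid = y - (w * a + (1.0 - w) * b)
        total += -math.log(sd) - 0.5 * (resid / sd) ** 2
    return total

def weight_posterior(target, donor_a, donor_b, grid=101):
    """Posterior for w on a grid, with the error sd summed out on its own grid."""
    ws = [i / (grid - 1) for i in range(grid)]  # flat prior on w over [0, 1]
    sds = [0.01 * k for k in range(1, 101)]     # sd grid from 0.01 to 1.00
    post = []
    for w in ws:
        terms = (log_lik(w, s, target, donor_a, donor_b) + log_gamma_prior(s)
                 for s in sds)
        post.append(sum(math.exp(t) for t in terms))
    total = sum(post)
    return ws, [p / total for p in post]

# Made-up measurements, each expressed as a fraction of the national value.
donor_a = [0.8, 1.2, 1.0, 0.9, 1.3]
donor_b = [1.2, 0.8, 1.1, 1.4, 0.7]
target = [1.10, 0.90, 1.05, 1.25, 0.90]

ws, post = weight_posterior(target, donor_a, donor_b)
mean_w = sum(w * p for w, p in zip(ws, post))
cum, lo, hi = 0.0, None, None
for w, p in zip(ws, post):
    cum += p
    if lo is None and cum >= 0.05:
        lo = w
    if hi is None and cum >= 0.95:
        hi = w
print(f"posterior mean of w = {mean_w:.2f}, 90% interval ({lo:.2f}, {hi:.2f})")
```

Any counterfactual computed from these weights inherits the spread of the posterior, rather than treating one best-fitting weight as known.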

Many are not that good at dealing with open-endedness: there are always realistic possibilities of being misled that can’t be ruled out or have their “frequency” assessed/quantified. So they tend to downplay them.

For instance, near the end of this nice two-minute animated explanation of Mendelian randomization by George Davey Smith, there is the “we don’t think so” comment regarding ruling out other effect pathways. George probably worries a lot about this, but I expect most researchers will try to downplay this in their work and publications.

It’s the standard “trick” of briefly and vaguely putting the limitations somewhere near the end of the paper. The most notorious one is by Yule, as related by Freedman: “However, there is one deft footnote (number 25) that withdraws all causal claims: ‘Strictly speaking, for “due to” read “associated with.”’” https://web.stanford.edu/class/ed260/freedman521.pdf