There’s a paradigm in applied statistics that goes something like this:
1. There is a scientific or policy question of some theoretical or practical importance.
2. Researchers gather data on relevant outcomes and perform a statistical analysis, ideally leading to a clear conclusion (p less than 0.05, or a strong posterior distribution, or good predictive performance, or high reliability and validity, whatever).
3. This conclusion informs policy.
This paradigm has room for positive findings (for example, that a new program is statistically significantly better, or statistically significantly worse, than what came before) or negative findings (data are inconclusive, further study is needed), even if negative findings seem less likely to make their way into the textbooks.
But what happens when step 2 simply isn’t possible? This came up a few years ago (nearly 10 years ago, now!) with the excellent paper by Donohue and Wolfers, which explained why it’s just about impossible to use aggregate crime statistics to estimate the deterrent effect of the death penalty. But punishment policies still need to be set; we as a society just need to set these policies without the kind of clear evidence that one might like.
Another example, where the aggregate statistical evidence is even weaker (and, again, with no real prospect of improvement), was pointed out to me by sociologist Philip Cohen, who wrote:
In a (paywalled) article in the journal Family Relations, Alan Hawkins, Paul Amato, and Andrea Kinghorn, attempt to show that $600 million in marriage promotion money (taken from the welfare program!) has had beneficial effects at the population level. . . .
Cohen noticed a bunch of statistical problems with the published paper (see this recent entry on the sister blog for links and further discussion), but really the problem is much deeper than the flaws of one particular paper. It’s just going to be nearly impossible to learn much about the effects of such a program from aggregate state-level statistics (Cohen says that the paper looks at: percentage of the population that is married, divorced, children living with two parents, one parent, nonmarital births, poverty and near-poverty). There’s just no way.
That’s fine. I’m not saying that a new program shouldn’t be implemented or expanded just because of lack of evidence. My point is that in cases such as this I think we need to discard the paradigm of steps 1, 2, 3 above. It could be possible to study effects via a more targeted analysis, but I don’t think the aggregate analysis tells us much of anything at all. And I think this limitation can be difficult to talk about because of the pressure to demonstrate that a program “works.”
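As a rough illustration of why aggregate state-level comparisons can be so uninformative, here is a minimal simulation sketch in Python. Every number in it is a made-up assumption for illustration, not something taken from the paper or from Cohen's critique: 50 states, a hypothetical small true effect (`TRUE_EFFECT`), a hypothetical funding exposure per state, and state-to-state noise (`NOISE_SD`) that is large relative to that effect. The sketch repeatedly fits a simple least-squares slope of outcome on funding and looks at how much the estimate bounces around.

```python
import random
import statistics

# All values below are hypothetical and chosen only for illustration.
N_STATES = 50        # one aggregate observation per state
TRUE_EFFECT = 0.2    # assumed true change in the outcome per unit of funding
NOISE_SD = 1.0       # assumed state-level variation unrelated to the program
N_SIMS = 2000        # number of simulated "studies"


def simulate_slope(n_states, true_effect, noise_sd, rng):
    """Simulate one aggregate study and return the least-squares slope."""
    # Hypothetical funding exposure per state, in arbitrary units from 0 to 1.
    funding = [rng.random() for _ in range(n_states)]
    # Outcome = true effect times funding, plus unrelated state-level noise.
    outcome = [true_effect * f + rng.gauss(0, noise_sd) for f in funding]
    mean_f = statistics.fmean(funding)
    mean_y = statistics.fmean(outcome)
    sxy = sum((f - mean_f) * (y - mean_y) for f, y in zip(funding, outcome))
    sxx = sum((f - mean_f) ** 2 for f in funding)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the run is reproducible
estimates = [
    simulate_slope(N_STATES, TRUE_EFFECT, NOISE_SD, rng) for _ in range(N_SIMS)
]

# Spread of the estimates across simulated studies.
spread = statistics.stdev(estimates)
# Share of simulated studies in which the estimate has the wrong sign.
wrong_sign = sum(e < 0 for e in estimates) / len(estimates)

print(f"true effect:            {TRUE_EFFECT}")
print(f"sd of estimated effect: {spread:.2f}")
print(f"share with wrong sign:  {wrong_sign:.2f}")
```

Under these assumed settings, the spread of the estimates is larger than the effect being estimated, and a sizable share of simulated studies get the sign wrong. Different assumptions would give different numbers; the sketch only illustrates the general difficulty, and it says nothing about what the actual effect of any program is.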