A few months ago I reacted (see further discussion in comments here) to a recent study on early childhood intervention, in which researchers Paul Gertler, James Heckman, Rodrigo Pinto, Arianna Zanolini, Christel Vermeersch, Susan Walker, Susan M. Chang, and Sally Grantham-McGregor estimated that a particular intervention on young children had raised their incomes as young adults by 42%. I wrote:
Major decisions on education policy can turn on the statistical interpretation of small, idiosyncratic data sets — in this case, a study of 129 Jamaican children. . . . Overall, I have no reason to doubt the direction of the effect, namely, that psychosocial stimulation should be good. But I’m skeptical of the claim that income differed by 42%, due to the reason of the statistical significance filter. In section 2.3, the authors are doing lots of hypothesizing based on some comparisons being statistically significant and others being non-significant. There’s nothing wrong with speculation, but at some point you’re chasing noise and picking winners, which leads to overestimates of magnitudes of effects.
After seeing this, Aaron wrote to me:
It seems to me that there should be a standard rule of skepticism. It takes a point estimate and standard error for a significant effect (do you need sample size too?) and divides point estimate by something and multiplies standard error by something to get a Posterior under the new principle of ignorance.
That is, suppose you don’t start with priors but I just tell you someone studied something and published a study saying something. What do you believe!
Then if you know something from past experience or think you do you can adjust further.
What is your posterior having read the study by Gertler and Heckman?
Yes, this is basic Bayes. The prior for any effect is centered at 0 with some moderate standard deviation. For example, most educational interventions don’t do much, some have small effects, very few have effects as large as 42%. But maybe Gertler et al. would not agree. And, from a political standpoint, it’s savvy for them to ignore me. [Here I was expressing frustration because I’d contacted Gertler, the first author of the aforementioned paper, several times regarding my concerns about his effect size estimate, and he did not respond.] If my article had been published in the New York Review of Books rather than Symposium magazine, maybe things would be different, but, then again, I doubt the New York Review of Books would be particularly interested in someone expressing skepticism on early childhood intervention. . . .
Ah. But I am suggesting an actual rule of thumb number that tells me how much to change them. (As a guess). Gelman’s contribution is the actual factors.
Bayes tells you how to do it in principle based on well specified priors. I want a rule of thumb I can carry with me and implement on the spot. Twenty years ago, I heard of a folk theorem called the “Iron Law of Econometrics.” It is that all estimated coefficients are biased toward 0 because of errors in variables. The name is possibly due to Hausman, though the idea is much older of course.
Indeed, there are several different reasons for effect size estimates to be overestimated. Above I mentioned the statistical significance filter and multiplicity (that is, researchers can choose the best among various possible comparisons to summarize their data); Aaron pointed out the ubiquity of measurement error; and there are a bunch of other concerns. For example, here’s Charles Murray:
The most famous evidence on behalf of early childhood intervention comes from the programs that Heckman describes, Perry Preschool and the Abecedarian Project. The samples were small. Perry Preschool had just 58 children in the treatment group and 65 in the control group, while Abecedarian had 57 children in the treatment group and 54 in the control group. In both cases the people who ran the program were also deeply involved in collecting and coding the evaluation data, and they were passionate advocates of early childhood intervention.
Murray continues with a description of an attempted replication, a larger study of 1000 children that reported minimal success, and concludes:
To me, the experience of early childhood intervention programs follows the familiar, discouraging pattern …small-scale experimental efforts staffed by highly motivated people show effects. When they are subject to well-designed large-scale replications, those promising signs attenuate and often evaporate altogether.
Heckman replies here. I’m not convinced by his reply on this particular issue, but of course he’s an expert in education research so his article is worth reading in any case.
Regardless, I give Heckman credit for making predictions about effects of future programs. In a 2013 interview, he says:
They’re just now conducting evaluations of Educare, and I know because some of it is being conducted here in Chicago, by people I know and respect. Educare is based on a program that has been evaluated. It is an improvement on the Abecedarian program for which the results are highly favorable. Evidence from the Abecedarian program provides a solid “lower-bound” of what Educare’s results will probably be.
I’d guess the opposite. Given all the reasons above for suspecting that published results are overestimates, I’d guess that, in a fully controlled study with a preregistered analysis, the results for this new study would be less impressive than in the earlier published results. On the other hand, I don’t know anything about the details of any of these programs so in particular have no sense of how much Educare is an improvement on Abecedarian.
Heckman sees the earlier published results as a lower bound because he sees improvements in the interventions (which makes sense; people are working on these interventions and they should be getting better). I see the published results as an overestimate (but not an “upper bound,” because any estimate based on only 111 kids has got to be too noisy to be considered a “bound” in any sense) based on my generic understanding of open-ended statistical analyses.
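To make the statistical significance filter concrete, here is a minimal simulation sketch in Python. The true effect (0.10) and standard error (0.20) are made-up numbers chosen for illustration, not values from any of the studies discussed here, and the function name is hypothetical. The idea is just to draw many noisy estimates of the same small effect, keep only the ones that reach conventional significance, and see how far those "published" estimates sit from the truth.

```python
import random
import statistics


def significance_filter(true_effect, std_error, n_sims=100_000, z_crit=1.96, seed=0):
    """Simulate noisy estimates of a positive true effect and summarize
    only those that pass a two-sided significance test.

    All inputs are illustrative assumptions, not numbers from any study.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        # One hypothetical study: unbiased estimate with normal noise.
        estimate = rng.gauss(true_effect, std_error)
        # Keep it only if it clears the significance threshold.
        if abs(estimate) > z_crit * std_error:
            significant.append(estimate)

    # Share of simulated studies that reach significance (the power).
    share_significant = len(significant) / n_sims
    # Average magnitude of the significant estimates, relative to the truth.
    mean_abs = statistics.fmean([abs(e) for e in significant])
    exaggeration = mean_abs / true_effect
    # Share of significant estimates that point in the wrong direction.
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return share_significant, exaggeration, wrong_sign


power, exaggeration, wrong_sign = significance_filter(true_effect=0.10, std_error=0.20)
print(f"share significant: {power:.3f}")
print(f"average exaggeration among significant results: {exaggeration:.1f}x")
print(f"share of significant results with the wrong sign: {wrong_sign:.3f}")
```

With these assumed inputs, any estimate that passes the filter has to be at least 1.96 standard errors from zero, so it is at least 3.9 times the true effect by construction. The noisier the study relative to the true effect, the worse this gets.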
What’s the Edlin factor?
To return to Aaron’s original question: is there a universal “Edlin factor” that we can use to scale down published estimates? For example, if we see “42%” should we read it as “21%”? That would be an Edlin factor of 1/2 (or, as the economists would say, “an elasticity of 0.5”).
But it doesn’t seem that any single scale-down factor would work. For example, when somebody-or-another published the claim that beautiful parents were 36% more likely to have girls, I’m pretty sure this was an overestimate by at least a factor of 100 (as well as being just about as likely as not to be in the wrong direction). Using an Edlin factor of 1/2 in that case would be taking the claim way too seriously. On the other hand, I’m pretty sure that, if we routinely scaled all published estimates by 1/2, we’d be a lot closer to the truth than we are now, using a default Edlin factor of 1.
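One way to see why the factor has to depend on context is the textbook normal-normal calculation: with a prior centered at 0 and a normally distributed estimate, the posterior mean is the estimate multiplied by prior_sd² / (prior_sd² + std_error²). Here is a minimal sketch, assuming a hypothetical standard error of 0.20 for an estimate of 0.42 (the standard error is my placeholder, not a number from the paper) and a range of made-up prior standard deviations. It accounts only for the prior, not for selection on statistical significance or for multiplicity, so it should be read as an illustration rather than as a recommended rule.

```python
import math


def shrink(estimate, std_error, prior_sd):
    """Normal-normal shrinkage toward a prior mean of 0.

    Returns (factor, posterior_mean, posterior_sd), where factor is the
    multiplier applied to the raw estimate. The function name and all
    inputs used below are illustrative assumptions.
    """
    prior_var = prior_sd ** 2
    data_var = std_error ** 2
    # Weight on the data: how much of the raw estimate survives.
    factor = prior_var / (prior_var + data_var)
    posterior_mean = factor * estimate
    # Posterior variance is prior_var * data_var / (prior_var + data_var),
    # which equals factor * data_var.
    posterior_sd = math.sqrt(factor * data_var)
    return factor, posterior_mean, posterior_sd


estimate, std_error = 0.42, 0.20  # std_error is a hypothetical placeholder
for prior_sd in (0.05, 0.10, 0.20, 0.50):
    factor, post_mean, post_sd = shrink(estimate, std_error, prior_sd)
    print(f"prior sd {prior_sd:.2f}: factor {factor:.2f}, "
          f"posterior mean {post_mean:.3f}, posterior sd {post_sd:.3f}")
```

Under these assumptions, a prior standard deviation equal to the standard error gives a factor of exactly 1/2, and a tighter prior pushes the factor toward 0. A prior appropriate for sex-ratio effects would be far tighter than one for income effects, which is why the same published percentage can deserve very different factors.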
Here’s another example where it would’ve helped to have an Edlin factor right from the start.
What do you all think?
P.S. We also need a better name than “Edlin factor.” I don’t like the word “shrinkage,” it sounds too jargony. Scale-down factor is kind of ok but I think we could come up with something better. The point is for researchers and consumers of research to routinely scale down published estimates, rather than taking them at face value and then later worrying about problems.
P.P.S. I sent the above to Edlin himself, who wrote:
Your refusal to name an Edlin factor seems decidedly non-Bayesian and a bit classical to me. I understand that with more information and context, the factor changes dramatically. But shouldn’t a Bayesian be willing to name a number rather than just say <1? (Also the commenter is correct that the Iron Law moves the opposite direction, which is what I was trying to say, btw. you know that of course. Not sure if that was clear or not in the blog.)
I replied: Maybe. But I think it’s also ok for a Bayesian to say that his prior depends on the problems he might be working on. A universal prior is some sort of average over all possible problems, but in that case a lot of precision might be gained by adding some context. I think an Edlin factor of 1/5 to 1/2 is probably my best guess for that Jamaica intervention example. But in other cases I’d give an Edlin factor of something like 1/100. And there are other settings where something close to 1.0 would make sense. Here I’m thinking of various studies that are repeated year after year and keep coming up with consistent results, for example correlations between income and voting.
Also, yeah, that Iron Law thing sounds horribly misleading. I’d not heard that particular term before, but I was aware of the misconception. I’ll wait on posting more about this now, as a colleague and I are already in the middle of writing a paper on the topic.