The starting point is that we’ve seen a lot of talk about frivolous science: headline-bait such as the study that said that married women are more likely to vote for Mitt Romney when ovulating, or the study that said that girl-named hurricanes are more deadly than boy-named hurricanes. At this point, some of these studies are almost pre-debunked. Reporters are starting to realize that publication in Science or Nature or PNAS is not only no guarantee of correctness but also no guarantee that a study is even reasonable.

But what I want to say here is that even serious research is subject to exaggeration and distortion, partly through the public relations machine and partly because of basic statistics. The push to find and publicize so-called statistically significant results leads to overestimation of effect sizes (type M errors), and crude default statistical models lead to broad claims of general effects based on data obtained from poor measurements and nonrepresentative samples.
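To make the type M point concrete, here is a toy simulation (all numbers invented for the example, not from any study): when a true effect is small relative to the sampling noise, the subset of estimates that happen to reach statistical significance will, on average, exaggerate the effect severely.

```python
# Toy simulation (made-up numbers) of a "type M" error: when a true effect
# is small relative to the noise, the estimates that reach statistical
# significance are, on average, large overestimates of the true effect.
import random

random.seed(1)

true_effect, se = 0.1, 0.5   # hypothetical low-power study: tiny effect, noisy estimate
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]

# Keep only the "statistically significant" results (|estimate| > 1.96 * se).
significant = [e for e in estimates if abs(e) > 1.96 * se]

# Average magnitude of the significant estimates, relative to the truth.
exaggeration = (sum(abs(e) for e in significant) / len(significant)) / true_effect
```

With these made-up settings, only a small fraction of the simulated studies come out significant, and those that do overstate the true effect many times over; the significance filter itself manufactures the exaggeration.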

One example we’ve discussed a lot is that claim of the effectiveness of early childhood intervention, based on a small-sample study from Jamaica. This study is *not* “junk science.” It’s a serious research project with real-world implications. But the results still got exaggerated. My point here is not to pick on those researchers. No, it’s the opposite: *even top researchers exaggerate in this way* so we should be concerned in general.

What to do here? I think we need to proceed on three tracks:

1. Think more carefully about data collection when designing these studies. Traditionally, design is all about sample size, not enough about measurement.

2. In the analysis, use Bayesian inference and multilevel modeling to partially pool estimated effect sizes, thus giving more stable and reasonable output.

3. When looking at the published literature, use some sort of Edlin factor to interpret the claims being made based on biased analyses.
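As a toy sketch of points 2 and 3 (the numbers here are made up, not from any study): partial pooling toward a skeptical prior shrinks a noisy published estimate, and the resulting shrinkage ratio can be read as a rough, formalized Edlin factor.

```python
# Toy sketch (made-up numbers): Bayesian partial pooling of a noisy
# estimated effect toward a skeptical prior, i.e., a formal "Edlin factor".

def pooled_estimate(est, se, prior_mean=0.0, prior_sd=0.5):
    """Posterior mean and sd for a normal likelihood with a normal prior."""
    prec = 1.0 / se**2 + 1.0 / prior_sd**2   # combined precision
    post_mean = (est / se**2 + prior_mean / prior_sd**2) / prec
    post_sd = prec ** -0.5
    return post_mean, post_sd

# A "statistically significant" published estimate: 0.8 with standard error 0.4.
mean, sd = pooled_estimate(0.8, 0.4)

# Fraction of the raw estimate that survives pooling (~0.6 here).
shrinkage = mean / 0.8
```

In a real multilevel analysis the prior (or group-level) scale would itself be estimated from the data rather than fixed at 0.5, but the qualitative behavior is the same: noisy estimates get pulled toward the pooled mean.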

The above remarks are general; indeed, they were inspired by a discussion we had a few months ago about the design and analysis of psychology experiments, as I think there’s some misunderstanding in which people don’t see where assumptions enter various statistical analyses (see, for example, this comment).

I would concede that initial claims may be exaggerated for the reasons you cite in your comments above. But I also think that ‘regression to the mean’ and the ‘base rate fallacy’ have applied. We don’t want to accept it. Much learning is the result of experience [experiment] too. What has had sustained merit is the idea that more individualized attention during the learning process has reaped benefits. A teacher’s attitude & engagement with a child of course has different degrees of positive & negative effects. And I draw an analogy to some homeschooling efforts that yielded good standardized test scores. What I’m suggesting is that quality of engagement & individualized attention are integral to learning.

An open-ended discussion is welcome.

I am not sure whether there are meta-analyses of early childhood interventions [I frown on the term ‘intervention’ anyway].

Z seemed to have some pretty good points in that comment thread you linked to. I like how Z thinks.

You would.

“Traditionally, design is all about sample size, not enough about measurement.”

The more I think about the myriad problems with modern (social and medical) science, the more convinced I am that measurement is crucially important. As you say, it relates directly to design, but it is not only a design issue. Measurement is also related to reliability and validity in well-documented ways, as well as to researcher degrees of freedom/the garden of forking paths. For example, is walking speed a valid and reliable measure of “primed oldness”? Is it the only dependent variable that was measured as part of studies of “primed oldness”?

Measurement also has direct implications for model construction issues that get discussed here a lot. You have rightly pointed out that choices about a model’s likelihood function are, like priors, at least partially subjective and very important for a model’s performance. Decisions about measurement also play a key role in determining what likelihood function does or does not make sense in a particular model.

I just read an interesting paper (pdf) that argues pretty compellingly that other (related) aspects of measurement are central to how results get (mis-)interpreted. The point in this paper is related to but, I think, distinct from validity, reliability, researcher df, and how measurement (partially) determines likelihood functions.

I’ve been thinking the same for a while; in fact, the “replication crisis” aspect is only the most obvious, tiny tip of the iceberg of problems. The vast majority of the problems involve misinterpreting measurements.

Eventually I came to see that there are too many alternative possibilities that can explain something as vague as an “increase/decrease in walking speed.” While in principle you could perhaps figure it out via different conditions, it is infeasible to run experiments that can distinguish between them all.

To check the explanation, you simply need to get a quantitative prediction out of it (e.g., “primed oldness theory says people in group A should walk ~2/3 as fast as group B”; “primed oldness theory says the variance of walking speed from day to day should be at least x%”; etc.).

Put another way: observing results consistent with a theory that allows half of all possible values… is not a severe enough test of the theory. Another thing is that coming up with such predictions is not a statistical problem. The people looking for an answer there will not succeed.
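As a toy illustration of what such a quantitative check might look like (all the walking speeds below are invented for the example, and the ~2/3 ratio is the hypothetical prediction discussed above):

```python
# Toy sketch (made-up data): testing a *quantitative* prediction rather
# than a merely directional one. Suppose the theory predicts group A walks
# ~2/3 as fast as group B; compare the observed speed ratio to that number.

predicted_ratio = 2 / 3

speeds_a = [0.9, 1.0, 0.85, 0.95]    # hypothetical walking speeds, m/s
speeds_b = [1.4, 1.5, 1.45, 1.35]

def mean(xs):
    return sum(xs) / len(xs)

observed_ratio = mean(speeds_a) / mean(speeds_b)

# A crude severity check: is the observed ratio within 10% of the prediction?
consistent = abs(observed_ratio - predicted_ratio) / predicted_ratio < 0.10
```

A directional test (“group A is slower”) would pass for half of all possible outcomes; the point-prediction check above is passed by only a narrow band of them, which is what makes it a more severe test.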

“Another thing is that coming up with such predictions is not a statistical problem. The people looking for an answer there will not succeed.”

This is a very important point, I believe. Statistical models will be useful (probably necessary) to evaluate how closely a particular data set and theoretical prediction correspond to one another, but the quantitative theory is prior to any confirmatory statistical modeling, though it may be usefully (though probably only partially) the product of exploratory statistical modeling.

“All models are wrong; some models are useful.” George E.P. Box :-)

I don’t see the relevance of that quote right here, can you explain?

“Statistical models will be useful (probably necessary) to evaluate how closely a particular data set and theoretical prediction correspond to one another, but the quantitative theory is prior to any confirmatory statistical modeling, though it may be usefully (though probably only partially) the product of exploratory statistical modeling.”

Which findings have survived the Edlin factor? If you have examples of exemplary studies, where can we access them?

Sameera:

The Edlin factor is biggest for a one-time study. In our research we try to use internal replication where possible. You can see various examples on the published papers section of my website. (There are millions of examples from other researchers too; I just point to my own work first because that’s what I’m most familiar with.)

I’ll review them. Thank you much.

The Edlin factor link points to a myriad of fallacies. Do they all comprise the Edlin factor? Do we pick & choose among them? How does that work? I speculate that hardly anyone has the time & inclination to go through them.