Alison McCook from Retraction Watch interviewed Eric Loken and me regarding our recent article, “Measurement error and the replication crisis.” We talked about why traditional statistics are often counterproductive to research in the human sciences.

**Retraction Watch: Your article focuses on the “noise” that’s present in research studies. What is “noise” and how is it created during an experiment?**

Andrew Gelman: Noise is random error that interferes with our ability to observe a clear signal. It can take many forms, including sampling variability from small samples, unexplained error from unmeasured factors, or measurement error from poor instruments for the things you do want to measure. In everyday life we take measurement for granted – a pound of onions is a pound of onions. But in science, and maybe especially social science, we observe phenomena that vary from person to person, that are affected by multiple factors, and that aren’t transparent to measure (things like attitudes, dispositions, abilities). So our observations are much more variable.

Noise is all the variation that you don’t happen to be currently interested in. In psychology experiments, noise typically includes measurement error (for example, ask the same person the same question on two different days, and you can get two different answers, something that’s been well known in social science for many decades) and also variation among people.

**RW: In your article, you “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger.” What do you mean by that?**

AG: We blogged about the “What does not kill my statistical significance makes it stronger” fallacy here. As anyone who’s designed a study and gathered data can tell you, getting statistical significance is difficult. And we also know that noisy data and small sample sizes make statistical significance even harder to attain. So if you do get statistical significance under such inauspicious conditions, it’s tempting to think of this as even stronger evidence that you’ve found something real. This reasoning is erroneous, however. Statistically speaking, a statistically significant result obtained under highly noisy conditions is more likely to be an overestimate and can even be in the wrong direction. In short: a finding from a low-noise study can be informative, while a finding at the same significance level from a high-noise study is likely to be little more than . . . noise.
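To make that concrete, here is a minimal simulation (a sketch in Python; the effect size, the standard errors, and the |estimate| > 1.96 SE significance rule are illustrative assumptions, not numbers from the article):

```python
import random

def significant_estimates(true_effect, se, n_sims=100_000, seed=1):
    """Keep only the 'statistically significant' simulated estimates.

    Each estimate is the true effect plus Gaussian noise with standard
    error `se`; 'significant' means |estimate| > 1.96 * se.  Returns the
    exaggeration ratio (mean |significant estimate| / |true effect|) and
    the share of significant estimates with the wrong sign.
    """
    random.seed(seed)
    estimates = [true_effect + random.gauss(0, se) for _ in range(n_sims)]
    sig = [est for est in estimates if abs(est) > 1.96 * se]
    exaggeration = (sum(abs(e) for e in sig) / len(sig)) / abs(true_effect)
    wrong_sign = sum(e * true_effect < 0 for e in sig) / len(sig)
    return exaggeration, wrong_sign

# Low-noise study: true effect 2, standard error 1 (well powered).
print(significant_estimates(true_effect=2.0, se=1.0))

# High-noise study: same true effect, standard error 10 (badly underpowered).
# The significant results now exaggerate the effect many times over, and a
# substantial share of them even point in the wrong direction.
print(significant_estimates(true_effect=2.0, se=10.0))
```

In the high-noise case, conditioning on significance selects exactly the draws where the noise happened to be huge, which is why the surviving estimates are wild overestimates (this is the type M / type S error logic).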

**RW: Which fields of research are most affected by this assumption, and the influence of noise?**

AG: The human sciences feature lots of variation among people and the difficulty of accurate measurement. So psychology, education, and also much of political science, economics, and sociology can have big issues with variation and measurement error. Not always — social science also deals in aggregates — but when you get to individual data, it’s easy for researchers to be fooled by noise — especially when they’re coming to their data with a research agenda, with the goal of finding something statistically significant that can get published.

We’re not experts in medical research but, from what we’ve heard, noise is a problem there too. The workings of the human body might not differ so much from person to person, but when effects are small and measurement is variable, researchers have to be careful. Any example where the outcome is binary — life or death, or recovery from disease or not — will be tough, because yes/no data are inherently variable when there’s no in-between state to measure.

A recent example from the news was the PACE study of treatments for chronic fatigue syndrome: there’s been lots of controversy about outcome measurements, statistical significance, and specific choices made in data processing and data analysis — but at the fundamental level this is a difficult problem because measures of success are noisy and are connected only weakly to the treatments and to researchers’ understanding of the disease or condition.

**RW: How do your arguments fit into discussions of replications — ie, the ongoing struggle to address why it’s so difficult to replicate previous findings?**

AG: When a result comes from little more than noise mining, it’s not likely to show up in a preregistered replication. I support the idea of replication if for no other reason than the potential for replication can keep researchers honest. Consider the strategy employed by some researchers of twisting their data this way and that in order to find a “p less than .05” result which, when draped in a catchy theory, can get published in a top journal and then get publicized on NPR, Gladwell, Ted talks, etc. The threat of replication changes the cost-benefit on this research strategy. The short- and medium-term benefits (publication, publicity, jobs for students) are still there, but there’s now the medium-term risk that someone will try to replicate and fail. And the more publicity your study gets, the more likely someone will notice and try that replication. That’s what happened with “power pose.” And, long-term, enough failed replications and not too many people outside the National Academy of Sciences and your publisher’s publicity department are going to think what you’re doing is even science.

That said, in many cases we are loath to recommend pre-registered replication. This is for two reasons: First, some studies look like pure noise. What’s the point of replicating a study that is, for statistical reasons, dead on arrival? Better to just move on. Second, suppose someone is studying something for which there is an underlying effect, but his or her measurements are so noisy, or the underlying phenomenon is so variable, that it is essentially undetectable given the existing research design. In that case, we think the appropriate solution is not to run the replication, which is unlikely to produce anything interesting (even if the replication is a “success” in having a statistically significant result, that result itself is likely to be a non-replicable fluke). It’s also not a good idea to run an experiment with much larger sample size (yes, this will reduce variance but it won’t get rid of bias in research design, for example when data-gatherers or coders know what they are looking for). The best solution is to step back and rethink the study design with a focus on control of variation.

**RW: Anything else you’d like to add?**

AG: In many ways, we think traditional statistics, with its context-free focus on distributions and inferences and tests, has been counterproductive to research in the human sciences. Here’s the problem: A researcher does a small-N study with noisy measurements, in a setting with high variation. That’s not because the researcher’s a bad guy; there are good reasons for these choices: Small-N is faster, cheaper, and less of a burden on participants; noisy measurements are what happen if you take measurements on people and you’re not really really careful; and high variation is just the way things are for most outcomes of interest. So, the researcher does this study and, through careful analysis (what we might call p-hacking or the garden of forking paths), gets a statistically significant result. The natural attitude is then that noise was not such a problem; after all, the standard error was low enough that the observed result was detected. Thus, retroactively, the researcher decides that the study was just fine. Then, when it does not replicate, lots of scrambling and desperate explanations. But the problem — the original sin, as it were — was the high noise level. It turns out that the attainment of statistical significance cannot and should not be taken as retroactive evidence that a study’s design was efficient for research purposes. And that’s where the “What does not kill my statistical significance makes it stronger” fallacy comes back in.

**P.S.** Discussion in comments below reveals some ambiguity in the term “measurement error” so for convenience I’ll point to the Wikipedia definition which pretty much captures what Eric and I are talking about:

Observational error (or measurement error) is the difference between a measured value of a quantity and its true value. In statistics, an error is not a “mistake”. Variability is an inherent part of things being measured and of the measurement process. Measurement errors can be divided into two components: random error and systematic error. Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measures of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (as of observation or measurement) inherent in the system. Systematic error may also refer to an error having a nonzero mean, so that its effect is not reduced when observations are averaged.

In addition, many of the points we make regarding measurement error also apply to variation. Indeed, it’s not generally possible to draw a sharp line between variation and measurement error. This can be seen from the definition on Wikipedia: “Observational error (or measurement error) is the difference between a measured value of a quantity and its true value.” The demarcation between variation and measurement error depends on how the true value is defined. For example, suppose you define the true value as a person’s average daily consumption of fish over a period of a year. Then if you ask someone how much fish they ate yesterday, this could be a precise measurement of yesterday’s fish consumption but a noisy measure of their average daily consumption over the year.
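The fish example can be put in numbers. A toy simulation (all quantities made up for illustration) shows how an exact one-day measurement is still a noisy measure of the annual average:

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

random.seed(7)
n_people = 20_000

# Hypothetical numbers: true average daily fish consumption (grams/day)
# varies across people, and any single day's intake swings widely around
# that person's own average.
true_avg = [random.gauss(50, 20) for _ in range(n_people)]
one_day = [mu + random.gauss(0, 60) for mu in true_avg]

# The one-day report is an exact record of that day, yet only weakly
# correlated with the quantity of interest (the yearly average).
print(corr(one_day, true_avg))  # roughly 0.3 with these made-up numbers
```

Whether that day-to-day swing counts as “variation” or “measurement error” depends entirely on which quantity you declare to be the true value, which is the point of the paragraph above.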

**P.P.S.** Here is some R code related to our paper.

Great article (and interview!). By coincidence, I was reading an article this morning in which the author claimed that the effect he had found was probably a conservative estimate because sampling error was stealing some of his good signal (N = 90).

I have a request. The right-hand panel (“Effect of sample size”) of the article has a rather zoomed-out X axis. I would be really interested in seeing this with sample sizes ranging from 0 to, say, 200. At the moment it is hard to see whether, for example, the point at which half of all studies have inflated estimates due to error is with N=40 or N=100, because the slope is so vertiginous with the current axes. So either a zoomed-in chart, or a table of critical values, would be useful. (Maybe the code is somewhere, but I don’t see a link.)

Nick:

Could you share the quote from this article to which you refer?

Gahhh, I replied to the main thread and not your comment. See below.

“In both the path analyses and the hierarchical regression analyses, the observed data were used to test the hypotheses. Due to the inclusion of random error in the model, however, the results may underestimate the magnitude of the relationships—especially the influence of moderator variables (Busemeyer & Jones, 1983; Evans, 1985). Therefore, the analyses presented here should be considered conservative because this unreliability makes it more difficult to reject the null hypothesis.” (p. 1834)

http://onlinelibrary.wiley.com/doi/10.1111/j.1559-1816.1993.tb01068.x/abstract

Isn’t that just attenuation error? If measured regressors are the real quantity plus white noise uncorrelated with everything else, then estimates of their coefficients are downwards biased.

Researchers love selling measurement error as white noise because when they get a statistically significant coefficient, they get to say that it is likely a “lower bound” for the real effect due to the estimate being subject to attenuation bias
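The attenuation (regression dilution) story is easy to demonstrate in a few lines; this is a minimal sketch with made-up numbers, not anyone’s actual analysis:

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

random.seed(42)
n, true_beta = 100_000, 1.0
x_true = [random.gauss(0, 1) for _ in range(n)]
y = [true_beta * x + random.gauss(0, 1) for x in x_true]

# Observe x with classical ("white noise") measurement error of variance 1,
# uncorrelated with everything else.
x_obs = [x + random.gauss(0, 1) for x in x_true]

print(ols_slope(x_true, y))  # close to the true slope, 1.0
print(ols_slope(x_obs, y))   # attenuated toward 0.5 = var(x) / (var(x) + var(err))
```

This is the textbook case the commenter describes: under purely classical error the estimated coefficient shrinks by the reliability ratio. The trouble, as the rest of the thread argues, is that real measurement error is rarely this well behaved, and selection for significance can flip the “lower bound” intuition entirely.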

Anon:

Yup, that’s the issue. See here for an example of where economist James Heckman writes, “Evidence from the Abecedarian program provides a solid ‘lower-bound’ of what Educare’s results will probably be.”

Measurement error as white noise is the most silly thing people teach in econometrics, and students come out of the class thinking measurement errors are harmless.

Jack, when I took my first graduate-level econometrics class we were using Jeffrey Wooldridge’s book “Introductory Econometrics” which contains sections on measurement error (I think the book targets advanced undergraduates). No student having read it could come away thinking that measurement error is harmless. Wooldridge clearly argues that the effect of measurement error depends on (1) where the error is (in the dependent or in an independent variable) and (2) what type the error is (e.g., classical measurement error, aka white noise; or it could be of a different type).

I think it is important to consider the potentially non-uniform distribution of measurement error (ME) throughout population subgroups. This is especially important in situations where the distribution of effects in question is distributed in a highly non-uniform manner across subgroups as well.

For example, in air pollution studies one may fit a Bayesian spatial model which estimates exposure effects on health for a large variety of small regions (e.g. neighborhoods). The effects themselves are usually not distributed uniformly, as some subgroups may be more susceptible to the harmful effects of pollution than others, based on various lifestyle factors.

We can incorporate measurement error of exposures in the above scenario. However, due to irregular distribution of monitors used to estimate regional exposures, some regions may be associated with high measurement error, while others may be associated with lower error. Suppose that areas containing strong estimated exposure/health effects are associated with low ME, while regions with low exposure/health effects are associated with high ME. In this case, when estimating the overall region-wide exposure effect, the areas with strong effects will effectively be up-weighted, while regions with weaker effects will be down-weighted, potentially leading to a stronger overall, area-wide estimated exposure effect. Of course such estimations can go the other way as well.

The point is that the health effects of exposures are often non-uniformly distributed across a wide geographical area, as are the magnitudes of measurement errors. The interplay between these two distributions is important to keep in mind when considering the overall effect that measurement error has on statistical significance.

Nice example.

An investigator and I recently submitted results of a very small (pilot) study of the effects of a medication commonly used off-label to treat symptoms of a particular chronic illness. The raw data show a deleterious effect of the medication compared to placebo. A Bayesian hierarchical model of the multivariate responses gave a posterior probability of about 80% that the medication is worse than placebo, even with priors on the effects that support a beneficial effect of the drug.

Here’s the thing: The journal rejected the paper because the study was “underpowered”. However, the journal accepted the paper in reduced form as a letter. Why is that? The conclusions don’t change. It’s just that the small sample transforms the outcome from a “Research Article”-worthy result to only a “Letter”-worthy result, though the results are the same.

I had a study once that showed an adverse impact on one of several measures that we looked at in addition to the “official” outcome, specifically to understand descriptively whether there could be displacement/halo/diffusion/backfire impacts. A little digging around in other studies found hints of the same thing. We were actually forced by reviewers to take the mention of this as a suggestive and potentially important result out of the article because it was “not statistically significant.” Nowadays the whole thing would (I hope) be analyzed differently, but it worries me still.

While I agree that a lot of researchers will see it as a “threat”, this isn’t supposed to be the case. The threat is actually that no one will care enough about your work to fund/perform a replication (and thus you wasted your time).

A replication is more of an honor. It is a reward for doing a good job. If you think about it, replication = “more data”. So, it is actually data/evidence that is threatening to these people.

One of the things that my former advisor pointed out in a paper (What is Wrong with Psychology?) is that reliance on NHST frees the researcher from responsibility for irreproducible “effects.” The notion is that the person can always say “Hey…p was <.05…it ain’t my fault!”

As I have said elsewhere, single-subject designs should be used wherever possible. Where they are not applicable, the researcher has the responsibility of making the case for an effect* (or no effect, as the case may be) and the reader has the responsibility of judging the data (maybe after other analyses on the data).

*The researcher using SSDs must make a case as well but when one has several replications within- and between-subject, the task is easy. Not many people in, for example, what is now called behavior analysis, lose much sleep over whether an effect they claim is real will be replicated. Behavior analysis has little problem with failures to replicate – but, of course, it relies on actual experimental control of its subject matter, as opposed to statistical “control.” Again, though, I realize that some people are interested in questions for which SSDs are inapplicable.

Did not seem to work well in clinical research in the past – http://andrewgelman.com/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/

Perhaps today in the human sciences it will (clinical trials take many years to conduct and report).

Wow, Andrew. I’m so grateful that we have you! Thanks again for the enormous amount of time that you devote to this blog.

I think there must be a point about the distributional/modeling assumptions of the “noise”. Noise is what we don’t know about; it’s an informational issue; it’s the parts of the modeling we didn’t bother to model. I think one of the worst problems in current practice is that researchers think noise is a physical, objective phenomenon and that traditional classical statistical methods model it accurately by default, without careful thinking.

“I think one of the worst problems in current practice is that researchers think noise is a physical, objective phenomenon and that traditional classical statistical methods model it accurately by default, without careful thinking.”

+1

Do you have the code to replicate your simulation?

Yes, this would be great. Then I could make my own zoomed-in chart!

“Noise is all the variation that you don’t happen to be currently interested in.”

I wonder what would happen if we just replaced the phrase “error term” with “unobservable determinants, measurement error and model misspecification” when we taught statistical methods. Do you think that would change the way students think about regression models? Or would they just learn a stylized framework under which those components of the error term are orthogonal to the variable of interest, and then string together a series of platitudes posing as an argument that their particular situation fits that framework?

I really think that language matters here. Because when I see people wildly over-interpret regression coefficients, I think that they must be taking regression equations “literally” in some way that, if they actually thought about it, they would know is somewhere between disingenuous and insane. And then I react to that and tell my students “the error term is what makes you a person… it is the statistical representation of your individual humanity”. And then my students look at me like I’m crazy. And that can’t be good for my teaching evaluations (or, I guess, the future of social science).

Maybe replace “error term” with “unexplained variation term”? or

“uncertainty term”? (“Error” is indeed a really unfortunate choice of terminology here, since many people read it as “mistake”.)

But all the terms are uncertain, and much of the variation is unexplained, even if it is modeled.

How about “Other uncertainty?”

Good points. But I think “remaining uncertainty” would be better than “other uncertainty”.

I think I would enjoy taking one of your classes :)

I had brunch with jrc a couple weeks back and it was the best thing that happened all month.

That is probably the saddest thing I’ve ever heard. And I’ve done research on dead babies….

But we should do that again – as soon as I’ve done a little homework on second-earner labor supply. Spring break (woooooo!!!1!) ?

And how to hedge against the coming cashpocalypse.

First Observation: I think the cat replicates well.

If you zoom in, you’ll see the weight is different on every scale.

I clicked on all 4, and they all show 110.10, url is the same as well.

Oops, no, it’s 11.010

“In short: a finding from a low-noise study can be informative, while the finding at the same significance level from a high-noise study is likely to be little more than . . . noise.”

I would love to know how this relates to the Mayo-Spanos concept of severity, in particular the constantly appearing analogy that if an insensitive fire alarm goes off, it is a much stronger indication of a fire than when a highly sensitive one does. At first glance, this seems analogous to a noisy/non-noisy experiment.

I totally agree that the social sciences and social psychology in particular have a replication crisis. 25% successful replications in OSC reproducibility project. Ouch!!!

However, I have some problems with the Science article.

First, I do not see how random measurement error is a major contributor to the replication crisis in psychology. Bem’s infamous ESP experiments could not be replicated, but that was not a problem of random measurement error in his outcome measure (binary predictions of location of erotic pictures). Is there any evidence that a replication failure occurred because the original measures were unreliable?

Second, I had a hard time understanding (and failed) this sentence.

“It is a common mistake to take a t-ratio as a measure of strength of evidence” How is a t-ratio not a measure of strength of evidence against the null-hypothesis?

“…and conclude that just because an estimate is statistically significant, the signal-to-noise level is high” doesn’t this reverse the logic of significance testing, we say that a result is significant if the signal-to-noise ratio is high, we don’t say the signal-to-noise ratio is high because a result is significant.

It would be great if you could clarify this sentence.

To clarify my position: I am all for honest reporting of results, but I do think that t-values provide valuable information. In the OSC reproducibility project, t-values greater than 4 replicated with over 80% probability; t-values around 2 (just significant) replicated with 20% probability.

Ulrich:

1. A key problem with Bem’s ESP experiments was that his measurements were so noisy. ESP is kind of a weird example here because it’s not clear that any such phenomenon even exists—but, to the extent that ESP does exist, Bem’s measurements are indirect and super-noisy. When measurements are noisy, conclusions become much more sensitive to small biases (such as can arise, for example, from information leakage in an experiment) as well as to researcher degrees of freedom (which were out of control in Bem’s paper). Perhaps this would be clearer if, instead of thinking of Bem, you think about something like the ovulation-and-voting study, in which researchers used a very noisy measure of fecundity, which makes anything that comes of the study much less interpretable, for the reasons discussed in the paper by Loken and myself.

2. I do not think a t-ratio is a good measure of strength of evidence. Here’s the full sentence from our article: “It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high.” The problem is that the numerator of the t-ratio is itself just a noisy number and isn’t much good on its own as an estimate of the signal. Consider the notorious example of that study of beauty and sex ratio where the researcher observed a difference of 8 percentage points in sex ratio, comparing children of beautiful and less attractive parents. This difference had a standard error of 3 percentage points. The ratio is over 2, implying (under the usual rule followed by the Journal of Theoretical Biology and Psychological Science and PPNAS and zillions of other journals) that the evidence is strong. But that’s not correct at all. This 8 percentage point number supplied just about no evidence at all.
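One way to see how little evidence an estimate of 8 with standard error 3 carries is a quick conjugate normal-normal calculation. (The skeptical prior scale of 0.3 percentage points is an assumption for illustration, reflecting the idea that plausible sex-ratio effects are a small fraction of a percentage point; it is not a number from the article.)

```python
# Normal-normal conjugate update: combine a skeptical prior on the true
# sex-ratio difference (assumed scale: 0.3 percentage points) with the
# reported noisy estimate of 8 +/- 3 percentage points.
prior_mean, prior_sd = 0.0, 0.3   # assumed plausible range of true effects
est, se = 8.0, 3.0                # reported difference and standard error

post_prec = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + est / se**2) / post_prec
post_sd = post_prec ** -0.5

print(round(post_mean, 3), round(post_sd, 3))  # → 0.079 0.299
```

The posterior mean of about 0.08 percentage points, with posterior sd about 0.3, is essentially just the prior: under these assumptions the noisy 8 ± 3 estimate adds almost no information, which is the sense in which the t-ratio of over 2 supplied “just about no evidence at all.”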

Let’s switch from t to z-scores to make values for different sample sizes comparable.

Would you say that there is no meaningful difference between a z-score of 2 and a z-score of 4? These z-scores are significantly different from each other. Why would we not say that a study with a z-score of 4 provides stronger evidence for an effect than a study with a z-score of 2?

Ulrich:

Sure, fair enough. The z-score provides some information. I guess I’d just say it provides less information than people think. It all depends on the selection process. There’s selection for |z| > 2, so values of |z| near 2 tell us just about nothing. Values such as |z| = 4 are more informative because 4 is more than what is needed in the selection process.

Regarding 2, the Loken and Gelman article didn’t say that a t-ratio is not a measure of strength of evidence *against the null-hypothesis*. I suspect that what was meant was that a t-ratio is not a measure of the strength of evidence *for the specific alternative* (one’s favored hypothesis). At least in psycholinguistics, when the t-value is very high, people think they have found strong evidence *for the specific hypothesis*. They have only found evidence against the null. Also, even though one has found evidence against the null, nothing entails that the null is necessarily false. The conversation quickly shifts to the effect being “reliable” or “not due to chance”, which makes people overconfident. As Andrew puts it, on top of that you have the statistical significance filter: the fact that all kinds of monkeying around (done consciously or unconsciously) got you the high t-value makes it less believable. Recently I saw a published estimate that was 10 times larger than that of other comparable studies, and when I looked at the raw data I found that the author had removed more than half the items from the data-set after finding no effect. Once you remove half the data points, you get a gigantic effect, highly significant. The paper never mentions that half the items were removed. There’s lots of post-hoc reasoning like this that goes unreported in papers.

Frank Harrell also has a lot to say on this point.

Ulrich:

With regard to the first comment, it sounds like you said “Study X didn’t replicate and it wasn’t because of measurement error, so measurement error doesn’t relate to replication.” Our point was that sometimes Study Z finds a result, and then the researcher argues “But because we had to overcome all this extraneous error (or whatever flaw has been pointed out to them), the true effect must be even bigger.” We just show that the assumption of attenuation doesn’t automatically hold in small studies that have low power to begin with. The contribution to poor science is that there might be over-enthusiasm for the outcome of specious results (based on the flawed reasoning), and an unreasonable expectation of the true effect size and hence of what kind of study is needed to replicate. With non-trivial frequency, the “better” experiment would have produced a weaker, not stronger, effect than the one observed.

This only holds for selection for significance, otherwise the stronger effect size in the smaller sample would not be significant.

Also, if we don’t focus on the point estimate of the effect size (which is silly in small samples) and instead use the 95% CI around it, the lower bound of the 95% CI will be higher in the larger sample with the lower point estimate than in the smaller sample with the larger point estimate, because the sampling error in the smaller sample is larger.

I still don’t see how more random measurement error (low reliability) in a measure is supposed to make things worse in the smaller sample. Random measurement error will still attenuate the observed effect size. So a p-hacked study with a reliable measure should produce stronger effect sizes than a p-hacked study with an unreliable measure.

Ulrich:

1. You write, “this only holds for selection for significance.” But “selection for significance” is the norm in just about all scientific fields in which empirical results are presented based on statistical analysis. Papers in poli sci, econ, psychology, biology, medicine, etc. will present non-statistically-significant results, but almost always as secondary findings; the key result (whatever it is) of the paper is just about always statistically significant. That’s the world we live in.

2. I agree that focusing on the point estimate of the effect size is silly. But that’s what people do. It’s what’s done by top researchers in the top journals. When Gertler et al. came up with an estimated effect of early childhood intervention of 42% on earnings, with a standard error of something like 20%, they reported the result as 42% and statistically significant. Not as 2%. (Actually the lower bound of the interval can have problems too, but that’s another story.) Again, this is the world we live in.

3. You write, “a p-hacked study with a reliable measure should produce stronger effect sizes than a p-hacked study with an unreliable measure.” I’m not quite sure what you mean here, but I’ll assume that, by “produce stronger effect sizes,” you mean, “produce larger estimates of effect sizes.” If so, I disagree. Consider two studies of a phenomenon with underlying effect size of somewhere in the range (-1, 1). The first study has a reliable measure and gives standard error of 1. In that case the estimated effect, in order to be published, must be at least 2 in absolute value, thus an exaggeration factor (type M error) of at least 2. The second study has an unreliable measure and gives standard error of 10. In that case the estimated effect, in order to be published, must be at least 20 in absolute value, thus an exaggeration factor (type M error) of at least 20. The study with the unreliable measure will, on average, give estimates that are larger in absolute value.
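Point 3 can be checked with a small simulation of the selection process (a sketch using the same illustrative numbers as above: true effect uniform in (-1, 1), publication requiring |estimate| > 2 standard errors):

```python
import random

def published_magnitudes(se, n_sims=200_000, seed=3):
    """Mean |estimate| among 'publishable' results (|estimate| > 2*se),
    when the true effect is drawn uniformly in (-1, 1)."""
    random.seed(seed)
    kept = []
    for _ in range(n_sims):
        effect = random.uniform(-1, 1)       # true effect, small by construction
        estimate = effect + random.gauss(0, se)
        if abs(estimate) > 2 * se:           # the publication filter
            kept.append(abs(estimate))
    return sum(kept) / len(kept)

print(published_magnitudes(se=1))   # at least 2, vs. true effects below 1
print(published_magnitudes(se=10))  # at least 20: a 20-fold-plus exaggeration
```

With the unreliable measure (standard error 10), every published estimate exceeds 20 in absolute value even though no true effect exceeds 1, which is the exaggeration-factor (type M error) argument in numbers.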

I think the confusion can be resolved if we distinguish standardized effect sizes (Cohen’s d), which are the norm in psychology, from unstandardized effect sizes, which may be the norm in other social sciences.

Random measurement error will increase the standard deviation. If we compute standardized effect sizes, the 10 point difference with SD 20 is no larger than the 1 point difference with SD 2.

So, we could both be right, if you are talking about unstandardized effect sizes and I am talking about standardized effect sizes.

???

“How is a t-ratio not a measure of strength of evidence against the null-hypothesis?”

GS: Isn’t this essentially asking how a p-value is not “…a measure of strength of evidence against the null-hypothesis”? Am I missing something? It’s quite possible.

I ran my own simulation.

Scenario: the true correlation is r = .30 and the sample size is small (N = 30). Even with a perfectly reliable measure, the study has low power (power = 37%).

I simulated differences in reliability in increments from 1 down to .10.

The ‘true’ population correlation for the 10 measures is given by pop.r * sqrt(rel).

0.30 0.28 0.27 0.25 0.23 0.21 0.19 0.16 0.13 0.09

We see that observed correlations underestimate the population correlation. With the most unreliable measure, the correlation of .09 has only 8% power to produce a significant result.
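The attenuation formula and the power claims can be re-derived without a full simulation (a Python sketch using the Fisher-z normal approximation; this is my reconstruction, not the original code, so the power values are a point or so off the exact figures of 37% and 8%):

```python
import math

pop_r, n = 0.30, 30  # parameters from the comment

# Attenuated population correlations for reliabilities 1.0 down to 0.1.
rels = [round(1.0 - k / 10, 1) for k in range(10)]
attenuated = [pop_r * math.sqrt(rel) for rel in rels]
# rel = 0.1 gives 0.30 * 0.316, i.e. about .095, matching the .09 above

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sided(r, n, crit=1.96):
    """Approximate power for testing r = 0, via the Fisher z transform."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return (1 - phi(crit - z)) + phi(-crit - z)

power_full = power_two_sided(attenuated[0], n)   # rel = 1.0: roughly .36
power_low = power_two_sided(attenuated[-1], n)   # rel = 0.1: roughly .07
```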

Next I simulated actual studies with N = 30. I computed the average correlation (no Fisher transformation) and converted the p-value into a z-score as a measure of strength of evidence.

      rel  mean.z  mean.r
 [1,] 1.0   1.18    0.29
 [2,] 0.9   1.07    0.28
 [3,] 0.8   0.98    0.26
 [4,] 0.7   0.86    0.24
 [5,] 0.6   0.73    0.22
 [6,] 0.5   0.67    0.21
 [7,] 0.4   0.51    0.19
 [8,] 0.3   0.41    0.16
 [9,] 0.2   0.28    0.13
[10,] 0.1   0.11    0.09

As expected, the average observed effect size and the strength of evidence shrink with decreasing reliability.

This is of course just standard statistics.

Now we get to the new world of p-hacked statistics. The easiest way to simulate p-hacking is to select for significance.

I therefore also computed the average z-score and observed effect size for the subset of studies that were significant (recall, for reliability of .10 this is only 8% of all studies; p-hacking severely underpowered studies is hard).

      rel  mean.z  mean.r
 [1,] 1.0   2.58    0.50
 [2,] 0.9   2.55    0.49
 [3,] 0.8   2.52    0.49
 [4,] 0.7   2.50    0.49
 [5,] 0.6   2.49    0.49
 [6,] 0.5   2.51    0.48
 [7,] 0.4   2.43    0.48
 [8,] 0.3   2.42    0.46
 [9,] 0.2   2.41    0.46
[10,] 0.1   2.41    0.35

The results show that p-hacking inflates the strength of evidence for an effect and the observed effect size.

This is well known and has been the focus of discussions about replicability. Replication studies will not reproduce these inflated effect sizes and are likely to produce non-significant results (see OSC, Science, 2015, for evidence).

The new question raised in this Science article is whether unreliable, noisy MEASURES in combination with p-hacking produce even more inflated effect size estimates. The results presented here say that this is not the case. Even reliable measures benefit from inflation by selection for significance, and the greater benefit for unreliable measures does not compensate for the reduction in effect size estimates due to the noise in an unreliable measure. The most reliable measures show the most inflation.

So, I remain puzzled by the suggestion that using unreliable measures can help p-hackers to produce even more inflated effect sizes than they could obtain by p-hacking studies with reliable measures.
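For readers who want to check the selection numbers in the table above, here is a sketch of the same selection-for-significance exercise (in Python, not the original R; parameters as described in the comment: true r = .30, N = 30, select on |t| > 2):

```python
import math
import random

random.seed(2)

def corr(xs, ys):
    """Pearson correlation, computed by hand to keep the sketch self-contained."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_selected_r(rel, true_r=0.30, n=30, n_sims=20_000):
    """Mean observed r among samples where |t| > 2 (selection for significance)."""
    pop_r = true_r * math.sqrt(rel)  # attenuated population correlation
    resid_sd = math.sqrt(1 - pop_r ** 2)
    selected = []
    for _ in range(n_sims):
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [pop_r * x + random.gauss(0, resid_sd) for x in xs]
        r = corr(xs, ys)
        t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
        if abs(t) > 2:
            selected.append(r)
    return sum(selected) / len(selected)

sel_reliable = mean_selected_r(rel=1.0)    # near .50, as in the table
sel_unreliable = mean_selected_r(rel=0.1)  # near .35, as in the table
```

Both selected averages are inflated relative to their population values, but the reliable measure still yields the larger selected estimate, which is the pattern being argued here.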

set more off
tempfile testdata

*Run 100 experiments
forvalues i = 1/100 {

*Each experiment has 100 observations (50 each group)
clear
set obs 100

*Generating random treatment variable
gen rand = rnormal()
sort rand
gen ob = _n
gen T = ob<51

*[Note: the outcome-generation lines and the Y1 regression were evidently eaten
*by the comment software (everything between "ob<" and "tstat1>" was dropped,
*leaving "gen T = ob2"). The following is a plausible reconstruction, consistent
*with the results reported below, in which Y2 is a noisier version of the same
*outcome.]
gen Y1 = 0.3*T + rnormal(0,1)
gen Y2 = 0.3*T + rnormal(0,2)

*Estimates for the first outcome
reg Y1 T
gen tstat1 = _b[T]/_se[T]
gen Beta1 = _b[T]
gen sig1 = tstat1>2

*Estimates for the second outcome
reg Y2 T
gen tstat2 = _b[T]/_se[T]
gen Beta2 = _b[T]
gen sig2 = tstat2>2

*Saving each fake experiment
contract Beta* tstat* sig*
cap append using `testdata'
save `testdata', replace
}

*Showing results

sum Beta1 Beta2

sum Beta1 if sig1==1

*This shows 30 significant experiments, with a mean treatment effect estimate of 0.5 (conditional on being significant)

sum Beta2 if sig2==1

*This shows 15 significant experiments, with a mean treatment effect estimate of 0.94 (conditional on being significant)

Ulrich:

You write that you “remain puzzled by the suggestion that using unreliable measures can help p-hackers to produce even more inflated effect sizes than they could obtain by p-hacking studies with reliable measures.”

Here’s an example from my earlier comment; no simulation required: Consider two studies of a phenomenon with underlying effect size of somewhere in the range (-1, 1). The first study has a reliable measure and gives standard error of 1. In that case the estimated effect, in order to be published, must be at least 2 in absolute value, thus an exaggeration factor (type M error) of at least 2. The second study has an unreliable measure and gives standard error of 10. In that case the estimated effect, in order to be published, must be at least 20 in absolute value, thus an exaggeration factor (type M error) of at least 20. The study with the unreliable measure will, on average, give estimates that are larger in absolute value.

The point is not that researchers are trying to produce inflated estimates of effect sizes; it’s that if the standard error is large, then the published estimate (which won’t be published unless it’s statistically significant) is necessarily large in absolute value. Hence, if a researcher such as Heckman or whoever runs a noisy study and produces an estimate that is large and statistically significant, it is a mistake for him to think that, had the study been done with more precision, the estimate would’ve been larger. It is a mistake to think that published estimates under noisy conditions are underestimates. I hope this example removes your puzzlement.

This is likewise vividly illustrated by the “small schools” fallacy, right? By which I mean: if you analyze schools that have the highest average test scores, you will find a lot of small schools. But that may just be because the variance of the mean over a small sample is greater than that of a larger one. One might also find a lot of small schools among the worst performers, and no difference in the overall means. A philanthropist would thus be ill advised to spend a couple billion dollars on breaking up larger schools into smaller ones. But it is too late to close the gates on that horse.
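The small-schools effect is easy to reproduce (a Python sketch; the school counts, sizes, and the N(100, 15) score distribution are made-up illustrative values): with every student drawn from the same distribution, small schools dominate both the top and the bottom of the league table purely through sampling variability.

```python
import random
import statistics

random.seed(3)

# 500 small schools (20 students) and 500 large schools (500 students),
# every student drawn from the same N(100, 15) score distribution.
schools = []
for size in [20] * 500 + [500] * 500:
    mean_score = statistics.mean(random.gauss(100, 15) for _ in range(size))
    schools.append((mean_score, size))

top50 = sorted(schools, reverse=True)[:50]
bottom50 = sorted(schools)[:50]
small_in_top = sum(1 for _, size in top50 if size == 20)
small_in_bottom = sum(1 for _, size in bottom50 if size == 20)
# Both extremes are dominated by small schools, though no school is truly better.
```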

This is a different filter (“top-n” instead of “p &lt; .05”), but mutatis mutandis etc. QED.

“Measurement error and the replication crisis: The assumption that measurement error always reduces effect sizes is false”

Can you define measurement error for me? I don’t see any reference to the reliability of a measure in your example.

Are we really talking about measurement error or is measurement error in political science different from measurement error in psychological science?

Ulrich:

Wikipedia has a pretty good definition and discussion here: https://en.wikipedia.org/wiki/Observational_error

In addition, many of the points we make regarding measurement error also apply to variation. Indeed, it’s not generally possible to draw a sharp line between variation and measurement error. This can be seen from the definition on Wikipedia: “Observational error (or measurement error) is the difference between a measured value of a quantity and its true value.” The demarcation between variation and measurement error depends on how the true value is defined. For example, suppose you define the true value as a person’s average daily consumption of fish over a period of a year. Then if you ask someone how much fish they ate yesterday, this could be a precise measurement of yesterday’s fish consumption but a noisy measure of their average daily consumption over the year.

Finally, I don’t know if this is on purpose—it could be a matter of writing style—but from my standpoint, your comments on this thread are looking pretty aggressive and confrontational. You can take whatever tack you like, but I recommend that you start from the presumption that Eric Loken and I know what we’re doing! That might make your exploration of these areas more fruitful.

Hi Andrew,

It is ironic to appeal to expertise in a conversation about the replicability crisis.

Clearly some of the experts in my science, psychology, made some big mistakes that led to a big mess, which you have frequently commented on.

My repeated questions may appear aggressive, but I think there is a problem and you are not addressing it.

I posted results that show how reliability influences observed effect sizes, and that even with selection for significance in small samples, unreliable measures do not produce more inflated estimates than reliable measures. I haven’t seen you comment on this result. If you are doing everything right, what am I doing wrong?

I don’t find the Wikipedia entry very useful. In my science, psychology, measurement error is well defined and is routinely estimated with Cronbach’s alpha or short-term retest correlations. In psychology, everybody uses the term measurement error for the variance in a measure that is unreliable.

I am simply asking whether you are using the term in the same sense or whether you have a different definition of measurement error. Maybe political science uses different terminology.

Finally, I would like to point out that measurement error is not a serious problem in psychology. Many failed replication studies used reliable outcome measures. So, even if your simulations do show that unreliable measures produce inflated effect sizes, this is not a key factor in the replicability crisis in psychology.

Sorry if my disagreement is perceived as aggressive, but I like to get to the bottom of problems, and I do have a problem with the suggestion that measurement error has much to do with the replication crisis in psychology. As I show in my simulations, measurement error attenuates effect sizes and makes it harder to get significance. P-hacking (selecting for significance) inflates observed effect sizes for reliable and unreliable measures alike. I just don’t see that unreliable measures produce more inflation than reliable ones.

Ulrich:

I’m not appealing to expertise. As I wrote, you can take whatever tack you like. I just think it will be more of a learning experience for you if you start from the presumption that Eric and I know what we are doing. This is just my advice to you.

In answer to your questions:

1. I don’t know what you are doing wrong, as I have not put in the effort to understand your simple code. I have demonstrated the problem with a simple example that requires no simulation.

2. Psychology is a diverse science. It is possible that in your subfield of psychology, measurement error is not a serious problem. But the field of psychology also includes just about everything published in the journal Psychological Science, where I’ve seen more than one article about fertility and behavior in which fertility was measured with a lot of error, and in which the behavioral outcomes had a lot of variability (which, as I discussed in my comment above, can create the same problems as measurement error regarding biases in estimation).

3. Regarding your very last sentence, I recommend you reread the example I gave in my above comment.

I think I get the problem. Here it is in data generating language.

Y = Observed_X + epsilon

Observed_X is “real X” plus noise – this is what Ulrich means by measurement error

Epsilon is “noise” – this is what Andrew means by measurement error

I think that reconciles your results. You could adapt my code above (Ulrich never provided code) to make that mimic his results. In my version above, you increase the variance of the error term. In his version, you increase the variance in the deviation of “observed X” and “real X” – so instead of changing epsilon, you regress on an “observed” value of X that is the real X plus some “noise” (and you generate Y using the real X).

Yeah?
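The distinction between the two data-generating processes can be shown side by side (a Python sketch; the slope of 0.5 and the noise SD of 3 are my illustrative choices): noise added to the outcome leaves the OLS slope unbiased, while noise added to the regressor attenuates it by the factor var(x) / (var(x) + var(noise)).

```python
import random
import statistics

random.seed(4)
n = 100_000
beta = 0.5  # hypothetical true slope

def slope(xs, ys):
    """OLS slope of y on x, computed from sums of cross-products."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 1) for xi in x]

# Noise in the outcome (the "epsilon" version): the slope stays near beta.
y_noisy = [yi + random.gauss(0, 3) for yi in y]
b_y_noise = slope(x, y_noisy)

# Noise in the regressor (the "Observed_X" version): the slope attenuates by
# var(x) / (var(x) + var(noise)) = 1 / (1 + 9) = 0.1, so about 0.05 here.
x_obs = [xi + random.gauss(0, 3) for xi in x]
b_x_noise = slope(x_obs, y)
```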

Jrc:

That must be part of it. I will just elaborate to say that one can have measurement error (or variation, which, as discussed above, is not clearly distinct from measurement error) in both x and y. I agree that we did not make this clear in our paper (I blame space limitations!).

Also I think it’s odd that Ulrich said that measurement error is not a serious problem in psychology. But I guess things vary by subfield, and as a statistician I’m particularly aware of areas where measurement error and variation are important.

I don’t get the “Space Limitations” excuse. As it is, most times I read a paper I come out feeling it was too long rather than too short.

Even if we gave authors more space, readers have limited time; just ’cos an author writes more doesn’t mean a reader will read more. The trick is going to be better, more concise writing, not making papers even longer than they currently are.

In any case, the signal to noise ratio of an academic paper isn’t very high. I’m sure we could do a lot more to write more pithy, compact articles.

Getting rid of the constraints of word limits would be a move in the absolutely wrong direction.

Rahul:

My “space limitations” comment was a bit of a joke. But in this particular article we really did have to cut stuff to stay within the length restrictions.

P.S. As someone who writes 400+ blog posts a year, I would prefer not to be restricted in my words.

@Andrew:

I think blogs are a great medium. Yours is one of the best.

But blogs are such a good medium, not *because* of the lack of space constraint, but in spite of it. Prolific blogging is different from long, lengthy blog posts.

The true advantages of blogs lie in other attributes, e.g. the responsiveness, interactivity, high-quality commenting, anonymity, easy hyperlinking etc. But I doubt I’d put “lack of word limit” on this list.

In fact, some of the best blog posts I recall were very short ones. In any case, the average blog post is much shorter than an average academic paper.

Wait… measurement error means measurement error on the covariates… there’s no other meaningful interpretation, because we always have the error term for the Y regression. And indeed this is the context for Heckman’s comments, if you’re using a different meaning this doesn’t make sense and I have never seen anyone using this term in a different way.

In equations, it means:

X* = X + ex

Y = X* + ey

where X* is X measured with error “ex”.

Now, it’s not clear what your article really posits. If you’re saying that measurement error can help in getting even more exaggeration, this is clearly untrue, as Ulrich says. If you’re saying that measurement error is not enough to claim that your effect is underestimated, then this is clearly true. But these are two different things.

Andrew, if you could share your code it would be easier for us to understand exactly what you mean.

@Andrew

I’ve no dog in this fight but it was ironic reading you write this:

>>>your comments on this thread are looking pretty aggressive and confrontational<<<

These were just comments, but don't you think that sometimes entire blog posts you write may come across like that to the researchers you criticize?

I'm not at all saying you shouldn't. Merely that I found the transposition of roles funny.

John:

1. Heckman’s quote is “The fact that samples are small works against finding any effects for the programs, much less the statistically significant and substantial effects that have been found.” Sample size is relevant for problems with measurement error in x or in y.

2. You write, “measurement error means measurement error on the covariates… there’s no other meaningful interpretation . . . if you’re using a different meaning this doesn’t make sense and I have never seen anyone using this term in a different way.” All I can say is, you may have never seen this, but your experience is hardly universal. I refer you to the definition in the Wikipedia entry. I agree that the term “measurement-error models” in regression typically refers to examples with measurement error (or variation) in both x and y, but “measurement error” is a more general expression that does not require regression at all.

Rahul:

Confrontation can make sense; it depends on the context. My remark on this to Ulrich was advice along the lines of: you might want to bring an umbrella if you’re taking a walk in Seattle in the winter. I appreciate Ulrich’s comments and he could well have a valuable point to make; I just thought it would serve him well to start from the position that Eric and I know what we’re doing, and then go from there. This is just a piece of prior information that could make his inferences more useful.

Andrew,

Space limitations in the article are fine. Here we have unlimited space to resolve an issue. Maybe you need to consider the possibility that I also know what I am talking about. Just as frequentists and Bayesians are different approaches, neither one right or wrong, we may both have some insights about replicability. I don’t write 400 blog posts a year, but I have written a few that might be worth your while.

https://replicationindex.wordpress.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/comment-page-1/

Psychologists are not trained psychometricians, but most learn about internal consistency and think about random measurement error. Look at the method section of a psych article, and you will typically find some information about the reliability of a measure, and typically this estimate is greater than .70. There you have it: typically less than 30% of the variance is measurement error.

I can only repeat that you may have a different definition of measurement error, but in psychology only the unshared variance among items or retests is considered measurement error.

I would say this is a broadly shared meaning of measurement error in psychology.

“Psychologists are not trained psychometricians, but most learn about internal consistency and think about random measurement error.”

I was going to write a longer response to your set of comments, but I think this statement of yours sums up any difference of opinion I might have with you, not that any others would agree. What you propose as a feature, I see as the source of the problem. Stats and modeling is hard. Whether you call it “math psych” or “psychometrics” or anything else, I think it’s unfortunate that understanding the mathematical tools and theory that can provide solid inferential evidence for psychological effects has been sidelined to a rare specialty, that most psych departments don’t even offer. This is even more disappointing to me considering the pioneering work that psychology played in the development of statistical methods, precisely because measurement _is_ such a difficult problem in understanding psychological phenomena.

Ulrich:

Your view of psychology research seems to be more positive than mine. I suspect that’s because we’re looking at different subfields within psychology. Regarding your point above, I will emphasize that error includes both bias and variance; a measure can have 100% reliability and still have a big error. For example, suppose (as has happened) a researcher counts days since last menstrual period and declares a woman at peak fertility during days 6-14. This estimate is biased (as peak fertility occurs, on average, in days 10-17) and also variable (because peak fertility varies a lot among women). So even if the survey question is 100% reliable (in that the woman, if asked again, would give the same answer for date of last menstrual period), it has huge measurement error. And this has big consequences for inferences, because effect estimates based on this measure become super noisy. Eric and I discussed this in our earlier paper, and the researchers in this subfield really don’t seem to realize the problem. They seem to think that it’s OK to use a biased, noisy measure, as long as they come up with something statistically significant at the end of the day. And this connects us to the “‘What does not kill my statistical significance makes me stronger’ fallacy,” the mistaken attitude that this measurement error is not a problem (or even is a sign of strength) conditional on statistical significance.

Regarding definitions: I think the concept of measurement error is crucial to statistics as it is used in many aspects of science, and I think my coauthor feels the same way. He has a background in psychometrics, and I’ve published widely in the field. That said, the audience for our Science paper was not psychometricians or even psychologists; it was scientists and statisticians in general.

In writing the paper, we had not fully thought through the different meanings that people hold for the term “measurement error,” and in retrospect I wish we’d clarified this in the paper. As is often the case, the concepts that we think about the most can be the most difficult to define—it was so clear to us what we meant by “measurement error” that we didn’t reflect that various fields give the term different, more specialized meanings. Again, I like the Wikipedia definition: “Observational error (or measurement error) is the difference between a measured value of a quantity and its true value. In statistics, an error is not a ‘mistake’. Variability is an inherent part of things being measured and of the measurement process. Measurement errors can be divided into two components: random error and systematic error. Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measures of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (as of observation or measurement) inherent in the system. Systematic error may also refer to an error having a nonzero mean, so that its effect is not reduced when observations are averaged.” Even if it had not been possible for us to fully state and explain this definition in our paper, a reference (along with an acknowledgment of more specialized definitions of the term used in various fields) would’ve helped. (More detail in examples could’ve helped too, but here the space limitation really was a constraint.)

Anyway, it’s been good to have an opportunity to have clarified this in the comments, and I appreciate everyone’s patience in the discussion.

jrc says:

February 15, 2017 at 10:00 pm

I think I get the problem. Here it is in data generating language.

Y = Observed_X + epsilon

Observed_X is “real X” plus noise – this is what Ulrich means by measurement error

### my code

y = true.cor*x + rnorm(N*n.sim)*sqrt(1-true.cor^2)

x is a standard normal; y is a function of x with regression coefficient true.cor and residual variance 1-true.cor^2, so this code creates an outcome variable with variance 1.

### compute reliabilities (1.0 down to 0.1)

rel = 1.1 - seq(0:9)/10

### this is the amount of error variance needed to get reliabilities of 1 to .1 (oy = y + e)

e.var = 1/rel - 1

### this creates the errors (the line as originally posted, e = e*sqrt(e.var), had a typo)

e = rnorm(N*n.sim)*sqrt(e.var)

### compute observed scores

oy = y + e

So the observed scores have three variance components: (a) variance explained by x, (b) residual variance due to other factors, and (c) random measurement error.

Forgive me, I’m bad at R code. I know, I should be ashamed.

Which is your (b) and which is your (c)? And is the variance of “oy” always 1, or the variance of “y”?

How about structural equation modeling?

the model has x as an observed variable and oy as an observed variable.

The latent variable is y.

There is a causal effect (path) from x to y.

The model has two latent variables.

The residual variance of the latent variable Y.

And the residual variance in the observed variable OY.

Reliability is the proportion of y-variance in oy: Rel = Var(y) / Var(oy) = Var(y) / (Var(y) + Var(e)), with e = random measurement error.

Hi Andrew,

Thanks for your response. I think we are making progress in reaching some understanding of the definitions and of each other’s perspectives.

Also, people have been discussing this in the Psychological Methods Discussion Group, which also helped me to clarify some confusion.

https://www.facebook.com/groups/853552931365745/permalink/1278445418876492/

One issue that is becoming clearer is that we need to define what we mean by effect size.

Unstandardized effect sizes like the mean difference between two groups or an unstandardized regression coefficient in a multi-level model are not systematically affected by random measurement error. If the same study is repeated many times, the average effect size estimate is the same with or without measurement error.

Standardized effect sizes relate the unstandardized effect size to the observed variability: Cohen’s d = Mean.Diff / SD. As random measurement error increases variability, standardized effect sizes decrease with increasing random measurement error.

In small samples, effect size estimates are more variable. This means it is more likely to get extreme effect sizes in small samples. They are less likely to be significant, but when they are significant, they are inflated. This is your main point, but there is no downward bias due to measurement error: we are starting with the same effect size as without measurement error. We might say that measurement error has an effect because the increased variability makes it harder to reach significance, so significant results need more inflation to become significant and be reported. But the reason is not that random error attenuates the effect size; the reason is that it inflates the residual variance.

Standardized effect sizes are affected by measurement error and decrease as a function of measurement error. In small samples with lots of sampling error, there is more variability around the attenuated population observed effect size. But now we have two effects on the effect size estimates that become significant: random measurement error makes them lower, and selection for significance inflates them. My simulation shows that the selection for significance does not fully compensate for the attenuation due to random measurement error, and that the most inflated estimates are obtained without measurement error.

Does this make sense to you?

FYI

http://jmbh.github.io/Deconstructing-ME/

Ulrich:

I followed the link. The author writes, “the lack of clarity does potentially also do quite some harm by confusing the reader about important concepts.” I hope the P.S. added to the above post helps clarify. Eric and I were thinking of measurement error in the general sense, and it does seem that there was some confusion because we did not explicitly define the term.

It would be good if it were general practice for every paper to come with its own blog post so that these sorts of issues could be resolved to the satisfaction of the readers. I’ve published hundreds of papers, but it’s still so hard to anticipate ahead of time what communication problems might arise.

The blog post also linked to your code.

I noticed that you simulated measurement error in X and Y.

This makes sense for a correlational study, and measurement error in X will attenuate the unstandardized regression coefficient.

However, in experiments, measurement error in X is 0. In this case, the unstandardized regression coefficient is not attenuated by the introduction of measurement error in Y. You can make the variance in Y 1 or 100.

> N1 = 50000

> x = rnorm(N1) # setup implied in the original comment, with r = .15 as before

> y1 = r*x + rnorm(N1,0,1)

> y2 = r*x + rnorm(N1,0,100)

> summary(lm(y1 ~ x))$coefficients[2,1]

[1] 0.1503872

> summary(lm(y2 ~ x))$coefficients[2,1] # the original comment pasted the y1 call twice; presumably y2 was meant

So, the first premise of your claim, that measurement error/noise leads to an underestimation of effect sizes in large samples, is incorrect when we use unstandardized effect sizes as effect size estimates.

Hi Andrew and Uli:

From reading that blog, it really just appears as though they stated the same information differently and with different figures. Their “smoking gun” of the scatter plots on the same scale as the others also clearly shows the same information, but is not as visually appealing or as clear. It also appeared in the blog that N not being fixed was some oversight, when in fact the figures in the paper make this very clear. It should also be noted that Science articles often have limited space to describe everything, whereas the blogger had no such constraint.

I annotated your code and made the following changes.

I set error variance for x to 0 (simulating experimental studies with random assignment to groups).

I set the error variance for y to 10 (an unrealistically large value, but it helps to see the pattern more clearly).

### Number of simulations

n.sim = 1000

# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15

### this is the population effect size

r <- .15

### this is the sample size (N = 50)

N = 50

### this stores the results

sims <- array(0,c(n.sim,4))

### this is error in the predictor variable (x)

xerror <- 0

### this is error in the criteiron variable (y)

yerror <- 10

### this is a loop that runs the simulations

for (i in 1:n.sim) {

### this creates a standard normal predictor variable with N cases

x <- rnorm(N,0,1)

### this creates the criterion variable with residual variance of 1

### importantly this means the variance in y = var(x)*r^2 + 1 = .15^2 + 1 = 1.0225

### the standardized effect size is .15 / sqrt(1.0225) ~ .15

y <- r*x + rnorm(N,0,1)

### xx is the unstandardized effect size estimate in a sample with N observations

xx <-lm(y~x)

### the regression coefficient is stored

sims[i,1]<-summary(xx)$coefficients[2,1]

### this line of code creates a predictor variable (x) with measurement error

x <- x + rnorm(N,0,xerror)

### this line of code creates a criterion variable (y) with measurement error

y <- y + rnorm(N,0,yerror)

### xx is the unstandardized effect size estimate in a sample with N observations

xx <- lm(y~x)

### the regression coefficient is stored

sims[i,2] <- summary(xx)$coefficients[2,1]

####

#### repeat everything with N = 3000

N = 3000

x <- rnorm(N,0,1)

y <- r*x + rnorm(N,0,1)

xx <-lm(y~x)

sims[i,3] <- summary(xx)$coefficients[2,1]

x <- x + rnorm(N,0,xerror)

y <- y + rnorm(N,0,yerror)

xx <- lm(y~x)

sims[i,4] <- summary(xx)$coefficients[2,1]

} ### End of Loop for simulations

colnames(sims) = c("N=50,ME=N","N=50,ME=Y","N=3000,ME=N","N=3000,ME=Y")

summary(sims)

Here are the results:

N=50,ME=N N=50,ME=Y N=3000,ME=N N=3000,ME=Y

Min. :0.0937 Min. :-0.48535 Min. :0.08544 Min. :-0.50082

1st Qu.:0.1380 1st Qu.: 0.03157 1st Qu.:0.13649 1st Qu.: 0.03192

Median :0.1504 Median : 0.14968 Median :0.15049 Median : 0.14810

Mean :0.1502 Mean : 0.15057 Mean :0.14972 Mean : 0.15256

3rd Qu.:0.1626 3rd Qu.: 0.27042 3rd Qu.:0.16174 3rd Qu.: 0.26372

Max. :0.2039 Max. : 1.24979 Max. :0.21481 Max. : 0.72539

1. We see that there is no systematic bias in the effect size estimate (mean = .15 in all four simulations).

2. We see that measurement error increases the variance in estimates for small and large samples.

3. We see the largest variability for small samples with measurement error.

If we select for significance, we get the largest unstandardized effect size from a small sample with lots of measurement error.

This is solely a function of increased variability.

There is no systematic underestimation due to measurement error in large samples that is reversed in small samples with selection for significance.

The source of the variance in Y is irrelevant. Only the amount of variability matters. We can also change the residual variance in your simulation (fixed at 1) and keep the error variance constant. We would get the same result.

https://www.facebook.com/photo.php?fbid=10154412781321687&set=p.10154412781321687&type=3&theater

Picture of the results for cleaner presentation.

Today was an exciting day of simulation.

Just to clarify for anyone who is dubious about Gelman and others’ statements here (though to me, they should be obvious):

https://www.facebook.com/photo.php?fbid=10211718000611706&set=p.10211718000611706&type=3&theater

https://www.facebook.com/photo.php?fbid=10211718211576980&set=p.10211718211576980&type=3&theater

https://www.facebook.com/photo.php?fbid=10211718076613606&set=p.10211718076613606&type=3&theater

d = “effect size if measurement were perfect”

d.actual = the effect size in the population of studies that use your measure.

d.select = the d statistic of those studies that were significant

alpha = cronbach’s alpha; .999 is treated as a perfect measure for simulation to work

There are two things working here.

1) The expected value of the effect size in a population of studies that use the crappy measure is lower than the expected value of the effect size in a population of studies free of measurement error. This is where the bias likely comes from. It /is/ true that bad measures underestimate error-free effect sizes. If you had 10000 studies with no measurement error, their effect sizes would be larger than the 10000 studies with measurement error.

2) That said, that point is irrelevant to the replication crisis. The issue is that in small samples, especially, with bad measures, especially, the effect sizes MUST be large by chance to detect something significant. In the population of studies using your crappy measure, the effect size is lower (due to measurement error). But if someone replicates your study using your measure (basically, drawing another study from the population of studies with the measure), they will probably not replicate it. Even if they do a power analysis with your [overestimated] effect size, their target is incorrect.

More concretely:

Without any measurement error, say groups A and B are different with an effect size of d = .4.

With measurement error, the expected value of the effect size is d = .25.

You obtain an estimate of d = .35. It’s significant, by chance, because it’s an overestimate.

That is, d = .35 is an overestimate of the expected effect among studies that use your measure, but an underestimate of the ‘true’ effect if there were no measurement error; this latter point, again, is irrelevant to replication.

Someone sees that d=.35, tries to replicate your study and obtains d = .25; non significant. They may have even conducted a power analysis, aiming for .80 power assuming the true effect is .35, but of course, this is an overestimate, so their true power was less.

They actually obtain the asymptotically correct answer, and fail to replicate the finding.

That is all this article is saying, I think. More measurement error -> you need a high N, or an overestimate, to detect the effect. This results in replication failures. This is the same logic for why underpowered studies with significant effects necessarily overestimate the effect; in fact, bad measures only make studies more underpowered, so it’s not even the same logic, it’s quite literally the same thing. Moreover, I realized when doing this simulation that people do power analyses based on “true” measurement-error-free effects; they could probably guess what the ‘true’ effect is, then adjust it downward to be the expected value of the effect given a crappy measure — the power analysis would be more “accurate”.
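That last suggestion — adjusting the assumed effect downward for unreliability before running the power analysis — can be made concrete. A sketch in Python (the sqrt-reliability attenuation of a standardized effect and the normal-approximation power formula are textbook results; d = .4 and the reliability values are illustrative assumptions):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sample(d, n_per_group, z_crit=1.959964):
    """Approximate power of a two-sided two-sample z test for standardized effect d."""
    se = math.sqrt(2.0 / n_per_group)
    return phi(d / se - z_crit) + phi(-d / se - z_crit)

def n_for_power(d, target=0.80):
    """Smallest per-group n reaching the target power for effect d."""
    n = 2
    while power_two_sample(d, n) < target:
        n += 1
    return n

d_true = 0.40  # hypothetical error-free effect
for rel in (1.0, 0.8, 0.6):
    d_att = d_true * math.sqrt(rel)  # attenuation: observed d = d * sqrt(reliability)
    print(f"reliability {rel}: attenuated d = {d_att:.3f}, "
          f"n per group for 80% power = {n_for_power(d_att)}")
```

Powering for the error-free d = .4 (about 99 per group) when the measure's reliability is only .6 leaves the study well short of 80% power for the attenuated effect it can actually detect.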

Stephen:

Yes, the statistical point is not subtle. We wrote the article because we see the “What does not kill my statistical significance makes it stronger” fallacy all the time, including from researchers with some statistical sophistication. Also we see lots of textbook examples and applied research projects that do not take measurement seriously, I think because people have the naive view that, if they’ve successfully attained statistical significance, measurement isn’t something they really need to worry about. Although the math is simple, the conceptual error is widespread.

Ha Stephen, I am writing a paper on this (it is just one section in the paper) – power analysis taking reliability of the measure into account. If you are interested, send me an e-mail.

I agree with you it is essentially this; overestimation of effect size increases with publication bias, and decreases with effect size and N. Measurement error decreases effect size, hence overestimation increases (relative to the expected effect size with measurement error).

The devil is in the details.

Your Figure 3 shows that in most cases the error-free measure is more likely to produce stronger evidence (you use b/se = t for the comparison here) in your scenario of r = .15 and error SD = .5.

You show that for small sample sizes this flips, and suddenly in more than 50% of cases the measure with error variance shows stronger evidence for an effect.

Ironically, your analysis is biased by selecting for significance. Here is your code:

First you select only those simulations that produced a significant result for the regression with measurement error.

temp <- sims[abs(sims[,3]/sims[,4]) > 2,]

Second, you compare the evidence for these selected cases to the evidence for the ‘matching’ regression without measurement error.

propor[j] <- table((abs(temp[,3]/temp[,4]) > abs(temp[,1]/temp[,2])))[2]/length(temp[,1])

However, this comparison is biased by selecting for significance on the regression with measurement error.

We can reverse the analysis and first select those cases in which the regression without measurement error is significant.

Then we see which of the two matched regressions produces stronger evidence.

Now the regression without measurement error produces proportions over 50%. More important, the proportion is greater than the one for the selection based on the regression with measurement error.

I think there is a fundamental problem with selection for significance and a comparison of proportions.

A cleaner way to compare the methods would be to compute the average strength of evidence for cases that produced significance.

We can then compare the average (mean or median) t-value and see whether random measurement error after selection for significance produces stronger evidence against the null-hypothesis than a measure without random measurement error after selection for significance.

Unless I am missing something here, I think your evidence in Fig. 3 is biased by selection for significance on the measure with measurement error.

Looking forward to hearing your comments on your choice of the proportion measure and whether you think it is fair or biased.

Ok: http://imgur.com/a/F0ASG

Hi Ulrich,

What you describe is how we set it up, and I think what we intended. The scatterplot shows that sometimes the estimate with added error exceeds the estimate without error. Then we just condition on stat sig results because the paper is about the fallacy of thinking that if you got a result, it’s automatic to argue that it would have been more impressive without the added noise. We simulate in the direction we did because people observe an effect, and then they speculate how it would have looked in a more ideal setting.

I just want to be clear that we’re not saying that on average the effect is larger. On average the effect is attenuated the same, regardless of N. It’s just that when it comes to reasoning about individual cases, in a high power setting the attenuation will exceed the sampling variation, and in a low power setting that’s not a given.

So I don’t think I disagree with your simulations. It’s just that you seem to be looking at the expected value. Maybe the point in the article is just too simple. But I have to say that I’m not sure discussions of measurement error always make clear that the attenuation is in expectation and not necessarily in individual cases. We have certainly seen the following argument made about individual studies: “And anyway, if there was measurement error it would be working against me and not in my favor.”

I don’t need an article and simulation to realize that sampling error creates uncertainty and that in a specific individual case, I can overestimate effect sizes even with an unreliable measure. That is what 95% confidence intervals are for. They show me how much my observed results can move around given the variability due to sampling error.

You wrote: “For the largest samples, the observed effect is always smaller than the original. But for smaller N, a fraction of the observed effects exceeds the original. If we were to condition on whether or not the observed effect was statistically significant, then the fraction is even larger (see the figure, right panel).”

I don’t see how this is a warning about individual cases. I think it suggests that the common assumption about the effect of measurement error does not apply when (a) the effect size is small, (b) the sample is small, and (c) there is selection for significance.

I also still do not see how your warning about the interpretation of effect sizes in individual cases is related to replicability.

Your title “Measurement error and the replication crisis” suggests that measurement error has some important implications for replicability, but the only way I see it related is that it reduces power just like small effects or small samples reduce power. Nothing paradoxical happens when we have small effects, small samples and measurement error and select for significance. We still end up with just significant p-values between .05 and .01 (t = 2 to 2.6).

“The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin.”

Yes, but the key problem here is selection for significance and not reporting the non-significant results that are expected given low power. I just don’t get what measurement error has to do with this?

Ulrich:

1. You write, “I don’t need an article and simulation to realize . . .” That’s fine, but the audience for this article is not just you! As Eric and I have discussed, people make the “What does not kill my statistical significance makes it stronger” fallacy all the time, so it did not seem like such a waste of two journal pages to lay out the problem!

2. You write, “I don’t see how this is a warning about individual cases.” Through the magic of copy-and-paste, I can include a simple example yet again: Consider two studies of a phenomenon with underlying effect size somewhere in the range (-1, 1). The first study has a reliable measure and gives standard error of 1. In that case the estimated effect, in order to be published, must be at least 2 in absolute value, thus an exaggeration factor (type M error) of at least 2. The second study has an unreliable measure and gives standard error of 10. In that case the estimated effect, in order to be published, must be at least 20 in absolute value, thus an exaggeration factor (type M error) of at least 20. The study with the unreliable measure will, on average, give estimates that are larger in absolute value.

3. You write, “the key problem here is selection for significance and not reporting the non-significant results that are expected given low power. I just don’t get what measurement error has to do with this?” Measurement error is important in that it creates the conditions for statistically-significant estimates to have the wrong sign (type S error), to be huge overestimates (type M error), and to have validity problems (because of biased measurements as in the fertility example I mentioned earlier in the comments). All three of these problems contribute to the replication crisis.
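The exaggeration-factor comparison in point 2 above can be checked directly by simulation. A sketch (Python; the true effect of 0.5 is an arbitrary illustrative value inside the stated (-1, 1) range, and “published” is taken to mean |estimate| > 1.96 SE):

```python
import numpy as np

rng = np.random.default_rng(1)

def exaggeration(theta, se, n_sims=200_000):
    """Average |estimate| / |true effect| among statistically
    significant results, for estimates ~ Normal(theta, se)."""
    est = rng.normal(theta, se, n_sims)
    significant = est[np.abs(est) > 1.96 * se]
    return np.mean(np.abs(significant)) / abs(theta)

theta = 0.5  # hypothetical true effect inside (-1, 1)
for se in (1.0, 10.0):
    print(f"SE={se}: type M exaggeration factor ~ {exaggeration(theta, se):.1f}")
```

With SE = 1 the significant estimates overshoot the true 0.5 roughly fivefold; with SE = 10, roughly fiftyfold, echoing the “at least 2” versus “at least 20” bounds above.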

I can’t quite figure out why you seem to be so sure that Eric and I are wrong here—my guess is that early on you committed to the idea that we had made a mistake, and it’s been hard for you to move away from that frame. That said, this discussion has been helpful to me in motivating me to explicate our reasoning more carefully, so I thank you for that.

Let’s start with some terminology.

What do you mean by an effect size in the range between -1 and 1. What is the unit of measurement? Are we talking correlations, regression coefficients, standardized mean differences? In your article you even refer to t-values as effects.

Are we talking about population effect sizes that were obtained with unreliable measures or are we talking about population effect sizes for the actual constructs that are being measured without measurement error?

The literature on effect sizes is large and confusing. I don’t think we are going to make much progress, if we do not clearly define what we mean by effect size.

Can you please specify the meaning of “underlying effect size of somewhere in the range (-1,1).”

Ulrich:

We have a zillion examples. Here’s one that I’ve talked about in a few of my papers: It was analysis of a survey comparing the probability of voting for Obama for president among women at different phases of their monthly cycle. An effect of -1 or +1 would represent a 1% decrease or increase in the probability of voting for Obama.

Ok, let’s go with this example.

So, how does measurement error affect your example? We have a measure of phase of the month and a measure of voting intentions. What is a reasonable amount of random measurement error in these measures?

Ulrich:

We discuss the example here, and you can work it out from there. If you have further questions, I refer you again to this comment. That’s all I can do on this; I’m outta here.

Isn’t this only true in the cases where they approximate a credible interval?

We are not going to make much progress here, if we enter a Bayesian vs. Frequentist debate. The black hole of statistics discussions that swallows all reasonable discussions.

A confidence interval tells us how uncertain my estimate of an effect size is. If I find d = .50 and my 95%CI is .01 to .99, I probably should not go around and tell everybody that I found an effect and that the effect size IS half a standard deviation.

If I find d = .50 and my 95%CI is .49 to .51, I am justified to proclaim that I found an effect and the effect size is likely to be around half a standard deviation.

Please take note of the fact that to get precise 95% confidence intervals you need large samples, and when you have large samples the prior of a Bayesian analysis washes out, so a 95% credible interval is practically indistinguishable from a 95% confidence interval.

So, yes, it makes little sense to treat observed effect sizes in small samples as precise estimates of an effect size, because these intervals are so wide.
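To attach sample sizes to those two CI scenarios: using the rough large-sample approximation SE(d) ≈ sqrt(2/n) per group (a back-of-the-envelope sketch that ignores the smaller d-dependent term in the exact formula), the per-group n needed for a given 95% CI half-width is easy to back out:

```python
import math

def n_per_group_for_halfwidth(halfwidth, z=1.959964):
    """Per-group n so a 95% CI for a standardized mean difference d
    has the given half-width, using the approximation SE(d) = sqrt(2/n)."""
    return math.ceil(2.0 * (z / halfwidth) ** 2)

print(n_per_group_for_halfwidth(0.49))  # wide CI such as (.01, .99): n = 32
print(n_per_group_for_halfwidth(0.01))  # tight CI such as (.49, .51): n = 76830
```

A CI as tight as (.49, .51) requires on the order of 75,000 participants per group, which is why such precision is essentially never seen in practice.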

>>>The black hole of statistics discussions that swallows all reasonable discussions.<<<

Amen to that!

If only more people could get over the abstract philosophical pontificating we could focus on the actual problems. Bayesian, frequentist, who cares. Used prudently both can be good tools.

Why is there no like button.

@Rahul that is itself an abstract philosophical argument, because the phrase “used prudently” often doesn’t apply. Who cares about abstract philosophical musings about the frequentist/Bayesian divide; I wish people would focus on solving the very real problems that arise through non-prudent application.

So, instead of rejecting a bogus and meaningless point null, if we focus on estimating the posterior distribution of the parameter of interest, I don’t really care if you do it through a frequentist misinterpretation of confidence intervals (except in cases where credible and confidence intervals differ; see Morey’s work) or do it right using a Bayesian analysis. In fact I often fit frequentist models to get a sense of where things are going before I invest effort into a Bayesian analysis.

The reason Bayesians get het up, imho, is that people grossly misrepresent what we can conclude from a frequentist analysis. That’s about as practical as it gets, eliminating these misunderstandings. I think I lost some three years in a scientific argument with a group arguing for a point null hypothesis based on low-power studies, and I think my colleagues still don’t get it. I also get into arguments with people who think that a super low p-value tells us we can be sure the effect is real. How much more practical do these problems need to be before we start thinking about the so-called abstract philosophical debate about frequentism vs. Bayes?

Shravan:

First, I do think some Bayesians are on a path to less wrong inference practices (along with less wrongly conceived rationales to explain how and why they work), but –

The reason many Bayesians’ arguments get dismissed, imho, is that they grossly misrepresent what we can conclude from posterior probabilities in most applications.

In particular, I am looking forward to the discussions of this paper http://www.stat.columbia.edu/~gelman/research/published/objectivityr5.pdf

Keith wrote: “The reason many Bayesians’ arguments get dismissed, imho, is that they grossly misrepresent what we can conclude from posterior probabilities in most applications.”

I read the paper. Are you referring to Objective Bayesianism? Or the use of frequentist probabilities in Bayesian stats? Who are the Bayesians you are referring to?

“If I find d = .50 and my 95%CI is .01 to .99, I probably should not go around and tell everybody that I found an effect and that the effect size IS half a standard deviation.

If I find d = .50 and my 95%CI is .49 to .51, I am justified to proclaim that I found an effect and the effect size is likely to be around half a standard deviation.”

Ok, explain that difference to all of psychology and get the journals to fall in line. Otherwise, those results are indistinguishable to the tabloids. This seems far off the original point: that small samples produce more variable effect size estimates and therefore produce a larger proportion of studies where the 95% CI excludes 0 (p-value < .05).

This actually isn’t the point though. The point is that small samples estimating an effect with measurement error are especially prone to overestimates after publication bias, basically. The expected value of an effect with measurement error is lower than without measurement error. Large samples mimic that, and the expected value, conditioned on selection is lower as you’d expect. But small samples, selected for significance, tend to vastly overestimate the expected effect with measurement error AND without measurement error.

Assuming your stats are “sound” (your sampling intention lines up with the statistical assumptions), no matter the sample size, alpha = .05 means that 5% of the time you’ll falsely reject the null. Small samples aren’t more or less likely to make this error. The issue is with the estimate you obtain, with or without measurement error, after conditioning on significance. Large samples won’t vastly overestimate the effect; small samples will.

Sorry, copy mistake — the comparison operator doesn’t copy properly in the comments. Here is the formula for the proportion measure:

propor = table(abs(temp3/temp4) > abs(temp1/temp2))[2] / n.sim

Andrew,

too bad that you don’t want to continue our open, adversarial collaboration

I will continue my work on this without you and we may continue our conversation during the review process of a commentary for Science.

So long,

Ulrich

https://www.facebook.com/groups/853552931365745/permalink/1280894358631598/

Thanks Ulrich. Appreciate your efforts to challenge us. As you continue to work on your rebuttal, just want to be sure you are addressing the central point. Here are a couple of examples. Shadish, Cook, and Campbell, p. 49: “Unreliability always attenuates bivariate relationships.” What does “always” mean here? Or take the recent APS (was it APS?) argument that field studies are a way out of the replicability crisis because effects that are found under noisy conditions are more likely to be robust and replicate. Measurement error is a complex topic and we dealt with it simply. But the reason for that is that we kept coming across the presumption that, in the bivariate case or the 2 by 2 epi case, attenuation due to measurement error was automatic. We did find Jurek and Greenland et al. (2006), which makes a similar point in the epi literature. But it seems pretty common in a limitations section of an epi paper to say: there may have been misclassification, but if so it would attenuate anyway. That is true, for large studies. A smaller study shouldn’t just argue away the potential impact of misclassification so easily. Our impression is that it’s not uncommon to make the argument that if not for the noisy interference my results would have been better.

All statistical tests work with the assumption that there is no selection for significance or p-hacking or worse.

It is an unspoken default assumption. Scientists just go in their lab, analyze the data, and report the results.

There have been warnings that all of the inferences based on statistics are wrong if this unspoken assumption is violated.

Sterling (1959) is my favorite, but there are many others and probably even older ones.

Once we select for significance and hide non-significant results, effect sizes are inflated, the type-I error rate is inflated, etc.

What makes your article interesting is that you are pitting two biases against each other.

We have random measurement error that attenuates effect sizes (the effect we could have gotten with a perfectly reliable measure) and reduces statistical power.

We have selection for significance and this leads to an inflation of observed effect sizes and observed power (not true power).

How do these two biases combine? Do they cancel each other out? Do we get inflation or underestimation? Do reliable measures produce more inflation than unreliable measures?

These are all interesting questions. Maybe you didn’t really set out to answer these questions. Maybe Figure 3 was not intended to show that we get more inflation from reliable measures than from unreliable measures.

But my question remains, what are you trying to say about measurement error and the replication crisis? What does measurement error have to do with the replication crisis? My answer is that measurement error will reduce effect sizes and power. As a result, many studies are underpowered, leading to non-significant results. That is, measurement error attenuates effect sizes in large and small samples. But researchers cannot afford non-significant results. Therefore, they find ways to report p < .05 in underpowered studies. This inflates reported effect sizes. But it does so for reliable and unreliable measures.

Maybe you agree with this account, but I don't see your article making this point. I believe readers will think that in some mysterious way more measurement error can lead to more inflation in small samples and that these results are particularly difficult to replicate, but I don't see any evidence for this. Replicability is solely a function of power, and studies with less power will be more difficult to replicate. Whether low power is due to unreliable measures or small effects on reliable measures is irrelevant.

There is ample evidence for this.

If there is measurement error, the expectation of the effect is lower than without. This is attenuation.

My plots and their plots both show this (e.g., https://www.facebook.com/photo.php?fbid=10211731268223388&set=p.10211731268223388&type=3&theater or the several others I’ve posted).

If you have low N, you will generally overestimate the effect if you select for significance.

If you have low N and measurement error, you are not only overestimating the error-free effect, but even more so the expected effect in the population using your measure.

So yes, more measurement error -> lower effects and higher SE -> less power -> selected results are more biased.

But the point of THIS particular paper is to address the fallacy of “if I detected this with a low N and a bad measure, the effect must be even larger than what I detected,” and it’s simply false; the expected value of significant results with low N and a bad measure is higher than the expected value with no measurement error, and especially higher than the expected value of the effect under measurement error. This flawed thinking causes people to vastly overestimate not only the effect, but also the replicability of the effect. I think their article makes exactly this point, as have the simulations I ran to prove this point to you. The plots show that as N increases, this bias decreases to the expected value of the population under error (meaning, the expected value of significant results approaches, and correctly estimates, the expected value of the effect under error). But with low N, obviously the iron rule is very wrong; the expectation under low N for significant results is assuredly not an attenuated effect, it is an overestimate.

They are not saying this is the sole cause of replication problems. They are saying the logical fallacy above certainly contributes to it.

Stephen:

+1. Indeed, I wanted to call the paper, “The ‘What does not kill my statistical significance makes it stronger’ fallacy,” but the journal editor wouldn’t allow it because it exceeded their limit for the number of characters in the title.

Also, I think measurement error is huge, and hugely ignored by researchers, in part because of the focus on statistical significance. The (fallacious) reasoning goes as follows:

A. Measurement error only is important in regards to its effect on precision of estimates.

B. Precision of estimate is measured by the standard error.

C. The estimate is statistically significant (more than 2 standard errors from zero), thus precision must not have been a problem.

D. Thus, measurement error was not a problem in the experiment.

This reasoning is wrong and I thought it was worth a paper in Science to make the point.

@Andrew

Tangential question:

You’ve worked a lot with survey data. Did you typically re-poll the same person twice just to check the stability / repeatability of your survey instrument as a measurement tool? I never recall seeing this done / reported.

I think of this as the analog of (say) testing the repeatability of a weighing scale by measuring the same test mass multiple times.

Rahul:

Yes, we did this in our Xbox study and we found that very few people reported changing their vote preferences. There are a lot of panel surveys out there, and there’s been research going back at least since the 1950s on the variability of survey responses when people are interviewed multiple times.

@Andrew

But why isn’t this standard procedure for *every* survey study? i.e. Take like 5% of people polled & re-survey them to evaluate noisiness in your measurement.

Isn’t every survey topic / cohort & instrument different enough that you’d want to know the particular noisiness in *your* measurement every time than just falling back upon general research on survey variability?

Additionally, shouldn’t re-polling serve as a sort of internal Quality Control on how well you actually administer the survey?

Rahul:

1. You won’t learn much from resampling 5%, it’s just too noisy.

2. It can be hard to re-contact people and get them to do another survey. Maybe this is less of a problem with the internet, but we paid for questions on an internet survey where people were re-contacted after a year, and only 2/3 of the people responded the second time.

3. I don’t know so much about what goes on inside survey organizations, so I don’t know how much quality control they do. But it’s the usual story, that the biggest concerns about quality control come if there are noticeable flaws in the final product; otherwise maybe nobody cares. It would be interesting to know what Gallup and other major survey organizations did back in the 1950s-1970s, when it’s my impression that polling was a more stable business.

@Andrew

Thanks.

#1 is ironic: We cannot use 5% sampling to measure noisiness coz’ the measurement of noise would itself be too noisy?

Isn’t that just an argument for using an *even bigger* re-sample to quantify the noise?

I get the feeling that indeed “measurement error is huge, and hugely ignored by researchers” but that ignorance is conscious. Where’s the noise-quantification & quality control on surveys?

@Andrew

>>> the biggest concerns about quality control come if there are noticeable flaws in the final product;<<<

That approach seems dangerous, especially in the social sciences. How do you close the feedback loop? i.e. Will flaws be noticeable?

e.g. If my survey predicts that gay-canvassers change voter-opinion etc. how does one intuitively tell from the final-product if the survey-inputs were crap?

Sure, Garbage-in-Garbage-Out but in the sort of soc. sci. studies I see today there's often no obvious smell test to identify the garbage coming out!

Hi Eric,

can you help me with this sentence.

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong. Measurement error and selection bias thus can combine to exacerbate the replication crisis.

I cannot help but read it as suggesting that measurement error and selection bias in combination are worse than selection bias alone. But your figure shows that effect sizes are generally more inflated with the reliable measure. Even with small samples, you barely reach the 50% mark at which the unreliable measure becomes more likely to inflate the evidence.

I read it as measurement error in Y, not measurement error in X (or “treatment”), with “measurement error” in Y including all (or much) of the unexplained variance. And then it makes perfect sense, no?

sorry, it doesn’t help. Your response seems to imply that you are treating all residual variance as measurement error.

However, residual variance is only partially measurement error and partially due to other causal factors.

Still not clear what you guys mean by measurement error and selection bias combine to exacerbate the replication crisis.

Would you be ok with a statement “low power and selection bias combine to exacerbate the replication crisis”?

Do you see a difference between these two statements or do you see them as saying the same thing?

Just to be clear, i have no affiliation with Eric or Andrew. Nor do i have a big stake in what constitutes “measurement error”. Nor have I carefully read this paper.

I’m just arguing that if you think of “measurement error” as increasing the variance of the outcome variable (Y), then that paragraph makes sense. And I think that distinction between measurement error in X and measurement error in Y clarifies a lot of the back-forth in this discussion, which reads like a lot of everyone talking past everyone.

Yes, the lack of clear definitions creates a lot of confusion.

Confusing residual variance in a perfectly reliable measure with random measurement error does not help.

Yes, as we increase residual variance in Y while holding the regression coefficient constant, we are reducing the standardized effect size. And if we reduce the standardized effect size, we increase sampling error, and as we increase sampling error, we reduce power (increase type-II error), and if we select for significance, we get more inflation of the observed effect size in the subset of significant studies.

That is all very straightforward and if this is all LG tried to say, they are not wrong, they just said it in a very strange and complicated and confusing way, IMO.

If they said more than this, it would be nice to hear from them and to see what it is so that we can actually test it.

All I can say is that the graphs are potentially misleading and that the proportion measure is problematic, to put it politely.

In your simulation, you use SD = .5 for the error term, with SD = 1 (and hence variance = 1) for the true score.

This gives us 1/(1 + .5^2) = .80.

Now a reliability of .80 is not 1, but it is not a noisy or unreliable measure.

Why did you not use a more extreme example of random measurement error (say, reliability of .5 or lower)?
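The reliability computation in question, in classical-test-theory terms (a small Python sketch; the function and variable names are mine):

```python
def reliability(true_sd, error_sd):
    """Classical test theory: reliability = true-score variance
    divided by total observed variance."""
    return true_sd**2 / (true_sd**2 + error_sd**2)

print(reliability(1.0, 0.5))  # 1/(1 + .5^2) = 0.80, the scenario discussed above
print(reliability(1.0, 1.0))  # 0.50: a measure that is half noise
```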

Andrew and Eric,

I may have finally reconciled our differences in my head.

Figure 3 compares t-values with error and selection for significance to t-values without error AND WITHOUT SELECTION FOR SIGNIFICANCE.

In this scenario, the selection for significance will compensate for the loss in signal due to measurement error. However, with larger samples there is less selection for significance because power reaches 100% and now we only see the effect of random measurement error.

My confusion arose because I find it not very meaningful to examine the effect of random measurement error by comparing selected t-values to t-values that are not selected. I would find it more meaningful to compare the effect of random measurement error on t-values that are selected for significance. That is, there is selection for significance with and without measurement error.

Maybe it was news to some readers of Science that random error attenuates t-values (relative to those with no measurement error) and that selection for significance inflates t-values (relative to a scenario without selection for significance). If this was your only point, I think there would have been a simpler way of saying this, but of course there is no disagreement between you and me on this point.