I may have finally reconciled our differences in my head.

Figure 3 compares t-values with error and selection for significance to t-values without error AND WITHOUT SELECTION FOR SIGNIFICANCE.

In this scenario, the selection for significance will compensate for the loss in signal due to measurement error. However, with larger samples there is less selection for significance because power reaches 100% and now we only see the effect of random measurement error.

My confusion arose because I find it not very meaningful to examine the effect of random measurement error by comparing selected t-values to t-values that are not selected. I would find it more meaningful to compare the effect of random measurement error on t-values that are selected for significance. That is, there is selection for significance with and without measurement error.

Maybe it was news to some readers of Science that random error attenuates t-values (relative to those with no measurement error) and that selection for significance inflates t-values (relative to a scenario without selection for significance). If this was your only point, I think there would have been a simpler way of saying this, but of course there is no disagreement between you and me on this point.

I read the paper. Are you referring to Objective Bayesianism? Or the use of frequentist probabilities in Bayesian stats? Who are the Bayesians you are referring to?

First I do think some Bayesians are on a path to getting less wrong inference practices (along with less wrongly conceived rationales to explain how and why) but –

The reason many Bayesians’ arguments get dismissed, imho, is that they grossly misrepresent what we can conclude from posterior probabilities in most applications.

In particular, I am looking forward to the discussions of this paper http://www.stat.columbia.edu/~gelman/research/published/objectivityr5.pdf

The reason Bayesians get het up, imho, is that people grossly misrepresent what we can conclude from a freq analysis. That’s about as practical as it gets, eliminating these misunderstandings. I think I lost some three years in a scientific argument with a group arguing for a null point hyp based on low power studies, and I think my colleagues still don’t get it. I also get into arguments with people who think that a super low p-value tells us we can be sure the effect is real. How much more practical do these problems need to be to start thinking about the so-called abstract philosophical debate about freq/Bayes?

>>> the biggest concerns about quality control come if there are noticeable flaws in the final product; <<<

That approach seems dangerous, especially in the social sciences. How do you close the feedback loop? i.e. will flaws even be noticeable?

e.g. If my survey predicts that gay-canvassers change voter-opinion etc. how does one intuitively tell from the final-product if the survey-inputs were crap?

Sure, Garbage-in-Garbage-Out but in the sort of soc. sci. studies I see today there's often no obvious smell test to identify the garbage coming out!

Thanks.

#1 is ironic: We cannot use 5% sampling to measure noisiness coz’ the measurement of noise would itself be too noisy?

Isn’t that just an argument for using an *even bigger* re-sample to quantify the noise?

I get the feeling that indeed “measurement error is huge, and hugely ignored by researchers” but that ignorance is conscious. Where’s the noise-quantification & quality control on surveys?
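For a rough sense of the numbers: the precision of a reliability estimate from a re-sample can be sketched with the standard large-sample approximation for the standard error of a correlation. A minimal Python illustration; the poll size, reliability value, and re-sample sizes are hypothetical.

```python
import math

def se_of_correlation(r, m):
    """Approximate standard error of a sample correlation from m pairs:
    (1 - r^2) / sqrt(m - 1)."""
    return (1 - r ** 2) / math.sqrt(m - 1)

# Re-surveying 5% of a 1000-person poll gives m = 50 test-retest pairs.
# If the true test-retest reliability is around .8, its estimate carries
# an SE of about .05, i.e. roughly +/- .10 at 95% confidence:
print(round(se_of_correlation(0.8, 50), 3))
# Re-contacting 1000 respondents instead shrinks the SE to about .011:
print(round(se_of_correlation(0.8, 1000), 3))
```

So a 5% re-sample tells you reliability only very coarsely, which is consistent with both points: the estimate is noisy, and a much larger re-sample would be needed to quantify the noise well.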

1. You won’t learn much from resampling 5%, it’s just too noisy.

2. It can be hard to re-contact people and get them to do another survey. Maybe this is less of a problem with the internet, but we paid for questions on an internet survey where people were re-contacted after a year, and only 2/3 of the people responded the second time.

3. I don’t know so much about what goes on inside survey organizations, so I don’t know how much quality control they do. But it’s the usual story, that the biggest concerns about quality control come if there are noticeable flaws in the final product; otherwise maybe nobody cares. It would be interesting to know what Gallup and other major survey organizations did back in the 1950s-1970s, when it’s my impression that polling was a more stable business.

This gives us 1/(1+.5^2) = .80.

Now a reliability of .80 is not 1, but it is not a noisy or unreliable measure either.

Why did you not use a more radical example of random measurement error (say, reliability of .5 or lower)?
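For reference, the .80 comes from the classical-test-theory identity reliability = var_true / (var_true + var_error); a one-function Python sketch:

```python
def reliability(true_sd, error_sd):
    """Classical-test-theory reliability: the share of observed variance
    that is true-score variance, var_true / (var_true + var_error)."""
    return true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)

# The calculation above: true scores with SD 1 plus error with SD .5
print(reliability(1.0, 0.5))  # 1/(1 + .5^2) = 0.8
# The "more radical" case: error SD equal to the true SD gives .5
print(reliability(1.0, 1.0))
```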

But why isn’t this standard procedure for *every* survey study? i.e. take like 5% of people polled & re-survey them to evaluate noisiness in your measurement.

Isn’t every survey topic / cohort & instrument different enough that you’d want to know the particular noisiness in *your* measurement every time, rather than just falling back upon general research on survey variability?

Additionally, shouldn’t re-polling serve as a sort of internal Quality Control on how well you actually administer the survey?

Yes, we did this in our Xbox study and we found that very few people reported changing their vote preferences. There are a lot of panel surveys out there, and there’s been research going back at least since the 1950s on the variability of survey responses when people are interviewed multiple times.

Tangential question:

You’ve worked a lot with survey data. Did you typically re-poll the same person twice just to check the stability / repeatability of your survey instrument as a measurement tool? I never recall seeing this done / reported.

I think of this as the analog of (say) testing the repeatability of a weighing scale by measuring the same test mass multiple times.

Confusing residual variance in a perfectly reliable measure with random measurement error does not help.

Yes, as we increase residual variance in Y while holding the regression coefficient constant, we are reducing the standardized effect size. And if we reduce the standardized effect size, we increase sampling error, and as we increase sampling error, we reduce power (increase type-II error), and if we select for significance, we get more inflation of the observed effect size in the subset of significant studies.

That is all very straightforward and if this is all LG tried to say, they are not wrong, they just said it in a very strange and complicated and confusing way, IMO.

If they said more than this, it would be nice to hear from them and to see what it is so that we can actually test it.

All I can say is that the graphs are potentially misleading and that the proportion measure is problematic, to put it politely.

+1. Indeed, I wanted to call the paper, “The ‘What does not kill my statistical significance makes it stronger’ fallacy,” but the journal editor wouldn’t allow it because it exceeded their limit for the number of characters in the title.

Also, I think measurement error is huge, and hugely ignored by researchers, in part because of the focus on statistical significance. The (fallacious) reasoning goes as follows:

A. Measurement error only is important in regards to its effect on precision of estimates.

B. Precision of estimate is measured by the standard error.

C. The estimate is statistically significant (more than 2 standard errors from zero), thus precision must not have been a problem.

D. Thus, measurement error was not a problem in the experiment.

This reasoning is wrong and I thought it was worth a paper in Science to make the point.

I’m just arguing that if you think of “measurement error” as increasing the variance of the outcome variable (Y), then that paragraph makes sense. And I think that distinction between measurement error in X and measurement error in Y clarifies a lot of the back-forth in this discussion, which reads like a lot of everyone talking past everyone.

However, residual variance is only partially measurement error and partially due to other causal factors.

Still not clear what you guys mean by “measurement error and selection bias combine to exacerbate the replication crisis.”

Would you be ok with a statement “low power and selection bias combine to exacerbate the replication crisis”?

Do you see a difference between these two statements or do you see them as saying the same thing?

If there is measurement error, the expectation of the effect is lower than without. This is attenuation.

My plots and their plots both show this (e.g., https://www.facebook.com/photo.php?fbid=10211731268223388&set=p.10211731268223388&type=3&theater or the several others I’ve posted).

If you have low N, you will generally overestimate the effect if you select for significance.

If you have low N and measurement error, you are not only overestimating the error-free effect, but even more so the expected effect in the population using your measure.

So yes, more measurement error -> lower effects and higher SE -> less power -> selected results are more biased.

But the point of THIS particular paper is to address the fallacy of “if I detected this with a low N and a bad measure, the effect must be even larger than what I detected.” That reasoning is simply false: the expected value of significant results with low N and a bad measure is higher than the expected value without measurement error, and especially higher than the expected value of the effect under measurement error. This flawed thinking causes people to vastly overestimate not only the effect, but also the replicability of the effect.

I think their article makes exactly this point, as have the simulations I did to prove this point to you. The plots show that as N increases, this bias shrinks toward the expected value of the population under error (meaning the expected value of significant results approaches, and correctly estimates, the expected value of the effect under error). But with low N the iron rule is obviously very wrong; the expectation under low N for significant results is assuredly not an attenuated effect, it is an overestimate.

They are not saying this is the sole cause of replication problems. They are saying the logical fallacy above certainly contributes to it.
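A minimal Python sketch of that selection effect (the sample sizes, slope, and noise level are illustrative, not taken from the paper's simulations):

```python
import math
import random
import statistics

def mean_significant_slope(n, true_b=0.15, noise_sd=2.0, n_sim=1000, seed=1):
    """Mean |slope| among simple regressions y = true_b*x + noise that
    reach significance (|b/se| > 2)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_sim):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_b * xi + rng.gauss(0, noise_sd) for xi in x]
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - mx) ** 2 for xi in x)
        b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
        resid = [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]
        se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
        if abs(b / se) > 2:
            kept.append(abs(b))
    return statistics.fmean(kept)

# Low N plus a noisy measure: only slopes that are large by chance get
# through the significance filter, so the selected subset badly
# overestimates the true slope of .15.
print(mean_significant_slope(50))
# High N: nearly every run is significant, so almost no inflation.
print(mean_significant_slope(3000, n_sim=200))
```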

Can you help me with this sentence?

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong. Measurement error and selection bias thus can combine to exacerbate the replication crisis.

I cannot help but read it as suggesting that measurement error and selection bias in combination are worse than selection bias alone. But your figure shows that effect sizes are generally more inflated with the reliable measure. Even with small samples, you barely reach the 50% mark where the unreliable measure becomes more likely to inflate the evidence.

It is an unspoken default assumption. Scientists just go in their lab, analyze the data, and report the results.

There have been warnings that all of the inferences based on statistics are wrong if this unspoken assumption is violated.

Sterling (1959) is my favorite, but there are many others and probably even older ones.

Once we select for significance and hide non-significant results, effect sizes are inflated, the type-I error rate is inflated, etc.

What makes your article interesting is that you are pitting two biases against each other.

We have random measurement error that attenuates effect sizes (the effect we could have gotten with a perfectly reliable measure) and reduces statistical power.

We have selection for significance and this leads to an inflation of observed effect sizes and observed power (not true power).

How do these two biases combine? Do they cancel each other out? Do we get inflation or underestimation? Do reliable measures produce more inflation than unreliable measures?

These are all interesting questions. Maybe you didn’t really set out to answer these questions. Maybe Figure 3 was not intended to show that we get more inflation from reliable measures than from unreliable measures.

But my question remains, what are you trying to say about measurement error and the replication crisis? What does measurement error have to do with the replication crisis? My answer is that measurement error will reduce effect sizes and power. As a result, many studies are underpowered, leading to non-significant results. That is, measurement error attenuates effect sizes in large and small samples. But researchers cannot afford non-significant results. Therefore, they find ways to report p < .05 in underpowered studies. This inflates reported effect sizes. But it does so for reliable and unreliable measures.

Maybe you agree with this account, but I don't see your article making this point. I believe readers will think that in some mysterious way more measurement error can lead to more inflation in small samples and that these results are particularly difficult to replicate, but I don't see any evidence for this. Replicability is solely a function of power, and studies with less power will be more difficult to replicate. Whether low power is due to unreliable measures or small effects on reliable measures is irrelevant.

Assuming your stats are “sound” (your sampling intention lines up with the statistical assumption), no matter the sample size, alpha = .05 does mean that 5% of the time you’ll falsely reject the null. Small samples aren’t more or less likely to make this error. The issue is with the estimate you obtain with or without measurement error after conditioning on significance. Large samples won’t vastly overestimate the effect; small samples will.
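The first claim is easy to check by simulation; a minimal Python sketch with a mean-zero null and the usual 2-standard-error cutoff (sample sizes and simulation count are illustrative):

```python
import math
import random
import statistics

def rejection_rate(n, n_sim=2000, seed=3):
    """Share of samples drawn from a true null (mean 0) whose sample
    mean falls more than 2 standard errors from zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        se = statistics.stdev(xs) / math.sqrt(n)
        if abs(statistics.fmean(xs)) > 2 * se:
            hits += 1
    return hits / n_sim

# The false-rejection rate is ~5% whether the sample is small or large:
print(rejection_rate(20), rejection_rate(500))
```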

“If I find d = .50 and my 95%CI is .49 to .51, I am justified to proclaim that I found an effect and the effect size is likely to be around half a standard deviation.”

Ok, explain that difference to all of psychology and get the journals to fall in line. Otherwise, those results are indistinguishable to the tabloids. This seems far off the original point, that small samples produce more variable effect size estimates and therefore produce a larger proportion of studies where the [95% CI includes 0 / p-value < .05].

I agree with you it is essentially this: overestimation of effect size increases with publication bias, and decreases with effect size and N. Measurement error decreases effect size, hence overestimation increases (relative to the expected effect size with measurement error).

Too bad that you don’t want to continue our open, adversarial collaboration.

I will continue my work on this without you and we may continue our conversation during the review process of a commentary for Science.

So long,

Ulrich

https://www.facebook.com/groups/853552931365745/permalink/1280894358631598/

We discuss the example here, and you can work it out from there. If you have further questions, I refer you again to this comment. That’s all I can do on this; I’m outta here.

So, how does measurement error affect your example? We have a measure of phase of the month and a measure of voting intentions. What is a reasonable amount of random measurement error in these measures?

We have a zillion examples. Here’s one that I’ve talked about in a few of my papers: It was analysis of a survey comparing the probability of voting for Obama for president among women at different phases of their monthly cycle. An effect of -1 or +1 would represent a 1% decrease or increase in the probability of voting for Obama.

Amen to that!

If only more people could get over the abstract philosophical pontificating we could focus on the actual problems. Bayesian, frequentist, who cares. Used prudently both can be good tools.

A confidence interval tells us how uncertain my estimate of an effect size is. If I find d = .50 and my 95%CI is .01 to .99, I probably should not go around and tell everybody that I found an effect and that the effect size IS half a standard deviation.

If I find d = .50 and my 95%CI is .49 to .51, I am justified to proclaim that I found an effect and the effect size is likely to be around half a standard deviation.

Please take note of the fact that to get precise 95% confidence intervals you need large samples, and when you have large samples the prior of a Bayesian analysis washes out and a 95% confidence interval is practically indistinguishable from a 95% credible interval.

So, yes it makes no sense to interpret observed effect sizes in small samples as estimates of an effect size because these intervals are so huge.
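For a sense of scale, here is a rough Python sketch of the sample sizes behind those two intervals, using the small-d approximation se(d) ≈ sqrt(2/n) for a two-group comparison (the approximation and the per-group framing are assumptions, not from the comment):

```python
import math

def n_per_group_for_halfwidth(halfwidth, z=1.96):
    """Approximate per-group n so the 95% CI for Cohen's d has the given
    half-width, using se(d) ~ sqrt(2/n) for small d."""
    se = halfwidth / z
    return round(2 / se ** 2)

# d = .50 with CI [.01, .99] (half-width .49): only ~32 per group.
print(n_per_group_for_halfwidth(0.49))
# d = .50 with CI [.49, .51] (half-width .01): ~77,000 per group.
print(n_per_group_for_halfwidth(0.01))
```

The second interval is essentially unattainable in typical psychology studies, which is the point about huge intervals in small samples.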

What do you mean by an effect size in the range between -1 and 1? What is the unit of measurement? Are we talking correlations, regression coefficients, standardized mean differences? In your article you even refer to t-values as effects.

Are we talking about population effect sizes that were obtained with unreliable measures or are we talking about population effect sizes for the actual constructs that are being measured without measurement error?

The literature on effect sizes is large and confusing. I don’t think we are going to make much progress, if we do not clearly define what we mean by effect size.

Can you please specify the meaning of “underlying effect size of somewhere in the range (-1,1).”

That is what 95% confidence intervals are for. They show me how much my observed results can move around given the variability due to sampling error.

Isn’t this only true in the cases where they approximate a credible interval?

1. You write, “I don’t need an article and simulation to realize . . .” That’s fine, but the audience for this article is not just you! As Eric and I have discussed, people make the “What does not kill my statistical significance makes it stronger” fallacy all the time, so it did not seem like such a waste of two journal pages to lay out the problem!

2. You write, “I don’t see how this is a warning about individual cases.”. Through the magic of copy-and-paste, I can include a simple example yet again: Consider two studies of a phenomenon with underlying effect size of somewhere in the range (-1, 1). The first study has a reliable measure and gives standard error of 1. In that case the estimated effect, in order to be published, must be at least 2 in absolute value, thus an exaggeration factor (type M error) of at least 2. The second study has an unreliable measure and gives standard error of 10. In that case the estimated effect, in order to be published, must be at least 20 in absolute value, thus an exaggeration factor (type M error) of at least 20. The study with the unreliable measure will, on average, give estimates that are larger in absolute value.
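The arithmetic in point 2 can be written out directly; a small Python sketch (the (-1, 1) bound on the true effect and the 2-standard-error publication threshold are the ones from the example):

```python
def exaggeration_lower_bound(se, max_true_effect=1.0, threshold=2):
    """Lower bound on the type M (exaggeration) factor: a published
    estimate must exceed threshold * se in absolute value, while the
    true effect is at most max_true_effect in absolute value."""
    return threshold * se / max_true_effect

print(exaggeration_lower_bound(se=1))   # reliable measure: at least 2x
print(exaggeration_lower_bound(se=10))  # unreliable measure: at least 20x
```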

3. You write, “the key problem here is selection for significance and not reporting the non-significant results that are expected given low power. I just don’t get what measurement error has to do with this?” Measurement error is important in that it creates the conditions for statistically-significant estimates to have the wrong sign (type S error), to be huge overestimates (type M error), and to have validity problems (because of biased measurements as in the fertility example I mentioned earlier in the comments). All three of these problems contribute to the replication crisis.

I can’t quite figure out why you seem to be so sure that Eric and I are wrong here—my guess is that early on you committed to the idea that we had made a mistake, and it’s been hard for you to move away from that frame. That said, this discussion has been helpful to me in motivating me to explicate our reasoning more carefully, so I thank you for that.

You wrote: “For the largest samples, the observed effect is always smaller than the original. But for smaller N, a fraction of the observed effects exceeds the original. If we were to condition on whether or not the observed effect was statistically significant, then the fraction is even larger (see the figure, right panel).”

I don’t see how this is a warning about individual cases. I think it suggests that the common assumption about the effect of measurement error does not apply when (a) the effect size is small, (b) the sample is small, and (c) there is selection for significance.

I also still do not see how your warning about the interpretation of effect sizes in individual cases is related to replicability.

Your title “Measurement error and the replication crisis” suggests that measurement error has some important implications for replicability, but the only way I see it related is that it reduces power just like small effects or small samples reduce power. Nothing paradoxical happens when we have small effects, small samples and measurement error and select for significance. We still end up with just significant p-values between .05 and .01 (t = 2 to 2.6).

“The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin.”

Yes, but the key problem here is selection for significance and not reporting the non-significant results that are expected given low power. I just don’t get what measurement error has to do with this?

What you describe is how we set it up, and I think what we intended. The scatterplot shows that sometimes the estimate with added error exceeds the estimate without error. Then we just condition on stat sig results because the paper is about the fallacy of thinking that if you got a result, it’s automatic to argue that it would have been more impressive without the added noise. We simulate in the direction we did because people observe an effect, and then they speculate how it would have looked in a more ideal setting.

I just want to be clear that we’re not saying that on average the effect is larger. On average the effect is attenuated the same, regardless of N. It’s just that when it comes to reasoning about individual cases, in a high power setting the attenuation will exceed the sampling variation, and in a low power setting that’s not a given.

So I don’t think I disagree with your simulations. It’s just that you seem to be looking at the expected value. Maybe the point in the article is just too simple. But I have to say that I’m not sure the distinction is always made clearly in discussions of measurement error that the attenuation is in expectation and not necessarily in individual cases. We have certainly seen the following argument made about individual studies: “And anyway, if there was measurement error it would be working against me and not in my favor.”

propor = table(abs(temp3/temp4) > abs(temp1/temp2))[2] / n.sim

propor[j] <- table(abs(temp[,3]/temp[,4]) > abs(temp[,1]/temp[,2]))[2]/length(temp[,1])

Your Figure 3 shows that in most cases the error-free measure produces stronger evidence (you use b/se = t for comparison here) for your scenario of r = .15 and SD of error variance = .5.

You show that for small sample sizes this flips and suddenly we see more than 50% cases, where the measure with error variance shows stronger evidence for an effect.

Ironically, your analysis is biased by selecting for significance. Here is your code:

First you select only those simulations that produced a significant result for the regression with measurement error.

temp <- sims[abs(sims[,3]/sims[,4]) > 2,]

Second, you compare the evidence for these selected cases to the evidence for the ‘matching’ regression without measurement error.

propor[j] <- table(abs(temp[,3]/temp[,4]) > abs(temp[,1]/temp[,2]))[2]/length(temp[,1])

However, this comparison is biased by selecting for significance on the regression with measurement error.

We can reverse the analysis and first select those cases in which the regression without measurement error is significant.

Then we see which of the two matched regressions produces stronger evidence.

Now the regression without measurement error produces proportions over 50%. More important, the proportion is greater than the one for the selection based on the regression with measurement error.

I think there is a fundamental problem with selection for significance and a comparison of proportions.

A cleaner way to compare the methods would be to compute the average strength of evidence for cases that produced significant results.

We can then compare the average (mean or median) t-value and see whether random measurement error after selection for significance produces stronger evidence against the null-hypothesis than a measure without random measurement error after selection for significance.
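That comparison is easy to carry out; a minimal Python sketch (the slope, sample size, and error SD are illustrative stand-ins, not values from the Science code):

```python
import math
import random
import statistics

def median_significant_t(error_sd, n=50, true_b=0.15, n_sim=2000, seed=7):
    """Median |t| across simulated regressions y = true_b*x + residual
    (+ optional measurement error), among runs with |t| > 2."""
    rng = random.Random(seed)
    ts = []
    for _ in range(n_sim):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_b * xi + rng.gauss(0, 1) + rng.gauss(0, error_sd)
             for xi in x]
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - mx) ** 2 for xi in x)
        b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
        resid = [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]
        se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
        if abs(b / se) > 2:
            ts.append(abs(b / se))
    return statistics.median(ts)

# Median significant |t| without vs. with added measurement error:
print(median_significant_t(0.0), median_significant_t(0.5))
```

Because both medians condition on |t| > 2, they land just above the cutoff in low-power settings, which is exactly what a symmetric comparison of the two measures after selection would examine.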

Unless I am missing something here, I think your evidence in Fig. 3 is biased by selection for significance on the measure with measurement error.

Looking forward to hearing your comment on your choice of the proportion measure and whether you think it is fair or biased.

Yes, the statistical point is not subtle. We wrote the article because we see the “What does not kill my statistical significance makes it stronger” fallacy all the time, including from researchers with some statistical sophistication. Also we see lots of textbook examples and applied research projects that do not take measurement seriously, I think because people have the naive view that, if they’ve successfully attained statistical significance, that measurement isn’t something they really need to worry about. Although the math is simple, the conceptual error is widespread.

Just to clarify for anyone who is dubious about Gelman and others’ statements here (though to me, they should be obvious):

https://www.facebook.com/photo.php?fbid=10211718000611706&set=p.10211718000611706&type=3&theater

https://www.facebook.com/photo.php?fbid=10211718211576980&set=p.10211718211576980&type=3&theater

https://www.facebook.com/photo.php?fbid=10211718076613606&set=p.10211718076613606&type=3&theater

d = “effect size if measurement were perfect”

d.actual = the effect size in the population of studies that use your measure.

d.select = the d statistic of those studies that were significant

alpha = Cronbach’s alpha; .999 is treated as a perfect measure for the simulation to work

There are two things working here.

1) The expected value of the effect size in a population of studies that use the crappy measure is lower than the expected value of the effect size in a population of studies free of measurement error. This is where the bias likely comes from. It /is/ true that bad measures underestimate error-free effect sizes. If you had 10000 studies with no measurement error, their effect sizes would be larger than the 10000 studies with measurement error.

2) That said, that point is irrelevant to the replication crisis. The issue is that in small samples, especially, with bad measures, especially, the effect sizes MUST be large by chance to detect something significant. In the population of studies using your crappy measure, the effect size is lower (due to measurement error). But if someone replicates your study using your measure (basically, drawing another study from the population of studies with the measure), they will probably not replicate it. Even if they do a power analysis with your [overestimated] effect size, their target is incorrect.

More concretely:

Without any measurement error, say groups A and B are different with an effect size of d = .4.

With measurement error, the expected value of the effect size is d = .25.

You obtain an estimate of d = .35. It’s significant, by chance, because it’s an overestimate.

That is, d=.35 is an overestimate for the population of individuals who use your measure, but an underestimate of the ‘true’ effect if there were no measurement error; this latter point, again, is irrelevant to replication.

Someone sees that d=.35, tries to replicate your study and obtains d = .25; non significant. They may have even conducted a power analysis, aiming for .80 power assuming the true effect is .35, but of course, this is an overestimate, so their true power was less.

They actually obtain the asymptotically correct answer, and fail to replicate the finding.

That is all this article is saying, I think. More measurement error -> need high N or an overestimate to detect the effect. This results in replication failures. This is the same logic for why underpowered studies with significant effects necessarily overestimate the effect; in fact, bad measures only make studies more underpowered, so it’s not even the same logic, it’s quite literally the same thing. Moreover, I realized when doing this simulation that people do power analyses based on “true” measurement-error-free effects; they could probably guess what the ‘true’ effect is, then adjust it downward to the expected value of the effect given a crappy measure — the power analysis would be more “accurate”.
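The d = .4 → .25 step above corresponds to an outcome reliability of roughly .39, since unreliability in the outcome shrinks a standardized difference by sqrt(reliability). A Python sketch of the downstream power consequence for the replicator (the normal-theory power formula and the .80 target follow the example; the specific numbers are illustrative):

```python
import math
from statistics import NormalDist

def attenuated_d(d_true, reliability):
    """Outcome unreliability inflates the outcome SD by 1/sqrt(reliability),
    shrinking a standardized mean difference by sqrt(reliability)."""
    return d_true * math.sqrt(reliability)

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate normal-theory power of a two-sample test for effect d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - d * math.sqrt(n_per_group / 2))

d_err = attenuated_d(0.4, 0.39)                 # ~0.25, as above
# The replicator powers at 80% for the inflated estimate d = .35...
n = round(2 * ((1.96 + 0.84) / 0.35) ** 2)      # ~128 per group
# ...but against the attenuated effect, realized power is only ~50%:
print(n, round(power_two_sample(0.35, n), 2),
      round(power_two_sample(d_err, n), 2))
```

So a "well-powered" replication planned from the significant estimate is in fact a coin flip, which is the replication-failure mechanism described above.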

I set error variance for x to 0 (simulating experimental studies with random assignment to groups).

I set error variance for y to 10 (unrealistic large value, but it helps to see the pattern more clearly).

### Number of simulations

n.sim = 1000

# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15

### this is the population effect size

r <- .15

### this is the sample size (N = 50)

N = 50

### this stores the results

sims <- array(0,c(n.sim,4))

### this is error in the predictor variable (x)

xerror <- 0

### this is error in the criterion variable (y)

yerror <- 10

### this is a loop that runs the simulations

for (i in 1:n.sim) {

### this creates a standard normal predictor variable with N cases

x <- rnorm(N,0,1)

### this creates the criterion variable with residual variance of 1

### importantly this means the variance in y = var(x)*r^2 + 1 = .15^2 + 1 = 1.0225

### the standardized effect size is .15 / sqrt(1.0225) ~ .148, i.e. about .15

y <- r*x + rnorm(N,0,1)

### xx is the unstandardized effect size estimate in a sample with N observations

xx <-lm(y~x)

### the regression coefficient is stored

sims[i,1]<-summary(xx)$coefficients[2,1]

### this line of code creates a predictor variable (x) with measurement error

x <- x + rnorm(N,0,xerror)

### this line of code adds measurement error to the criterion variable (y)

y <- y + rnorm(N,0,yerror)

### xx is the unstandardized effect size estimate in a sample with N observations

xx <- lm(y~x)

### the regression coefficient is stored

sims[i,2] <- summary(xx)$coefficients[2,1]

####

#### repeat everything with N = 3000

N = 3000

x <- rnorm(N,0,1)

y <- r*x + rnorm(N,0,1)

xx <-lm(y~x)

sims[i,3] <- summary(xx)$coefficients[2,1]

x <- x + rnorm(N,0,xerror)

y <- y + rnorm(N,0,yerror)

xx <- lm(y~x)

sims[i,4] <- summary(xx)$coefficients[2,1]

} ### End of Loop for simulations

colnames(sims) = c("N=50,ME=N","N=50,ME=Y","N=3000,ME=N","N=3000,ME=Y")

summary(sims)

Here are the results:

N=50,ME=N N=50,ME=Y N=3000,ME=N N=3000,ME=Y

Min. :0.0937 Min. :-0.48535 Min. :0.08544 Min. :-0.50082

1st Qu.:0.1380 1st Qu.: 0.03157 1st Qu.:0.13649 1st Qu.: 0.03192

Median :0.1504 Median : 0.14968 Median :0.15049 Median : 0.14810

Mean :0.1502 Mean : 0.15057 Mean :0.14972 Mean : 0.15256

3rd Qu.:0.1626 3rd Qu.: 0.27042 3rd Qu.:0.16174 3rd Qu.: 0.26372

Max. :0.2039 Max. : 1.24979 Max. :0.21481 Max. : 0.72539

1. We see that there is no systematic bias in the effect size estimate (mean = .15 in all four simulations).

2. We see that measurement error increases the variance in estimates for small and large samples.

3. We see the largest variability for small samples with measurement error.

If we select for significance, we get the largest unstandardized effect size from a small sample with lots of measurement error.

This is solely a function of increased variability.

There is no systematic underestimation due to measurement error in large samples that is reversed in small samples with selection for significance.

The source of the variance in Y is irrelevant. Only the amount of variability matters. We can also change the residual variance in your simulation (fixed at 1) and keep the error variance constant. We would get the same result.

]]>From reading that blog, it really just appears as though they stated the same information differently and with different figures. Their "smoking gun", the scatter plots on the same scale as the others, also clearly shows the same information, but is not as visually appealing or as clear. The blog also made it appear that N not being fixed was some oversight, when in fact the figures in the paper make this very clear. It should also be noted that Science articles often have limited space to describe everything, whereas the blogger had no such constraint. ]]>

I noticed that you simulated measurement error in X and Y.

This makes sense for a correlational study, and measurement error in X will attenuate the unstandardized regression coefficient.

However, in experiments, measurement error in X is 0. In this case, the unstandardized regression coefficient is not attenuated by the introduction of measurement error in Y. You can set the standard deviation of the error in Y to 1 or to 100.

> N1 = 50000

> x = rnorm(N1,0,1)

> y1 = r*x + rnorm(N1,0,1)

> y2 = r*x + rnorm(N1,0,100)

> summary(lm(y1 ~ x))$coefficients[2,1]

[1] 0.1503872

> summary(lm(y2 ~ x))$coefficients[2,1]

So, the first premise of your claim, that measurement error/noise leads to an underestimation of effect sizes in large samples, is incorrect when we use unstandardized effect sizes as effect size estimates.

]]>I followed the link. The author writes, “the lack of clarity does potentially also do quite some harm by confusing the reader about important concepts.” I hope the P.S. added to the above post helps clarify. Eric and I were thinking of measurement error in the general sense, and it does seem that there was some confusion because we did not explicitly define the term.

It would be good if it were general practice for every paper to come with its own blog post so that these sorts of issues could be resolved to the satisfaction of the readers. I've published hundreds of papers, but it's still so hard to anticipate ahead of time what communication problems might arise.

]]>Thanks for your response. I think we are making progress in reaching some understanding of the definitions and of each other's perspectives.

Also, people have been discussing this in the Psychological Methods Discussion Group, which also helped me to clarify some confusion.

https://www.facebook.com/groups/853552931365745/permalink/1278445418876492/

One issue that is becoming clearer is that we need to define what we mean by effect size.

Unstandardized effect sizes like the mean difference between two groups or an unstandardized regression coefficient in a multi-level model are not systematically affected by random measurement error. If the same study is repeated many times, the average effect size estimate is the same with or without measurement error.

Standardized effect sizes relate the unstandardized effect size to the observed variability: Cohen's d = Mean.Diff / SD. As random measurement error increases that variability, standardized effect sizes decrease with increasing measurement error.
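The attenuation can be written as d.obs = d.true / sqrt(1 + error variance / true variance), and checked with a sketch (a true d of .50 and error variance equal to true variance, i.e. reliability .50, are assumed for illustration):

```r
### sketch: the mean difference is unchanged, Cohen's d is attenuated
set.seed(789)
n <- 1e5
g1 <- rnorm(n, 0.0, 1) + rnorm(n, 0, 1)  # group 1: true score plus measurement error
g2 <- rnorm(n, 0.5, 1) + rnorm(n, 0, 1)  # group 2: true mean difference = .50
mean(g2) - mean(g1)                                    # unstandardized: still ~ .50
(mean(g2) - mean(g1)) / sqrt((var(g1) + var(g2)) / 2)  # d: ~ .50 / sqrt(2), i.e. ~ .35
```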

In small samples, effect size estimates are more variable, which means it is more likely to get extreme effect sizes. They are less likely to be significant, but when they are significant, they are inflated. This is your main point, but there is no downward bias due to measurement error: we are starting with the same effect size as without measurement error. We might say that measurement error has an effect because the increased variability makes it harder to reach significance, so significant results need more inflation to be reported. But the reason is not that random error attenuates the effect size. The reason is that it inflates the residual variance.

Standardized effect sizes are affected by measurement error and decrease as a function of measurement error. In small samples with lots of sampling error, there is more variability around the attenuated observed population effect size. But now there are two effects on the effect size estimates that become significant: random measurement error makes them lower, and selection for significance inflates them. My simulation shows that selection for significance does not fully compensate for the attenuation due to random measurement error, and that the most inflated estimates are obtained without measurement error.
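That comparison can be run directly. This is a sketch with assumed parameters (r = .15, N = 50, error only in Y); standardizing both variables makes the slope equal to the correlation, so the two conditions are compared on the same scale:

```r
### sketch: significant standardized slopes with vs. without measurement error in Y
set.seed(321)
N <- 50; r <- .15; reps <- 20000
sim <- function(yerror) {
  x <- rnorm(N)
  y <- r * x + rnorm(N, 0, 1) + rnorm(N, 0, yerror)
  fit <- summary(lm(scale(y) ~ scale(x)))
  c(b = fit$coefficients[2, 1], p = fit$coefficients[2, 4])
}
no.err <- replicate(reps, sim(0))
w.err  <- replicate(reps, sim(1))
mean(no.err["b", no.err["p", ] < .05])  # mean significant standardized slope, no error
mean(w.err["b",  w.err["p", ] < .05])   # mean significant standardized slope, with error
```

Comparing the two selected means shows how much of the attenuation survives selection for significance.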

Does this make sense to you?

]]>