Alex Hoffman points me to this interview by Dylan Matthews of education researcher Thomas Kane, who at one point says,

Once you corrected for measurement error, a teacher’s score on their chosen videos and on their unchosen videos were correlated at 1. They were perfectly correlated.

Hoffman asks, “What do you think? Do you think that just maybe, perhaps, it’s possible we ought to consider, I’m just throwing out the possibility that it might be that the procedure for correcting measurement error might, you know, be a little too strong?”

I don’t know exactly what’s happening here, but it might be something that I’ve seen on occasion when fitting multilevel models using a point estimate for the group-level variance. It goes like this: measurement-error models are multilevel models, in that they involve estimating the distribution of a latent variable. When fitting multilevel models, it is possible to estimate the group-level variance to be zero, even though the group-level variance is not zero in real life. We have a penalized-likelihood approach to keep the estimate away from zero (see this paper, to appear in Psychometrika) but this is not yet standard in computer packages. The result is that in a multilevel model you can get estimates of zero variance or perfect correlations because the variation in the data is less than its expected value under the noise model. With a full Bayesian approach, you’d find that the correlation could take on a range of possible values; it’s not really equal to 1.
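
To see how a point estimate can land exactly on zero, here is a minimal simulation sketch in Python. It assumes the simplest balanced setting (one observed mean per group, with a known standard error), where the marginal maximum-likelihood estimate of the group-level variance is the variance of the group means minus the noise variance, truncated at zero. The function names and the numbers (8 groups, group-level sd 0.5, standard error 1) are made up for illustration. This is not the penalized-likelihood approach mentioned above, and it is not a reconstruction of anything Kane did.

```python
import random


def tau2_point_estimate(group_means, se2):
    """ML point estimate of the group-level variance in the balanced,
    known-noise case: variance of the group means minus the noise
    variance, truncated at zero."""
    J = len(group_means)
    grand_mean = sum(group_means) / J
    total_var = sum((m - grand_mean) ** 2 for m in group_means) / J
    return max(0.0, total_var - se2)


def fraction_zero(J=8, tau=0.5, se=1.0, n_sims=2000, seed=1):
    """Share of simulated datasets whose variance estimate is exactly zero,
    even though the true group-level sd (tau) is positive."""
    rng = random.Random(seed)
    zeros = 0
    for _ in range(n_sims):
        # observed group mean = true group effect + sampling noise
        means = [rng.gauss(0.0, tau) + rng.gauss(0.0, se) for _ in range(J)]
        if tau2_point_estimate(means, se ** 2) == 0.0:
            zeros += 1
    return zeros / n_sims


if __name__ == "__main__":
    print("fraction of zero-variance estimates:", fraction_zero())
```

Under these assumed settings a large share of the simulated datasets give a variance estimate of exactly zero, purely because the observed spread of the group means happens to fall below what the noise model alone would predict.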

I don’t know what particular procedure Kane used, but in psychology applications I’ve often seen people correct for error by dividing a correlation by the square root of the product of the measures’ reliability coefficients (usually Cronbach’s alpha). See here: http://en.wikipedia.org/wiki/Correction_for_attenuation
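
As a quick illustration of that formula, here is a small Python sketch. The function name `disattenuate` and the input numbers are hypothetical, chosen only to show how the corrected value moves as the reliability estimates get smaller.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation:
    observed correlation divided by sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    observed = 0.60  # hypothetical observed correlation
    # With reliabilities of 0.80 the corrected value is about 0.75.
    print(disattenuate(observed, 0.80, 0.80))
    # With reliabilities of 0.55 the "correlation" comes out above 1.
    print(disattenuate(observed, 0.55, 0.55))
```

Nothing in the formula caps the result at 1, so the corrected value is only as trustworthy as the reliability estimates that go into the denominator.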

The problem is that if Cronbach’s alpha (or whatever reliability procedure you are using) underestimates a measure’s reliability, or is an inappropriate way to estimate it (for example, if the measure does not fit the classical-test-theory model of a single source of common variance plus random errors), you’ll get unbelievably large “disattenuated” correlations. Again, I haven’t read Kane’s paper so I don’t know what he actually did. But in a social-judgment task like the one described in the article, if you have the same judges rating both the “chosen” video set and the “unchosen” video set, you could easily have judge-specific factors contributing to their ratings. E.g., Judge A and Judge B each have a distinct way of judging videos, which they apply to both the chosen and unchosen videos that they rate. That would lower the between-judge agreement on any given video set but increase the correlation between video sets, which would inflate the “disattenuated” correlations.
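
The judge-effect story can be checked with a toy simulation. The Python sketch below rests on assumptions that are entirely mine: each teacher is rated by two judges, each judge has an idiosyncratic leaning toward that teacher which carries over to both video sets, reliability is estimated from between-judge agreement (stepped up with the Spearman-Brown formula), and the true correlation between chosen and unchosen quality is 0.7. None of this is taken from Kane’s study; the names and parameter values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r, k=2):
    """Reliability of the mean of k raters, given single-rater agreement r."""
    return k * r / (1 + (k - 1) * r)


def simulate(n_teachers=5000, rho=0.7, sd_judge=0.7, sd_noise=0.7, seed=1):
    rng = random.Random(seed)
    chosen = [[], []]    # chosen[j] holds judge j's ratings of chosen videos
    unchosen = [[], []]  # same for unchosen videos
    for _ in range(n_teachers):
        q_c = rng.gauss(0, 1)
        # true unchosen quality, correlated rho (< 1) with chosen quality
        q_u = rho * q_c + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        for j in (0, 1):
            # assumed judge-specific leaning, reused for BOTH video sets
            b = rng.gauss(0, sd_judge)
            chosen[j].append(q_c + b + rng.gauss(0, sd_noise))
            unchosen[j].append(q_u + b + rng.gauss(0, sd_noise))
    mean_c = [(a + b) / 2 for a, b in zip(chosen[0], chosen[1])]
    mean_u = [(a + b) / 2 for a, b in zip(unchosen[0], unchosen[1])]
    rel_c = spearman_brown(pearson(chosen[0], chosen[1]))
    rel_u = spearman_brown(pearson(unchosen[0], unchosen[1]))
    observed = pearson(mean_c, mean_u)
    return {
        "true": rho,
        "observed": observed,
        "disattenuated": observed / math.sqrt(rel_c * rel_u),
    }


if __name__ == "__main__":
    print(simulate())
```

In this setup the judge leanings count as error when reliability is computed within a video set, but they are shared across the two sets, so the corrected correlation comes out well above the true 0.7. Raising `sd_judge` pushes it higher still.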

I came here to say this.

It’s also worth adding that it makes assumptions about the measurement, and if you test those assumptions, they’re never satisfied.

It must have taken great self-control to type this without mentioning the “8 schools” SAT coaching analysis in Bayesian Data Analysis.

Re regularized variance estimates: This is very very important, and I wish people would stop publishing model fits with zero variance estimates and/or -1/+1 correction estimates.

On that topic: What are the future plans for blme? blme is great, when I want to help people who probably aren’t quite ready to just go all Bayes and forget about point estimates. lme4 is getting a big update soon, which changes the internals quite a lot, so I’m guessing blme will either fade away or eventually get updated.

*correlation estimates

When lme4 is updated, we will update blme.

Andrew: I think my postdoc contacted Vince Dorie in the last day or two to say that we have done some internal reorganization of (the development version of) lme4 to try to make it more modular and easier to build extensions (like blme) on top of, and would appreciate feedback from downstream package developers …

Ben:

Another option is we could talk about including blme as part of lme4. That is, instead of having the new function blmer/bglmer, you could add some additional arguments to lmer/glmer to allow for prior distributions, with some setting so that the user could get the blmer/bglmer defaults as an easy option. If this were all in the standard lmer/glmer function, that would make it accessible to more people. And I don’t think it should be too hard on your end, given that it’s only a small modification to your existing functions (just adding in a penalty right before the computation of the marginal likelihood).

We could talk about adding sim() into your package too!

What gets me is mostly substantive.

We have a high-prestige policy-maker making specific recommendations to street-level bureaucrats based upon literally impossible-to-believe statistical findings that demonstrate no understanding of how the relevant policies work.

* Despite Kane’s implications, the goal of teacher evaluation is NOT to rank teachers. So, it’s not clear where he is getting that.

* Despite Kane’s implication, teacher evaluation has long been multi-dimensional. A binary composite/summary score has long been required, but we are moving away from that. Nationally, districts (and states) are moving towards either Charlotte Danielson’s or Robert Marzano’s frameworks for teacher evaluation. Evaluators (e.g. school principals) have to rate teachers on a variety of aspects of their teaching. Even if (ha!) the rankings stayed consistent, how well teachers demonstrate particular aspects of their skill and craft could vary.

* Clearly, there has been some overcorrection and therefore LOSS of information through the statistical techniques used. This calls all of the findings into question. So long as you are willing to lose information, we can come up with any result you want. For such a high-profile study into such an important topic to be content with such obviously lousy results is, quite frankly, shocking. To brag about it? It costs the entire effort enormous amounts of credibility.

Or, it should.

But it won’t. I know. Journalists lack statistical knowledge or substantive knowledge of the field in question. Folks who understand statistics lack substantive knowledge of the field, and folks who are expert in this field lack substantive knowledge of statistics.

When high-level policy-makers and politicians lead research, how much are we going to sacrifice to support their hopes, dreams and delusions? How much are we going to let their ignorance and agendas run the show?