Value-added assessment: What went wrong?

Jacob Hartog writes the following in reaction to my post on the use of value-added modeling for teacher assessment:

What I [Hartog] think has been inadequately discussed is the use of individual model specifications to assign these teacher ratings, rather than the zone of agreement across a broad swath of model specifications.

For example, the model used by NYCDOE doesn’t just control for a student’s prior-year test score (as I think everyone can agree is a good idea). It also assumes that different demographic groups will learn different amounts in a given year, and assigns a school-level random effect. The result is that, as was much ballyhooed at the time of the release of the data, the average teacher rating for a given school is roughly the same, no matter whether the school is performing great or terribly. The headline from this was “excellent teachers spread evenly across the city’s schools,” rather than “the specification of these models assumes that excellent teachers are spread evenly across the city’s schools.”
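
Schematically (my notation, and only a sketch of the general structure, not the DOE’s exact specification), the model looks like

$$ y_{ig} = \beta_0 + \beta_1\, y_{i,g-1} + X_i\gamma + \theta_{j(i)} + \alpha_{s(i)} + \varepsilon_{ig}, $$

where $y_{ig}$ is student $i$’s score in grade $g$, $y_{i,g-1}$ is the prior-year score, $X_i$ holds the demographic indicators with their own coefficients $\gamma$, $\theta_{j(i)}$ is the effect of student $i$’s teacher, and $\alpha_{s(i)}$ is the school-level random effect. The published “value added” is the estimate of $\theta_j$, after the $\gamma$ and $\alpha$ terms have absorbed whatever variation they are allowed to absorb.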

To be partisan for a moment, imagine using a multi-level model to assess the efficacy of basketball players that imposes a team-level random effect and controls for covariates representing player budget and perceived coaching quality: we might easily ‘discover’ that the average player on the Charlotte Hornets was as good as the average player on the Chicago Bulls, when really a) that is an effect of the model design, and b) good players are what makes a team good, just as good teachers are by and large what makes a school good. I also think that our baseline assumption should be that if demographic groups are learning different amounts in a given year, it is because they are getting different qualities of education, not because their intrinsic coefficients are different from one another.
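
To make that mechanism concrete, here is a toy simulation, with made-up numbers and plain numpy, nothing like the actual DOE specification. Teacher quality is clustered by school; once a school effect is absorbed (here by crude within-school demeaning, which is the limiting, no-shrinkage case of a school-level random effect), every school’s average estimated teacher rating collapses to the same value.

import numpy as np

rng = np.random.default_rng(0)
n_schools, teachers_per_school, students_per_teacher = 20, 10, 25
n_teachers = n_schools * teachers_per_school

# Schools differ a lot in the average quality of their teachers (hypothetical units).
school_quality = rng.normal(0.0, 1.0, n_schools)
teacher_quality = (np.repeat(school_quality, teachers_per_school)
                   + rng.normal(0.0, 0.5, n_teachers))
school_of_teacher = np.repeat(np.arange(n_schools), teachers_per_school)

# Each student's test-score gain is their teacher's quality plus noise.
teacher_of_student = np.repeat(np.arange(n_teachers), students_per_teacher)
gains = (teacher_quality[teacher_of_student]
         + rng.normal(0.0, 1.0, n_teachers * students_per_teacher))

# Estimate 1: raw value added = mean gain of each teacher's students.
raw_va = np.array([gains[teacher_of_student == j].mean() for j in range(n_teachers)])

# Estimate 2: absorb a school effect first by demeaning within school
# (a crude stand-in for what a school-level random effect soaks up).
school_of_student = school_of_teacher[teacher_of_student]
school_mean_gain = np.array([gains[school_of_student == s].mean() for s in range(n_schools)])
adjusted_va = raw_va - school_mean_gain[school_of_teacher]

# Average estimated teacher rating per school under the two estimators.
for s in range(5):
    t = school_of_teacher == s
    print(f"school {s}: true mean quality {teacher_quality[t].mean():+.2f}, "
          f"raw VA {raw_va[t].mean():+.2f}, school-adjusted VA {adjusted_va[t].mean():+.2f}")

In the printout, the school-adjusted column is essentially zero for every school, no matter how different the schools’ true average teacher quality is: “excellent teachers spread evenly across the city’s schools,” by construction.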

Jesse Rothstein had a good paper a couple of years back showing that, due to dynamic tracking of students, a 5th grade teacher’s value-added score is often correlated with their students’ 4th grade progress, which it shouldn’t be if these models are unbiased. If I were more sophisticated, I’d try to extend Rothstein’s paper to show that dynamic sorting of teachers into high- and low-functioning schools (as anyone can tell you happens in NYC) messes up the models just as badly as dynamic sorting of students does.
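
Roughly, one version of that falsification exercise is to regress students’ 4th grade gains on indicators for the 5th grade teacher they end up with,

$$ \Delta y_{i,4} = \sum_j \pi_j T_{ij} + u_i, $$

where $T_{ij} = 1$ if student $i$ is assigned to teacher $j$ in 5th grade, and then test whether the $\pi_j$ are jointly zero. A 5th grade teacher cannot have caused 4th grade learning, so nonzero $\pi_j$’s signal sorting rather than teacher effects; the analogous exercise, with teachers sorting across schools instead of students sorting across classrooms, is the one I’d want to see.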

I should add that, as someone who taught in NYC schools for 8 years, I don’t think there’s anything wrong with measurement itself. The pre-existing principal-based observation and evaluation system was completely terrible, and it is totally reasonable to combine evaluations with test-based measurement in making decisions. But analytically privileging the results of tests over other forms of evaluation, let alone assigning a percentile to the coefficient on a single (remarkably complex) model and publishing it in the newspapers, is absolutely bonkers.

I’d appreciate comments showing me where I’m wrong.

P.S. Rubinstein wrote a very funny and valuable book called “Reluctant Disciplinarian” about his early years teaching; it’s very much worth a read for new K-12 teachers. I’m not totally sold on his blog, though.

P.P.S. Here’s what I think is the actual model NYCDOE uses.

My reply:

I actually had some connection with the DOE’s assessment project; one of the people involved was Jim Liebman, a law professor with whom I worked on an earlier, unrelated project. My impression was that they were making a lot of compromises in order to create assessments that were, in their view, simple and transparent. As a statistician, I am torn: on one hand, I generally prefer to do more modeling and get the best estimates that I can; on the other hand, I see the appeal of transparency.

I’ve also had several conversations with Jonah Rockoff, an economist who’s done some studies of teacher effects in NYC schools. I have not looked at his analysis in detail but it all looked very impressive to me. I recall Rockoff telling me that there was not a lot of evidence for good teachers sorting into good schools. This is not to say that average teacher effects are zero within each school; rather, I think he looked for aggregate differences between schools (after controlling for student and teacher differences) and didn’t find much.

As to the larger questions, one reason I am sympathetic to measurement (pre-tests and post-tests) is that I don’t do enough of it myself (see my recent ethics column with Eric Loken in Chance magazine). At one point our efforts on the NYC schools were financially supported by a wealthy public-spirited friend of mine; our plan was to increase citizen involvement by opening up the process as much as possible and making data available to parents and teachers as well as to administrators. Somehow none of this happened; I assign much of the blame for this failure to myself, as I did not follow it all up.

In any case, as noted in my earlier post I’ve been surprised that these sorts of quantitative teacher assessments have been such a flop. It makes sense that many teachers unions have opposed them, but even the supporters of these assessments don’t seem to be out there defending them. Perhaps the combination of political opposition and problems with the methods has been too much.

To which Hartog wrote:

My bias is that it is hard to defend processes and measures that almost no parents understand. I’d be curious whether, if NYC had just posted average pretest and posttest scores by teacher instead of a VA percentile, it would have made any difference to its political palatability. Perhaps not.

P.S. Wanna read more? Bill Harris sends along this discussion from Tom Fiddaman, who takes a Deming-esque position (see the last paragraph of the linked blog post).

7 thoughts on “Value-added assessment: What went wrong?”

  1. Shouldn’t there be a lot of discomfort about the difference between effects of causes and causes of effects?

    Distinguishing three _opportunities_:
    1. Predicting whether students who get teachers with these features will do better or worse.
    2. Trying to discern whether, if teachers were counterfactually changed on these features, students would do better (on some average basis).
    3. Trying to discern whether a given teacher had a positive or negative impact on their students (effects of causes).

    Most of the statistics discipline is (I believe rightly) wary of 3, and it’s usually carried out on some adversarial basis (legal action or annual review by one’s manager).

    For the discernment in 2, a recent succinct commentary by John Ioannidis suggests the wider community might soon be catching on to the real challenges here: “Commentary: Adjusting for bias: a user’s guide to performing plastic surgery on meta-analyses of observational studies,” International Journal of Epidemiology 2011;40(3):777-9.

    Also, there often are wider managerial issues to be wary of. My first encounter with this was before I went into statistics, when I was consulting on a database to track public health dentistry. It was a neat opportunity to feed back to dentists whether they were finding fewer cavities – say, on the lower left side – but also to catch them _stealing_ silver filling supplies. One of my tasks was to make sure lawsuits did not result from a premature evaluation of the last possibility.

    But then the new mayor shut down the whole dental health program, apparently because it seemed too left-wing, and I ended up going into a graduate biostatistics program.

  2. “To be partisan for a moment, imagine using a multi-level model to assess the efficacy of basketball players that imposes a team-level random effect: we might easily ‘discover’ that the average player on the Charlotte Hornets was as good as the average player on the Chicago Bulls, when really a) that is an effect of the model design, and b) good players are what makes a team good, just as good teachers are by and large what makes a school good.”

    Although I agree with the point that if teachers are non-randomly sorted into classrooms in the same way that students are, then the inclusion of classroom-level covariates will mask true teacher effects, the basketball analogy is a bit off. In this analogy, the players are more analogous to students, and the value-added analysis would be better applied to the coach. Therefore, the argument that the team is good because the players are good would seem to reinforce the notion that good schools are good because they have good students. A value-added model for the NBA would consider all players to be equal, with the winning and losing dependent only on the value added by the coach.

  3. Agree with @Ed; isn’t there just as much qualitative evidence that it is the *students* that make the school good as there is that it is the *teachers*? Indeed, the non-quantitative, common-sense argumentation that Hartog uses is exactly the kind of thing good statistical analyses are supposed to confront. I don’t disagree, however, with the general point that the way testing is done in school is not good.

  4. Thank you for your response, Andrew.

    David, of course the best way to tell what a student’s score is going to be at the end of the year is to look at their score at the beginning of the year – that’s why you include a pretest as one of your covariates. While I’m no expert on these kinds of models, my point was intended to be quantitative, rather than “common-sensical” (a distinction I rarely attain): if there is an unknown but measurable quantity called “teacher value added,” and this quantity is correlated with the school value added, then adding a covariate for school-wide quality will bias your estimate of teacher value-added.
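
    To spell out the arithmetic (ignoring noise and assuming balanced classes, so this is only a rough sketch): with a school fixed effect, each teacher’s estimate is identified only relative to their own school’s average, roughly $\hat\theta_j \approx \theta_j - \bar\theta_{s(j)}$, so any component of teacher quality that is shared within a school gets credited to the school term rather than to the teachers; a school-level random effect does a partially shrunk version of the same subtraction. If good teachers cluster in good schools, that shared component is exactly what gets thrown away.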

    Ed, I would agree that in the ideal, the students are the players, and the teachers are the coach, though that analogy may suggest a level of investment by students that may be more misleading than helpful.

  5. I think the most important thing you’re all missing is that the Charlotte Hornets haven’t existed for 10 years now! It’s New Orleans Hornets and Charlotte Bobcats these days.
