Jacob Hartog writes the following in reaction to my post on the use of value-added modeling for teacher assessment:
What I [Hartog] think has been inadequately discussed is the use of individual model specifications to assign these teacher ratings, rather than the zone of agreement across a broad swath of model specifications.
For example, the model used by NYCDOE doesn’t just control for a student’s prior-year test score (as I think everyone can agree is a good idea). It also assumes that different demographic groups will learn different amounts in a given year, and assigns a school-level random effect. The result is that, as was much ballyhooed at the time of the release of the data, the average teacher rating for a given school is roughly the same, no matter whether the school is performing great or terribly. The headline from this was “excellent teachers spread evenly across the city’s schools,” rather than “the specification of these models assumes that excellent teachers are spread evenly across the city’s schools.”
To be partisan for a moment, imagine using a multi-level model to assess the efficacy of basketball players that imposes a team-level random effect and controls for covariates representing player budget and perceived coaching quality: we might easily ‘discover’ that the average player on the Charlotte Hornets was as good as the average player on the Chicago Bulls, when really a) that is an effect of the model design, and b) good players are what make a team good, just as good teachers are by and large what make a school good. I also think that our baseline assumption should be that if demographic groups are learning different amounts in a given year, it is because they are getting different qualities of education, not because their intrinsic coefficients are different from one another.
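The mechanics here can be seen in a toy simulation (a sketch with made-up numbers, not the actual NYCDOE specification): demeaning by school wipes out exactly the between-school component of teacher quality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 schools x 10 teachers, where good teachers
# really do cluster (schools differ in mean teacher quality).
n_schools, n_teachers = 20, 10
school_mean = rng.normal(0, 1, n_schools)
quality = school_mean[:, None] + rng.normal(0, 0.5, (n_schools, n_teachers))

# Observed teacher-level test gains = true quality + noise.
gains = quality + rng.normal(0, 0.3, (n_schools, n_teachers))

# A specification with a school effect credits the school mean to the
# school, leaving each teacher only a within-school deviation.
school_effect = gains.mean(axis=1, keepdims=True)
rating = gains - school_effect

# Spread of school-average ratings vs. school-average true quality:
print(np.ptp(rating.mean(axis=1)))    # ~0 by construction
print(np.ptp(quality.mean(axis=1)))   # substantial
```

A real random-effects fit shrinks toward the school mean rather than fully demeaning, but the direction is the same: the between-school component of teacher quality gets labeled “school,” so average teacher ratings come out similar across schools whether or not good teachers actually cluster.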
Jesse Rothstein had a good paper a couple of years back showing that, due to dynamic tracking of students, a 5th-grade teacher’s value-added score is often correlated with their students’ 4th-grade progress, which it shouldn’t be if these models are unbiased. If I were more sophisticated, I’d try to extend Rothstein’s paper to show that dynamic sorting of teachers into high- and low-functioning schools (as anyone can tell you happens in NYC) messes up the models just as badly as dynamic sorting of students does.
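Rothstein’s falsification idea can be sketched in simulation. In the toy example below (my construction, not Rothstein’s data or code), every 5th-grade teacher has the same true effect, but classes are formed by tracking students on their 4th-grade gains; a naive value-added measure that controls for the 4th-grade score then correlates with the students’ prior-year progress, which no 5th-grade teacher could have caused:

```python
import numpy as np

rng = np.random.default_rng(1)

n_teachers, class_size = 50, 25
n = n_teachers * class_size

ability = rng.normal(0, 1, n)             # persistent student ability
shock4 = rng.normal(0, 1, n)              # transitory 4th-grade shock
score3 = ability + rng.normal(0, 0.5, n)
score4 = ability + shock4
gain4 = score4 - score3                   # observed 4th-grade "progress"

# Dynamic tracking: 5th-grade classes formed by sorting on 4th-grade gains.
teacher = np.empty(n, dtype=int)
teacher[np.argsort(gain4)] = np.repeat(np.arange(n_teachers), class_size)

# 5th grade: every teacher has the same (zero) true effect,
# and the transitory 4th-grade shock fades out.
score5 = ability + rng.normal(0, 0.5, n)

# Naive value-added: class-average residual from regressing
# 5th-grade scores on 4th-grade scores.
slope, intercept = np.polyfit(score4, score5, 1)
resid = score5 - (slope * score4 + intercept)
va = np.array([resid[teacher == t].mean() for t in range(n_teachers)])
prior = np.array([gain4[teacher == t].mean() for t in range(n_teachers)])

# The estimated "effect" of a 5th-grade teacher predicts gains the
# students made the year before: a sign of bias, not effectiveness.
print(np.corrcoef(va, prior)[0, 1])       # far from zero
```

The mechanism is mean reversion: students tracked into a class because of a transitory good year regress downward, and a model that controls only for the prior-year score hands that reversion to the teacher.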
I should add that, as someone who taught in NYC schools for 8 years, I don’t think there’s anything wrong with measurement per se. The pre-existing principal-based observation and evaluation system was completely terrible, and it is totally reasonable to combine evaluations with test-based measurement in making decisions. But analytically privileging the results of tests over other forms of evaluation, let alone assigning a percentile to the coefficient on a single (remarkably complex) model and publishing it in the newspapers, is absolutely bonkers.
I’d appreciate comments showing me where I’m wrong.
P.S. Rubinstein wrote a very funny and valuable book called “Reluctant Disciplinarian” about his early years teaching, that’s very much worth a read for new K-12 teachers. I’m not totally sold on his blog, though.
P.P.S. Here’s what I think is the actual model NYCDOE uses.
I actually had some connection with the DOE’s assessment project; one of the people involved was Jim Liebman, a law professor with whom I worked on an earlier, unrelated project. My impression was that they were making a lot of compromises in order to create assessments that were, in their view, simple and transparent. As a statistician, I am torn; on one hand, I generally prefer to do more modeling and get the best estimates that I can; on the other hand, I see the appeal of transparency.
I’ve also had several conversations with Jonah Rockoff, an economist who’s done some studies of teacher effects in NYC schools. I have not looked at his analysis in detail but it all looked very impressive to me. I recall Rockoff telling me that there was not a lot of evidence for good teachers sorting into good schools. This is not to say that average teacher effects are zero within each school; rather, I think he looked for aggregate differences between schools (after controlling for student and teacher differences) and didn’t find much.
As to the larger questions, one reason I am sympathetic to measurement (pre-tests and post-tests) is that I don’t do enough of it myself (see my recent ethics column with Eric Loken in Chance magazine). At one point our efforts on the NYC schools were financially supported by a wealthy public-spirited friend of mine; our plan was to increase citizen involvement by opening up the process as much as possible and making data available to parents and teachers as well as to administrators. Somehow none of this happened; I assign much of the blame for this failure to myself, as I did not follow up on all of this.
In any case, as noted in my earlier post, I’ve been surprised that these sorts of quantitative teacher assessments have been such a flop. It makes sense that many teachers unions have opposed them, but even the supporters of these assessments don’t seem to be out there defending them. Perhaps the combination of political opposition and problems with the methods has been too much.
To which Hartog wrote:
My bias is that it is hard to defend processes and measures that almost no parents understand. I’d be curious whether, if NYC had just posted average pretest and posttest scores by teacher instead of a VA percentile, it would have made any difference to its political palatability. Perhaps not.
P.S. Wanna read more? Bill Harris sends along this discussion from Tom Fiddaman, who takes a Deming-esque position (see the last paragraph of the linked blog post).