A recent public letter to Arne Duncan points out two key weaknesses in IMPACT, which is the evaluation scheme dreamed up by Jason Kamras and Michelle Rhee:
(1) The correlation between the observational scores and the test-change scores is only 0.34, which is extremely low.
(2) Since IMPACT has been in existence, student test gains in DC have actually decreased.
Here is the applicable part of the letter:
Although there are no formal studies connecting educator evaluation systems that use test score growth data with learning outcomes, there are two recently published reviews of the Washington DC teacher evaluation system IMPACT, which is in some ways the prototype of test-score-based evaluation of educators. It has been in existence for two years. When student scores are used for IMPACT teacher evaluations, (6) its design looks similar to many of the state-adopted models that rely heavily on student scores. For this reason, we thought a closer look might provide insight into educator evaluation systems and in particular the relationship between the two main types of evaluation used in IMPACT: observations and test scores.
What we found out gave us cause for concern.
Ideally, there would be a strong correlation between a teacher’s value-added score and the score derived from careful observations. A correlation of 0.60 and above is generally accepted to be a strong relationship. This would mean that the district is measuring, with each part of the evaluation, something akin to true teacher quality. Yet the DC IMPACT program showed a relationship of only 0.34 between teacher value-added scores and the scores from evaluations (primarily observational) linked to the district’s Teaching and Learning Framework observation scores. This “modest correlation” concern was raised in an evaluative report of IMPACT published by the Aspen Institute. (7)
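As a rough aside on what a correlation of 0.34 implies: for a simple linear relationship, the square of the correlation gives the share of variance in one measure that is associated with the other. The sketch below only squares the two figures quoted in the letter (0.34 and the 0.60 "strong" threshold). The function name `shared_variance` is made up for illustration, and this is not a calculation taken from the letter or the Aspen report.

```python
def shared_variance(r: float) -> float:
    """Return r squared, the share of variance two measures have in common
    under a simple linear relationship. Hypothetical helper for illustration."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Figures quoted in the letter: the reported IMPACT correlation and the
# threshold the letter calls a "strong relationship".
for label, r in [("IMPACT, as reported", 0.34), ("'strong' threshold", 0.60)]:
    print(f"{label}: r = {r:.2f}, r^2 = {shared_variance(r):.2f}")
```

On these figures, a correlation of 0.34 corresponds to roughly 12 percent shared variance, against 36 percent at the 0.60 threshold.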
At one level, this relatively weak relationship between the two components of the IMPACT evaluation is testimony to the district’s wisdom in incorporating both elements in the evaluative system. But at another level, it raises red flags about the reliability and validity of one or both.
Indeed, this is not the first time a lack of a strong relationship was found. A prominent peer-reviewed article published a few months ago found that teachers with ineffective teaching skills nevertheless might have strong VAM scores, especially if they taught high-achieving students. (8)
As a practical matter, this means that some teachers will receive bonuses when they should not, others will not receive bonuses when they should, and still others might be unfairly dismissed—to the detriment of students as well as the teachers themselves. Further, because higher growth scores are correlated to students who enter the class with higher achievement, this system creates a disincentive to teach those with greater disadvantages. That is, even models like DC’s that attempt to control for prior achievement fail to capture the full effect of ongoing advantages and disadvantages.
In light of these concerns, we next looked at the associated student scores since IMPACT was enacted. One would expect that if the system were effective we would see an accelerated increase in student scores as teaching improved due to training, coaching, evaluation and the pressure on teachers to increase test scores. This was not the case.
In 2007, only 37.5% of all DC elementary students (through grade 6) were proficient in reading and 29.3% were proficient in mathematics. By 2009, the final year prior to IMPACT, the percentage of elementary students who were proficient was 49% in both reading and mathematics. There was a dramatic increase between 2007 and 2008, followed by an additional year of growth. However, two years later (during the IMPACT years) half of those gains were lost. The percentage of students proficient in reading and mathematics fell to 43% and 42.3%.
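The arithmetic behind the elementary figures can be checked directly. The sketch below uses only the percentages stated in the paragraph above. It assumes that "49%" means 49.0 for both subjects, and the labels `start`, `pre_impact` and `post_impact` are illustrative names, not terms from the letter.

```python
# Percent of DC elementary students proficient, as quoted in the letter.
# Assumption: "49%" is treated as exactly 49.0 for both subjects.
scores = {
    "reading": {"start": 37.5, "pre_impact": 49.0, "post_impact": 43.0},
    "math": {"start": 29.3, "pre_impact": 49.0, "post_impact": 42.3},
}


def gain_and_loss(series: dict) -> tuple:
    """Return (gain before IMPACT, loss during IMPACT, loss as a share of gain),
    with gain and loss measured in percentage points."""
    gain = series["pre_impact"] - series["start"]
    loss = series["pre_impact"] - series["post_impact"]
    return gain, loss, loss / gain


for subject, series in scores.items():
    gain, loss, share = gain_and_loss(series)
    print(f"{subject}: gained {gain:.1f} points, then lost {loss:.1f} "
          f"({share:.0%} of the earlier gain)")
```

On these numbers, reading gave back about half of its 2007 to 2009 gain (6.0 of 11.5 points) and math about a third (6.7 of 19.7 points), so the letter's "half" fits reading more closely than math.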
Between 2007 and 2009 the percentage of students proficient in secondary reading increased by over 10 percentage points; the increase in math was nearly 13 percentage points. However, during the IMPACT years there was only a 4% increase in students proficient in reading and a 7% increase in math. Although secondary students did not lose ground during the IMPACT years, progress decelerated.
We note in particular that evaluations based on student test score growth would be more common in elementary schools, covering classroom teachers in grades 4 through 6, as opposed to secondary schools, where that component is only applicable in grades 7 and 8 and even then only for reading and math teachers. And we note that the post-IMPACT results are worse at the elementary level.
These are correlational results, and we cannot make any causal inferences or claims. In fact, a good argument could be made that IMPACT’s effects – good or ill – would not likely be felt so quickly. (9) However, a sound research design attached to pilot programs could carefully address all these issues. Certainly the data we found suggest nothing to be enthusiastic about, even though IMPACT is only one of many factors that may affect scores. Put another way, wouldn’t the children in New York, Colorado, and other states moving toward such systems benefit from solid research evidence from DC, particularly if IMPACT is indeed having a negative effect? And put yet another way, if teachers are being evaluated and dismissed based on the IMPACT data, shouldn’t the program itself also be subject to a summative evaluation? Shouldn’t such evaluations take place before any scaling up of this experimental policy?
(6) For reading and math teachers in grades 4 through 8, the student growth component of the evaluation is set at 50 percent. For other educators, various other evaluation components take on greater importance.
(7) Curtis, R. (2011). District of Columbia Public Schools: Defining Instructional Expectations and Aligning Accountability and Support. Washington DC: The Aspen Institute. (Page 22.)
(8) Hill, H. C., Kapitula, L., and Umland, K. (2011). A Validity Argument Approach to Evaluating Teacher Value-Added Scores. American Educational Research Journal, 48(3), 794-831.
(9) An exception might be the immediate effects on school environment and working conditions.