I finally got around to reading and skimming the MATHEMATICA reports on VAM for schools and individual teachers in DCPS.
Since I had to abbreviate my remarks, here is what I actually said:
I am Guy Brandenburg, retired DCPS mathematics teacher.
To depart from my text, I want to start by proposing a solution: look hard at the collaborative assessment model being used a few miles away in Montgomery County [MD] and follow the advice of Edwards Deming.
Even though I personally retired before [the establishment of the] IMPACT [teacher evaluation system], I want to use statistics and graphs to show that the Value-Added measurements that are used to evaluate teachers are unreliable, invalid, and do not help teachers improve instruction. To the contrary: IVA measurements are driving a number of excellent, veteran teachers to resign or be fired from DCPS to go elsewhere.
Celebrated mathematician John Ewing says that VAM is “mathematical intimidation” and a “modern, mathematical version of the Emperor’s New Clothes.”
One of my colleagues was able to pry the value-added formula [used in DC] from [DC data honcho] Jason Kamras after SIX MONTHS of back-and-forth emails. [Here it is:]
One problem with that formula is that nobody outside a small group of highly-paid consultants has any idea what the values of any of those variables are.
In not a single case has the [DCPS] Office of Data and Accountability sat down with a teacher and explained, in detail, exactly how a teacher’s score is calculated, student by student and class by class.
Nor has that office shared that data with the Washington Teachers’ Union.
I would ask you, Mr. Catania, to ask the Office of Data and Accountability to share with the WTU all IMPACT scores for every single teacher, including all the sub-scores, for every single class a teacher has.
Now let’s look at some statistics.
My first graph is completely random data points that I had Excel make up for me [and plot as x-y pairs].
Notice that even though these are completely random, Excel still found a small correlation: r-squared was about 0.08 and r was about 29%.
Now let’s look at a very strong case of negative correlation in the real world: poverty rates and student achievement in Nebraska:
The next graph is for the same sort of thing in Wisconsin:
Again, quite a strong correlation, just as we see here in Washington, DC:
Now, how about those Value-Added scores? Do they correlate with classroom observations?
Mostly, we don’t know, because the data is kept secret. However, someone leaked to me the IVA and classroom observation scores for [DCPS in] SY 2009-10, and I plotted them [as you can see below].
I would say this looks pretty much like no correlation at all. It certainly gives teachers no assistance on what to improve in order to help their students learn better.
And how stable are Value-Added measurements [in DCPS] over time? Unfortunately, since DCPS keeps all the data hidden, we don’t know how stable these scores are here. However, the New York Times leaked the value-added data for NYC teachers for several years, and we can look at those scores to [find out]. Here is one such graph [showing how the same teachers, in the same schools, scored in 2008-9 versus 2009-10]:
That is very close to random.
How about teachers who teach the same subject to two different grade levels, say, fourth-grade math and fifth-grade math? Again, random points:
One last point:
Mayor Gray and chancellors Henderson and Rhee all claim that education in DC only started improving after mayoral control of the schools, starting in 2007. Look for yourself [in the next two graphs].
Notice that gains began almost 20 years ago, long before mayoral control or chancellors Rhee and Henderson, long before IMPACT.
To repeat, I suggest that we throw out IMPACT and look hard at the ideas of Edwards Deming and the assessment models used in Montgomery County.
Testimony of Guy Brandenburg, retired DCPS mathematics teacher before the DC City Council Committee on Education Roundtable, December 14, 2013 at McKinley Tech
Hello, Mr. Catania, audience members, and any other DC City Council members who may be present. I am a veteran DC math teacher who began teaching in Southeast DC about 35 years ago, and spent my last 15 years of teaching at Alice Deal JHS/MS. I taught everything from remedial 7th grade math through pre-calculus, as well as computer applications.
Among other things, I coached MathCounts teams at Deal and at Francis JHS, with my students often taking first place against all other public, private, and charter schools in the city and going on to compete against other state teams. As a result, I have several boxes full of trophies and some teaching awards.
Since retiring, I have been helping Math for America – DC (which is totally different from Teach for America) in training and mentoring new but highly skilled math teachers in DC public and charter schools; operating a blog that mostly concerns education; teaching astronomy and telescope making as an unpaid volunteer; and also tutoring [as a volunteer] students at the school closest to my house in Brookland, where my daughter attended kindergarten about 25 years ago.
But this testimony is not about me; as a result, I won’t read the previous paragraphs aloud.
My testimony is about how the public is being deceived with bogus statistics into thinking things are getting tremendously better under mayoral control of schools and under the chancellorships of Rhee and Henderson.
In particular, I want to show that the Value-Added measurements that are used to evaluate teachers are unreliable, invalid, and do not help teachers improve their methods of instruction. To the contrary: IVA measurements are driving a number of excellent, veteran teachers to resign or be fired from DCPS to go elsewhere.
I will try to show this mostly with graphs made by me and others, because in statistics, a good scatter plot is worth many a word or formula.
John Ewing, who is the president of Math for America and is a former executive director of the American Mathematical Society, wrote that VAM is “mathematical intimidation” and not reliable. I quote:
In case you were wondering how the formula goes, this is all that one of my colleagues was able to pry from Jason Kamras after SIX MONTHS of back-and-forth emails asking for additional information:
One problem with that formula is that nobody outside a small group of highly-paid consultants has any idea what the values of any of those variables are. What’s more, many of those variables are composed of lists or matrices (“vectors”) of other variables.
In not a single case has the Office of Data and Accountability sat down with a teacher and explained, in detail, exactly how a teacher’s score is calculated, student by student, class by class, test score by test score.
Nor has that office shared that data with the Washington Teachers’ Union.
It’s the mathematics of intimidation, lack of accountability, and obfuscation.
I would ask you, Mr. Catania, to ask the Office of Data and Accountability to share with the WTU all IMPACT scores for every single teacher, including all the sub-scores, such as those for IVA and classroom observations.
To put a personal touch to my data, one of my former Deal colleagues shared with me that she resigned from DCPS specifically because her IVA scores kept bouncing around with no apparent reason. In fact, the year that she thought she did her very best job ever in her entire career – that’s when she earned her lowest value-added score. She now teaches in Montgomery County and recently earned the distinction of becoming a National Board Certified teacher – a loss for DCPS students, but a gain for those in Maryland.
Bill Turque of the Washington Post documented the case of Sarah Wysocki, an excellent teacher with outstanding classroom observation results, who was fired by DCPS for low IVA scores. She is now happily teaching in Virginia. I am positive that these two examples can be multiplied many times over.
Now let’s look at some statistics. As I mentioned, in many cases, pictures and graphs speak more clearly than words or numbers or equations.
My first graph is of completely random data points that should show absolutely no correlation with each other, meaning they are not linked to each other in any way. I had my Excel spreadsheet make two lists of random numbers, and I plotted those as the x- and y-variables on the following graph.
I also asked Excel to draw a line of best fit and to calculate the correlation coefficient R and R-squared. It did so; as you can see, R-squared is very low, about 0.08 (eight percent). R, the square root of R-squared, is about 29 percent.
Remember, those are completely random numbers generated by Excel.
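The same experiment is easy to repeat outside Excel. Here is a minimal Python sketch, under stated assumptions: the number of points (30) and the random seed are my own choices, since the testimony does not say how many points were plotted, and `pearson_r` is a helper written for this illustration rather than anything Excel exposes. It only shows that two unrelated lists of random numbers will usually produce an r that is small but not exactly zero.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2013)  # arbitrary seed, only so the run is repeatable
n_points = 30              # assumed; not stated in the testimony

# Two lists of random numbers that have nothing to do with each other.
xs = [rng.random() for _ in range(n_points)]
ys = [rng.random() for _ in range(n_points)]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

Change the seed and rerun it a few times: r wanders above and below zero, which is the point of the first graph.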
Now let’s look at a very strong correlation of real numbers: poverty rates and student achievement in a number of states. The first one is for Nebraska.
R would be about 94% in this case – a very strong correlation indeed.
The next one is for Wisconsin:
Again, quite a strong correlation – a negative one: the poorer the student body, the lower the average achievement, which we see repeated in every state and every country in the world. Including DC, as you can see here:
Now, how about those Value-Added scores? Do they correlate with classroom observations?
Mostly, we don’t know, because the data is kept secret. However, someone leaked to me the IVA and classroom observation scores for all DCPS teachers for SY 2009-10, and I plotted them. Is this a strong correlation, or not?
I would say this looks pretty much like no correlation at all. What on earth are these two things measuring? It certainly gives teachers no assistance on what to improve in order to help their students learn better.
And how stable are Value-Added measurements over time? If they are stable, that would mean that we might be able to use them to weed out the teachers who consistently score at the bottom, and reward those who consistently score at the top.
Unfortunately, since DCPS keeps all the data hidden, we don’t exactly know how stable these scores are here. However, the New York Times leaked the value-added data for NYC teachers for several years, and we can look at those scores to see.
Here is one such graph:
That is very close to random.
How about teachers who teach the same subject to two different grade levels (say, fourth-grade math and fifth-grade math)? Again, random points:
One thing that all veteran teachers agree on is that they stunk at their job during their first year and got a lot better their second year. This should show up on value-added graphs of year 1 versus year 2 scores for the same teachers, right?
Take a look:
One last point:
Mayor Gray and chancellors Henderson and Rhee all claim that education in DC only started improving after mayoral control of the schools, starting in 2007.
Graphs and the NAEP show a different story. We won’t know until next week how DCPS and the charter schools did, separately, for 2013, but the following graphs show that reading and math scores for DC fourth- and eighth-graders have been rising fairly steadily for nearly twenty years, or long before mayoral control or the appointments of our two chancellors (Rhee and Henderson).
I haven’t read this yet, but it looks useful:
I recall many discussions in the rightwing think tanks to which I once belonged about how the schools and the teaching profession would be elevated if we could only judge teachers by the performance of their students and fire the lowest performing teachers every year. I recall asking, “where will the new teachers come from?” My colleagues said there would never be a shortage because there are so many people who prepared to be teachers but never entered the classroom. They would rush to fill the newly available jobs. What they never considered was the possibility that their brilliant theory was wrong. That judging teachers by the test scores of their students was unreliable and invalid; that doing so would drive out many fine teachers and make teaching an undesirable profession; would indeed wipe out professionalism itself.
From a comment on the blog:
50% of evaluation based on end of course testing is so demotivating and humiliating that I am definitely getting out of teaching asap. Two years of bad test scores means suspension and potential loss of license. Seventy hour work weeks, failing technology, rotating cast of half my class load with various medical conditions that impede cognitive function. Adaptable, hard working, using differentiated learning and hands on learning/multimodal approaches does not mean jack now. Teachers are not able to control the tests, cannot develop multiple means for students to demonstrate mastery. So half my well meaning students will christmas tree their end of course test and my own family will suffer the consequences when I lose my job. Bleaker future than the past five with consistent pay cuts and benefits cut. Furloughs are a yearly experience now. I am very well educated and a top graduate in my field and hold multiple degrees so the stereotype of the poorly educated teacher without options or abilities does not fit. It doesn’t fit for the majority of teachers I know.
This evaluation system is the last straw. I cajoled PTA parents to put pressure on our district to stop this eval system. There are several well respected anchor teachers who are now making tracks to change fields. What a waste. New administration is in love with drill and kill, parents are blinded by smoke and mirrors of test scores as a metric of anything.
Here is another study that shows that Value-Added measurements for teachers are extremely unstable over time. It’s by one Mercedes K. Schneider, and it was done for Louisiana. You can read all of the details yourself. Here I am going to reproduce a couple of the key tables:
and I also quote some of her explanation:
I quote from a paper studying whether value-added scores for the same teachers tend to be consistent. In other words, does VAM allow us a chance to pick out the crappy teachers and give bonuses to the good ones?
The answer, in complicated language, is essentially NO, but here is how they word it:
“Recently, a number of school districts have begun using measures of teachers’ contributions to student test scores or teacher “value added” to determine salaries and other monetary rewards.
In this paper we investigate the precision of value-added measures by analyzing their inter-temporal stability.
We find that these measures of teacher productivity are only moderately stable over time, with year-to-year correlations in the range of 0.2-0.3.”
Or in plain English, and if you know anything at all about scatter plots and linear correlation, those scores wander all over the place and should never be used to provide any serious evidence about anything. Speculation, perhaps, but not policy or hiring or firing decisions of any sort.
They do say that they have some statistical tricks that allow them to make the correlation look better, but I don’t trust that sort of thing. It’s not real.
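To get a feel for what a year-to-year correlation in the 0.2-0.3 range means, here is a toy simulation in Python. Everything in it is an assumption made for illustration: each hypothetical teacher gets a fixed "true effect", each year's score is that effect plus independent noise, and the noise is set 1.6 times as large as the spread of true effects so that the expected correlation lands near 0.28. It is not the model used in the paper, in DCPS, or in New York.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)   # arbitrary seed so the run is repeatable
n_teachers = 5000        # hypothetical
sd_true = 1.0            # assumed spread of real teacher effects
sd_noise = 1.6           # assumed spread of year-specific noise

true_effect = [rng.gauss(0, sd_true) for _ in range(n_teachers)]
year1 = [t + rng.gauss(0, sd_noise) for t in true_effect]
year2 = [t + rng.gauss(0, sd_noise) for t in true_effect]

r = pearson_r(year1, year2)
# For this toy model the expected correlation is signal / (signal + noise).
expected_r = sd_true ** 2 / (sd_true ** 2 + sd_noise ** 2)

# Of the teachers in the bottom fifth in year 1, how many are there again in year 2?
cut1 = sorted(year1)[n_teachers // 5]
cut2 = sorted(year2)[n_teachers // 5]
bottom1 = [i for i in range(n_teachers) if year1[i] < cut1]
repeat = sum(1 for i in bottom1 if year2[i] < cut2)

print(f"simulated r = {r:.2f} (toy model predicts about {expected_r:.2f})")
print(f"{repeat} of {len(bottom1)} bottom-fifth teachers are bottom-fifth again")
```

In this toy setup the same teachers, with unchanged underlying effects, get shuffled between "bottom" and "not bottom" from one year to the next purely because of the noise.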
Here’s a table from the paper. Look at those R values, and note that if you squared those correlation constants (go ahead, use your calculator on your cell phone) you get numbers that are way, way smaller – like what I and Gary Rubenstein reported concerning DCPS and NYCPS.
For your convenience, I circled the highest R value, 0.61, in middle schools on something called the normed FCAT-SSS, whatever that is (go ahead and look it up if it interests you) in Duval county, Florida, one of the places where they had data. I also circled the lowest R value, 0.07, in Palm Beach county, on the FCAT-NRT, whatever that is.
I couldn’t resist, so 0.56^2 is about 0.31 as an r-squared, which is moderate. There is only one score anywhere near as high as 0.56, out of 24 such correlation calculations. The lowest value is 0.07, and if we square that and round it off we get an r-squared value of 0.005, shockingly low, essentially none at all.
The median correlation constant is about 0.285, which I indicated by circling two adjacent values of 0.28 and 0.29 in green. If you square that value you get r^2=0.08, which is nearly useless. Again.
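If you would rather not use the calculator on your cell phone, the squaring can be done in a few lines of Python. This is only a sketch of the arithmetic: the four r values are the ones mentioned in the text around the table, the labels are mine, and `r_squared` is a helper defined here for illustration. Recall that r-squared is the share of the variation in one year's scores that a straight-line fit on the other year's scores accounts for.

```python
# Correlation coefficients (r) mentioned in the discussion of the table.
# The labels are descriptive only and are not taken from the paper.
r_values = {
    "highest circled value": 0.61,
    "value squared in the text": 0.56,
    "median (between 0.28 and 0.29)": 0.285,
    "lowest circled value": 0.07,
}


def r_squared(r):
    """Square a correlation coefficient to get r-squared."""
    return r * r


for label, r in r_values.items():
    print(f"{label}: r = {r:.3f}, r^2 = {r_squared(r):.4f}")
```

The results come out to roughly 0.37, 0.31, 0.08 and 0.005 respectively.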
I’m really sorry, but even though this paper was published four years ago, it’s still under wraps, or so it says?!?! I’m not supposed to quote from it? Well, to hell with that. It’s important data, for keereissake!
The title and authors are as follows, and perhaps they can forgive me. I don’t know how to contact them anyway. Does anybody have their contact information? Here is the title, credits, and warning:
THE INTERTEMPORAL STABILITY OF TEACHER EFFECT ESTIMATES *
Daniel F. McCaffrey; Tim R. Sass; J. R. Lockwood
The RAND Corporation; Florida State University; The RAND Corporation
Original Version: April 9, 2008
This Version: June 27, 2008
*This paper has not been formally reviewed and should not be cited, quoted, reproduced, or retransmitted without the authors’ permission. This material is based on work supported by a supplemental grant to the National Center for Performance Initiatives funded by the United States Department of Education, Institute of Education Sciences. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these organizations.
A number of academic researchers have shown that ‘value-added’ methodologies are nearly useless in providing any information that principals, teachers, and other policy makers can actually use to make any decisions whatsoever. One such researcher is Sean Corcoran. I reprint here a press release from two years ago, sponsored by the Annenberg Institute (link to original document).
Unfortunately, the links in the press release don’t seem to work. Here is a link to the full report by Dr. Corcoran.
September 16, 2010
VALUE-ADDED TEACHER ASSESSMENTS EARN LOW GRADES FROM NYU ECONOMIST
Sean Patrick Corcoran
New York University
Annenberg Institute for School Reform
NEW YORK — Value-added assessments of teacher effectiveness are a “crude indicator” of the contribution that teachers make to their students’ academic outcomes, asserts Sean P. Corcoran, assistant professor of educational economics at New York University’s Steinhardt School of Culture, Education and Human Development, and research fellow at the Institute for Education and Social Policy, in a paper issued today as part of the “Education Policy for Action” series of research and policy analyses by scholars convened by the Annenberg Institute for School Reform at Brown University.
“The promise that value-added systems can provide a precise, meaningful and comprehensive picture is much overblown,” argues Corcoran whose research report is entitled Can Teachers be Evaluated by Their Students’ Test Scores? Should they Be? The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice. “Teachers, policy-makers and school leaders should not be seduced by the elegant simplicity of value-added measures. Policy-makers, in particular, should be fully aware of their limitations and consider whether their minimal benefits outweigh their cost.”
> To view the entire report visit
> The “Education Policy for Action” series is funded by a grant from the Robert Sterling Clark Foundation.
Value-added models — the centerpiece of a national movement to evaluate, promote, compensate and dismiss teachers based in part on their students’ test scores — have proponents throughout the country, including school systems in New York City, Chicago, Houston and Washington, D.C. In theory, a teacher’s “value-added” is the unique contribution he or she makes to students’ achievement that cannot be attributed to any other current or past student, family, teacher, school, peer or community influence. In practice, states Corcoran, it is exceptionally difficult to isolate a teacher’s unique effect on academic achievement.
“The successful use of value-added requires a high level of confidence in the attribution of achievement gains to specific teachers,” he says. “Given one year of test scores, it’s impossible to distinguish between the teacher’s effect and other classroom-specific factors. Over many years, the effects of other factors average out, making it easier to infer a teacher’s impact. But this is little comfort to a teacher or school leader searching for actionable information today.”
In October 2009, the National Academies’ National Research Council issued a statement that applauded the Department of Education’s proposed use of assessment systems that link student achievement to teachers in Race to the Top initiatives, but cautioned the use of value-added approaches for evaluation purposes, citing that “too little research has been done on these methods’ validity to base high-stakes decisions about teachers on them.”
Corcoran’s research examines the value-added systems used in New York City’s Teacher Data Reports, and Houston’s ASPIRE program (Accelerating Student Progress, Increasing Results and Expectations). Among his concerns surrounding them, he concludes that the standardized tests used to support these systems are inappropriate for value-added measurement.
“Value-added assessment works best when students are able to receive a single numeric test score every year on a continuous developmental scale,” states Corcoran, meaning that the scale does not depend on grade-specific content but rather progresses across grade levels. Neither the Texas nor New York state test was designed on such a scale. Moreover, the set of skills and subjects that can be adequately assessed in this way is remarkably small, he argues, suggesting that value-added systems will ignore much of the work teachers do.
“Not all subjects are or can be tested, and even within tested subject areas, only certain skills readily conform to standardized testing,” he says. “Despite that, value-added measures depend exclusively on such tests. State tests are often predictable in both content and format, and value-added rankings will tend to reward those who take the time to master the predictability of the test.”
In practice, the biggest obstacle to value-added assessments is their high level of imprecision, he argues.
“A teacher ranked in the 43rd percentile on New York City’s Teacher Data Report may have a range of possible rankings from the 15th to the 71st percentile after taking statistical uncertainty into account,” says Corcoran. He finds that the majority of teachers in New York City’s Teacher Data Reports cannot be statistically distinguished from the 60 percent or more of other teachers in the district.
“With this level of uncertainty, one cannot differentiate between below average, average, and above average teachers with confidence. At the end of the day, it isn’t clear what teachers and their principals are supposed to do with this information.”
Corcoran grants that some uncertainty is inevitable in value-added measurement but questions whether value-added measures are precise enough to be useful in high-stakes decision-making or even for professional development. Novice teachers have the most to gain from performance feedback, he contends, yet value-added scores for these teachers are the least reliable.
The notion that a statistical model could isolate each teacher’s unique contribution to their students’ educational outcomes is a powerful one, acknowledges Corcoran. With such information, one could not only devise systems that reward teachers with demonstrated records of classroom success and remove teachers who do not, but also create a school climate in which teachers and principals work constructively with their test results to make positive instructional and organizational changes.
“Few can deny the intuitive appeal of these tools,” says Corcoran. “Teacher quality is an immensely important resource, and research has found that teachers can and do vary in their effectiveness. However, these evaluation tools have limitations and shortcomings that are not understood or apparent to interested stakeholders, or even to value-added advocates.”
Adds Corcoran: “Research on value-added remains in its infancy, and it is likely that these methods — and the tests on which they are based — will continue to improve over time. The simple fact that teachers and principals are receiving regular and timely feedback on their students’ achievement is an accomplishment in and of itself. It’s hard to argue that stimulating conversation around improving student achievement is not a positive thing, but teachers, policy-makers, and school leaders should not be seduced by the simplicity of value added.”
# # #
© Annenberg Institute for School Reform
Both show the lack of correlation between a teacher’s score on the exceedingly complex Teaching and Learning Framework classroom observations on the one hand, and that teacher’s score on the Individual Value-Added measurement scheme in math or reading or both, depending on what subject(s) and grade levels they taught.
Gary’s graph is, of course, populated by lots of bright red triangles; mine has little blue squares. His grid is missing vertical lines, so mine is clearly better. (joke !) But look even more carefully – you can see that the individual triangles and squares are in the identical places.
This shows that Excel, when given the same data, will produce much the same graph.
It’s really easy to do, by the way. You should try it. Here is the original data table.
The Correlation Between ‘Value-Added’ Scores and Observation Scores in DCPS under IMPACT is, in fact, Exceedingly Weak
As I suspected, there is nearly no correlation between the scores obtained by DCPS teachers on two critical measures.
I know this because someone leaked me a copy of the entire summary spreadsheet, which I will post on the web at Google Docs shortly.
As usual, a scatter plot does an excellent job of showing how ridiculous the entire IMPACT evaluation system is. It doesn’t predict anything to speak of.
Here is the first graph.
Notice that the r^2 value is quite low: 0.1233, or about 12%. Not quite a random distribution, but fairly close. Certainly not something that should be used to decide whether someone gets to keep their job or earn a bonus.
The Aspen Institute study apparently used R rather than r^2; they reported R values of about 0.35, which is about what you get when you take the square root of 0.1233.
Here is the second graph, which plots teachers’ ranks on the classroom observations versus their ranks on the Value Added scores. Do you see any correlation?
Remember, this is the correlation that Jason Kamras said was quite strong.
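A rank-versus-rank plot like the second graph can also be summarized with a single number, the rank correlation (Spearman's rho), which is just the ordinary correlation computed on the ranks instead of the raw scores. Here is a minimal Python sketch. The eight pairs of scores are made up for illustration and are NOT the leaked DCPS data; `pearson_r`, `average_ranks` and `spearman_rho` are helpers written for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 (lowest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Made-up scores for eight hypothetical teachers (illustration only).
observation_scores = [3.1, 2.4, 3.8, 2.9, 3.5, 2.2, 3.3, 2.7]
value_added_scores = [2.0, 3.4, 2.6, 3.9, 1.8, 2.9, 3.6, 2.3]

rho = spearman_rho(observation_scores, value_added_scores)
print(f"rank correlation = {rho:.2f}")
```

With real data you would paste the two columns from the spreadsheet in place of the made-up lists; a rho near zero means a teacher's rank on one measure tells you almost nothing about their rank on the other.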