A New-ish Study Showing Problems with Value-Added Measurements

I haven’t read this yet, but it looks useful:


Published on December 12, 2013 at 10:02 am

Another Study Showing that VAM is Bogus and Unstable

Here is another study that shows that Value-Added measurements for teachers are extremely unstable over time. It’s by one Mercedes K. Schneider, and it was done for Louisiana. You can read all of the details yourself. Here I am going to reproduce a couple of the key tables:

[Image: tables on the stability of VAM ratings]

and I also quote some of her explanation:

“Each number in the table is a percentage of teachers in the study/actual number of teachers who were first ranked one way using 2008-09 student test scores (reading to the left) then ranked either the same way (bolded diagonal) or a different way (all numbers not bolded) using 2009-10 student test scores (reading at the top). For example, the percentage 4.5% (23 teachers) in Table 6 (immediately above this text) represents the percentage of ELA teachers originally ranked in 2008-09 in the top 91-99% (reading to the left) but reranked in 2009-10 in the bottom 1-10% (reading at the top of the column) given that the teachers changed nothing in their teaching. 
“Thus, these two tables represent how poorly the standardized tests classify teachers (2008-09) then reclassify teachers (2009-10) into their original rankings. Tables 5 and 6 are a test of the consistency of using standardized tests to classify teachers. It is like standing on a bathroom scale; reading your weight; stepping off (no change in your weight); then, stepping on the scale again to determine how consistent the scale is at measuring your weight. Thus, if the standardized tests are stable (consistent) measures, they will reclassify teachers into their original rankings with a high level of accuracy. This high level of accuracy is critical if school systems are told they must use standardized tests to determine employment and merit pay decisions. I have bolded the cells on the diagonals of both tables to show just how unstable these two standardized tests are at classifying then correctly reclassifying teachers. If the iLEAP and LEAP-21 were stable, then the bolded percentages on the diagonals of both tables would be very high, almost perfect (99%).
“Here is what we see from the diagonal in Table 5: 
“If a math teacher is originally ranked as the lowest, without altering his or her teaching, the teacher will likely be re-ranked in the lowest category only 26.8% of the time. Conversely, without altering his/her teaching, a math teacher ranked as the highest would likely be re-ranked in the highest group only 45.8% of the time even if she/he continued to teach the same way. (…)
“A math teacher originally ranked in the highest category will be re-ranked in the middle category 35.1% of the time and re-ranked in the lowest category 1.8% of the time. These alterations in ranking are out of the teacher’s control and do not reflect any change in teaching. Even though 1.8% might seem low, notice that in the study alone, this represented 8 math teachers, 8 real human beings, who could potentially lose their jobs and face the stigma of being labeled “low
“As we did for Table 5, let’s consider the diagonal for Table 6:
“If an ELA teacher is originally ranked as the lowest, without altering his or her teaching, the teacher will likely be re-ranked in the lowest category only 22.3% of the time. Conversely, without altering his/her teaching, an ELA teacher ranked as the highest would likely be re-ranked in the highest group only 37.5% of the time even if she/he continued to teach the same way.  (…)
“An ELA teacher originally ranked in the highest category will be re-ranked in the middle category 37.1% of the time and re-ranked in the lowest category 4.5% of the time. These alterations in ranking are out of the teacher’s control and do not reflect any change in teaching.
“Even though 4.5% might seem low, notice that in the study alone, this represented 23 ELA teachers who could potentially lose their jobs and face the stigma of being labeled “low performers.””
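If you want to see how a table like that gets built, here is a minimal Python sketch. The eight teachers, their rankings, and the three category names are made up for illustration; they are not Schneider's data or her exact categories. I am also assuming the table uses row percentages (of the teachers who started in a given category, what share ended up in each category the next year), which is how her explanation reads to me.

```python
from collections import Counter

def transition_table(year1, year2, categories):
    """Row percentages: of the teachers in each year-1 category,
    what share landed in each year-2 category."""
    counts = Counter(zip(year1, year2))
    table = {}
    for start in categories:
        row_total = sum(counts[(start, end)] for end in categories)
        table[start] = {
            end: (100.0 * counts[(start, end)] / row_total if row_total else 0.0)
            for end in categories
        }
    return table

# Hypothetical rankings for eight teachers in two consecutive years.
categories = ["low", "middle", "high"]
year1 = ["low", "low", "middle", "middle", "middle", "high", "high", "high"]
year2 = ["low", "middle", "middle", "high", "low", "high", "middle", "low"]

table = transition_table(year1, year2, categories)
for start in categories:
    print(start, {end: round(pct, 1) for end, pct in table[start].items()})
```

The diagonal entries (low to low, middle to middle, high to high) play the role of the bolded cells in the tables above: a stable measure would put nearly everyone there.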
Your thoughts? To leave a comment, click on the way-too-tiny “Leave a Comment” button below.
Published on January 8, 2013 at 12:04 pm

Law and Statistics at Work in Combating Rhee-Style Educational Deforms

This article is worth reading a couple of times. It makes the point that teachers and their unions need to look into the legal and statistical framework that supposedly upholds Value-Added Methodologies (VAM) and to challenge both.

I agree!


Published on April 18, 2012 at 7:20 am

Teacher VAM scores aren’t stable over time in Florida, either

I quote from a paper studying whether value-added scores for the same teachers tend to be consistent. In other words, does VAM allow us a chance to pick out the crappy teachers and give bonuses to the good ones?

The answer, in complicated language, is essentially NO, but here is how they word it:

“Recently, a number of school districts have begun using measures of teachers’ contributions to student test scores or teacher “value added” to determine salaries and other monetary rewards.

In this paper we investigate the precision of value-added measures by analyzing their inter-temporal stability.

We find that these measures of teacher productivity are only moderately stable over time, with year-to-year correlations in the range of 0.2-0.3.”

Or in plain English, and if you know anything at all about scatter plots and linear correlation, those scores wander all over the place and should never be used to provide any serious evidence about anything. Speculation, perhaps, but not policy or hiring or firing decisions of any sort.
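For anyone who wants a feel for what a year-to-year correlation of 0.2 to 0.3 means in practice, here is a rough simulation sketch in Python. Everything in it is my own assumption rather than anything from the paper: I pretend scores in both years are bell-curve distributed with a chosen correlation r, and then ask what fraction of teachers in the top fifth in year one are still in the top fifth in year two.

```python
import math
import random

def simulate_top_quintile_retention(r, n_teachers=20000, seed=0):
    """Fraction of year-1 top-quintile teachers who are also top-quintile
    in year 2, when the two years' scores have correlation about r.
    Assumes normally distributed scores (a modeling assumption)."""
    rng = random.Random(seed)
    year1 = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # Year 2 keeps a share r of the year-1 signal; the rest is independent noise.
    year2 = [r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1) for x in year1]
    cutoff1 = sorted(year1)[int(0.8 * n_teachers)]
    cutoff2 = sorted(year2)[int(0.8 * n_teachers)]
    top1 = [i for i in range(n_teachers) if year1[i] >= cutoff1]
    stayed = sum(1 for i in top1 if year2[i] >= cutoff2)
    return stayed / len(top1)

for r in (0.0, 0.25, 0.9):
    print(r, round(simulate_top_quintile_retention(r), 2))
```

With r = 0 the scores are pure noise, so about one in five top teachers stays on top by luck alone. The interesting comparison is how little r = 0.25 improves on that, versus a genuinely stable measure like r = 0.9.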

They do say that they have some statistical tricks that allow them to make the correlation look better, but I don’t trust that sort of thing.  It’s not real.

Here’s a table from the paper. Look at those R values, and note that if you square those correlation coefficients (go ahead, use the calculator on your cell phone) you get numbers that are way, way smaller, like what Gary Rubenstein and I reported concerning DCPS and NYCPS.

For your convenience, I circled the highest R value, 0.61, in middle schools on something called the normed FCAT-SSS, whatever that is (go ahead and look it up if it interests you) in Duval County, Florida, one of the places where they had data. I also circled the lowest R value, 0.07, in Palm Beach County, on the FCAT-NRT, whatever that is.

I couldn’t resist, so 0.56^2 is about 0.31 as an r-squared, which is moderate. There is only one score anywhere near that high, 0.56, out of 24 such correlation calculations. The lowest value is 0.07, and if we square that and round it off we get an r-squared value of 0.005, which is shockingly low, essentially none at all.

The median correlation coefficient is about 0.285, which I indicated by circling two adjacent values of 0.28 and 0.29 in green. If you square that value you get r^2=0.08, which is nearly useless. Again.
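If you would rather not punch these into a calculator, the squaring can be checked with a few lines of Python. The three r values are the ones I squared in the text above; the function name and labels are just mine.

```python
def r_squared(r):
    """Share of variance 'explained' when the correlation coefficient is r."""
    return r * r

# The r values squared in the text above.
r_values = {"high": 0.56, "lowest": 0.07, "median": 0.285}

for label, r in r_values.items():
    print(f"{label}: r = {r}, r^2 = {r_squared(r):.4f}")
```

Rounded, those come out to about 0.31, 0.005, and 0.08, matching the figures above.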

I’m really sorry, but even though this paper was published four years ago, it’s still under wraps, or so it says?!?! I’m not supposed to quote from it? Well, to hell with that. It’s important data, for keereissake!

The title and authors are as follows, and perhaps they can forgive me. I don’t know how to contact them anyway. Does anybody have their contact information? Here is the title, credits, and warning:

Daniel F. McCaffrey (The RAND Corporation); Tim R. Sass (Florida State University); J. R. Lockwood (The RAND Corporation)
Original Version: April 9, 2008
This Version: June 27, 2008

*This paper has not been formally reviewed and should not be cited, quoted, reproduced, or retransmitted without the authors’ permission. This material is based on work supported by a supplemental grant to the National Center for Performance Initiatives funded by the United States Department of Education, Institute of Education Sciences. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these organizations.

Published on March 30, 2012 at 3:58 pm

DCPS consultant writes that there is little correlation between principal evaluation scores and VAM (or IVA) scores

I quote from an official DCPS report written by a consultant named Rachel Curtis in the employ of the Aspen Institute:

“DCPS analyzed the relationship between TLF rubric scores and individual teacher value-added scores based on the DC-CAS.

“At this early stage in the use of value-added analysis nationally, the hope is that there is a strong correlation between a teachers’ score on an instructional rubric and his or her value-added score. This would validate the instructional rubric by showing that doing well in instruction produces better student outcomes. DCPS analysis at the end of the first year of IMPACT suggests that there is a modest correlation between the two ratings (0.34).

“DCPS’s correlations are similar to those of other districts that are using both an instructional rubric and value-added data. A moderate correlation suggests that while there is a correlation between the assessment of instruction and student learning as measured by standardized tests (for the most part), it is not strong. At this early stage of using value-added data this is an issue that needs to be further analyzed.”

Ya know, if the educational Deformers running the schools today were honest, they would admit that they’re still working the bugs and kinks out of this weird evaluation system. They would run a few pilot studies here or there, no stakes on anyone, so nobody cheats, and see how it goes. Then either revise it or get rid of it entirely.

Instead, starting in Washington, DC just a few years ago, with Michelle Rhee and Adrian Fenty leading the way locally and obscenely rich financiers funding the entire campaign, they rushed through an elaborate system of secret formulas and rigid rubrics, known as IMPACT. It appears that their goal is to demoralize teachers and to convince the public that public schools need to be closed and turned over to the same hedge fund managers that brought us the current Great Depression, high unemployment rates, and foreclosures. Meanwhile, the gap between the very wealthiest and the rest of the population, especially the bottom 50%, has become truly phenomenal.

Here’s a little table from the report, same page:

(Just so you know, I’ve been giving r^2 in my previous columns, not r. I believe they are using r; to compare that to my previous analyses, if you take 0.34 and square it, you get about 0.1156. That means that the IVA “explains” about 12% of the TLF, and vice versa. Pretty weak stuff.)

Would I be alone in suggesting that the “hope” of a strong correlation has not been fulfilled? In fact, I think that’s a pretty measly correlation, and it suggests to me the possibility that neither the formal TLF evaluation rubrics done by administrators, nor the Individual Value-Added magic secret formulas, do an adequate or even competent job of measuring the output of teachers.

Excellent DCPS Teacher Fired For Low Value-Added Scores — Because of Cheating?

You really need to read this article in today’s Washington Post, by Bill Turque. It describes the situation of Sarah Wysocki, a teacher at MacFarland, who was given excellent evaluations by her administrators during her second year; but since her “Value-Added” scores were low for the second year in a row, she was fired.



Ms. Wysocki raises the possibility that someone cheated at Barnard, the school where a lot of her students had attended the previous year; she said that there were students who scored “advanced” in reading who could, in fact, not read at all.

Curious, I looked at the OSSE data for Barnard and found that the percentages of “advanced” students in grades 3 and 4 had what looks to me to be some rather suspicious drops from SY 2009-10 to SY 2010-11, at a school that apparently has a 70% to 80% free-or-reduced-price lunch population:

Grade 3, reading, 2010: 11% “advanced” but only 3% the next year;

Grade 4, reading, 2010: 29% “advanced”, but only 7% the next year.

Ms. Wysocki raised the accusation of cheating, but, as usual, DCPS administration put a bunch of roadblocks in the way and deliberately failed to investigate.

And naturally, Jason Kamras thinks he’s doing a peachy job and that there is nothing wrong with IMPACT or DC’s method of doing value-added computations.

Published on March 7, 2012 at 10:38 am

Gary Rubenstein Demonstrates That the NYC ‘Value Added’ Measurements are Insane

Gary Rubenstein has two excellent posts where he analyzes what happened with the New York Public School System’s value-added measurements for teachers, which were just released.

He discovered several very important things:

(1) There is almost no correlation between a teacher’s score in 2009 and that same teacher’s score the following year.

(2) There is almost no correlation between a teacher’s score when teaching math and when teaching reading – to the same kids, the same year, and in the same elementary class.

(3) There is almost no correlation between a teacher’s score when teaching different grade levels of the same subject (i.e., Math 6 versus Math 7, and so on).

In other words, the Value Added Methodology is very close to being a true random number generator — which would be great if we were playing some sort of fantasy role-playing game or a board game like Monopoly or Yahtzee. But it’s an utterly ridiculous way to run a school system and to evaluate teachers.
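To show what "almost no correlation" looks like numerically, here is a small Python sketch that computes the ordinary Pearson correlation coefficient r for two lists of scores. The scores are invented: two independent sets of random numbers standing in for a "2009" and a "2010" rating. This is not Gary's data, only a picture of what a true random number generator would produce.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores: each "year" is drawn independently at random.
rng = random.Random(42)
scores_2009 = [rng.uniform(0, 100) for _ in range(5000)]
scores_2010 = [rng.uniform(0, 100) for _ in range(5000)]

print("independent random scores:", round(pearson_r(scores_2009, scores_2010), 3))
print("perfectly consistent scores:", pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

The random pair should come out very close to zero, while the perfectly consistent pair comes out at 1. A useful evaluation system ought to land much nearer the second case than the first.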

I highly recommend reading his two blogs on this topic, which are here (for the first part) and here (for the second part).

After you read them, you need to pass the word (email, word of mouth, Twitter, Like, Facebook, whatever).

We need to kill this value-added mysticism and drive a special wooden stake through its evil, twisted heart.


Published on February 28, 2012 at 11:49 pm

More on VAM (Valueless Abracadabra Mongering)

It’s worth your while to read this article in a paper from New York, concerning a lot of the problems attendant and inherent in the so-called Value Added Measurements.


A few quotes from the article:

“[…] 31 percent of English teachers who ranked in the bottom quintile of teachers in 2007 had jumped to one of the top two quintile by 2008. About 23 percent of math teachers made the same jump.

“There was an overall correlation between how a teacher scored from one year to the next, and for some teachers, the measurement was more stable. Of the math teachers who ranked in the top quintile in 2007, 40 percent retained that crown in 2008.

“The weaknesses of value-added detailed in the report include:

  • “the fact that value-added scores are inherently relative, grading teachers on a curve — and thereby rendering the goal of having only high value-added teachers ‘a technical impossibility,’ as Corcoran writes

  • “the interference of imperfect state tests, which, when swapped with other assessments, can make a teacher who had looked stellar suddenly look subpar

  • “and the challenge of truly eliminating the influence of everything else that happens in a school and a classroom from that ‘unique contribution’ by the teacher”
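The first weakness in that list, grading on a curve, is easy to demonstrate. Here is a minimal Python sketch with 50 made-up teacher scores; the ranking function is my own simple stand-in, not the formula any district actually uses. It sorts teachers into quintiles by rank, then gives every single teacher a 20-point improvement and sorts them again.

```python
def quintile_labels(scores):
    """Label each score with a quintile from 1 (bottom) to 5 (top),
    based only on its rank within the group."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // len(scores) + 1
    return labels

# Hypothetical scores for 50 teachers.
scores = [float(s) for s in range(50, 100)]
# Suppose every teacher improves by the same 20 points.
improved = [s + 20 for s in scores]

before = quintile_labels(scores)
after = quintile_labels(improved)
print("bottom-quintile teachers before:", before.count(1))
print("bottom-quintile teachers after: ", after.count(1))
```

Because the labels depend only on rank, a fifth of the teachers land in the bottom quintile both times, no matter how much everybody improved.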

Published on September 19, 2010 at 4:00 pm