Texas Decision Slams Value Added Measurements

And it does so for many of the reasons that I have been advocating. I am going to quote the entirety of Diane Ravitch’s column on this:


Audrey Amrein-Beardsley of Arizona State University is one of the nation’s most prominent scholars of teacher evaluation. She is especially critical of VAM (value-added measurement); she has studied TVAAS, EVAAS, and other similar metrics and found them deeply flawed. She has testified frequently in court cases as an expert witness.

In this post, she analyzes the court decision that blocks the use of VAM to evaluate teachers in Houston. The misuse of VAM was especially egregious in Houston, which terminated 221 teachers in one year, based on their VAM scores.

This is a very important article. Amrein-Beardsley and Jesse Rothstein of the University of California testified on behalf of the teachers; Tom Kane (who led the Gates’ Measures of Effective Teaching (MET) Study) and John Friedman (of the notorious Chetty-Friedman-Rockoff study) testified on behalf of the district.

Amrein-Beardsley writes:

Of primary issue will be the following (as taken from Judge Smith’s Summary Judgment released yesterday): “Plaintiffs [will continue to] challenge the use of EVAAS under various aspects of the Fourteenth Amendment, including: (1) procedural due process, due to lack of sufficient information to meaningfully challenge terminations based on low EVAAS scores,” and given “due process is designed to foster government decision-making that is both fair and accurate.”

Related, and of most importance, as also taken directly from Judge Smith’s Summary, he wrote:

HISD’s value-added appraisal system poses a realistic threat to deprive plaintiffs of constitutionally protected property interests in employment.

HISD does not itself calculate the EVAAS score for any of its teachers. Instead, that task is delegated to its third party vendor, SAS. The scores are generated by complex algorithms, employing “sophisticated software and many layers of calculations.” SAS treats these algorithms and software as trade secrets, refusing to divulge them to either HISD or the teachers themselves. HISD has admitted that it does not itself verify or audit the EVAAS scores received from SAS, nor does it engage any contractor to do so. HISD further concedes that any effort by teachers to replicate their own scores, with the limited information available to them, will necessarily fail. This has been confirmed by plaintiffs’ expert, who was unable to replicate the scores despite being given far greater access to the underlying computer codes than is available to an individual teacher [emphasis added, as also related to a prior post about how SAS claimed that plaintiffs violated SAS’s protective order (protecting its trade secrets), that the court overruled, see here].

The EVAAS score might be erroneously calculated for any number of reasons, ranging from data-entry mistakes to glitches in the computer code itself. Algorithms are human creations, and subject to error like any other human endeavor. HISD has acknowledged that mistakes can occur in calculating a teacher’s EVAAS score; moreover, even when a mistake is found in a particular teacher’s score, it will not be promptly corrected. As HISD candidly explained in response to a frequently asked question, “Why can’t my value-added analysis be recalculated?”:

Once completed, any re-analysis can only occur at the system level. What this means is that if we change information for one teacher, we would have to re-run the analysis for the entire district, which has two effects: one, this would be very costly for the district, as the analysis itself would have to be paid for again; and two, this re-analysis has the potential to change all other teachers’ reports.

The remarkable thing about this passage is not simply that cost considerations trump accuracy in teacher evaluations, troubling as that might be. Of greater concern is the house-of-cards fragility of the EVAAS system, where the wrong score of a single teacher could alter the scores of every other teacher in the district. This interconnectivity means that the accuracy of one score hinges upon the accuracy of all. Thus, without access to data supporting all teacher scores, any teacher facing discharge for a low value-added score will necessarily be unable to verify that her own score is error-free.

HISD’s own discovery responses and witnesses concede that an HISD teacher is unable to verify or replicate his EVAAS score based on the limited information provided by HISD.

According to the unrebutted testimony of plaintiffs’ expert, without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious “black box,” impervious to challenge.

While conceding that a teacher’s EVAAS score cannot be independently verified, HISD argues that the Constitution does not require the ability to replicate EVAAS scores “down to the last decimal point.” But EVAAS scores are calculated to the second decimal place, so an error as small as one hundredth of a point could spell the difference between a positive or negative EVAAS effectiveness rating, with serious consequences for the affected teacher.

Hence, “When a public agency adopts a policy of making high stakes employment decisions based on secret algorithms incompatible with minimum due process, the proper remedy is to overturn the policy.”

What I actually had time to say …

Since I had to abbreviate my remarks, here is what I actually said:

I am Guy Brandenburg, retired DCPS mathematics teacher.

To depart from my text, I want to start by proposing a solution: look hard at the collaborative assessment model being used a few miles away in Montgomery County [MD] and follow the advice of Edwards Deming.

Even though I personally retired before [the establishment of the] IMPACT [teacher evaluation system], I want to use statistics and graphs to show that the Value-Added measurements that are used to evaluate teachers are unreliable, invalid, and do not help teachers improve instruction. To the contrary: IVA measurements are driving a number of excellent, veteran teachers to resign or be fired from DCPS to go elsewhere.

Celebrated mathematician John Ewing says that VAM is “mathematical intimidation” and a “modern, mathematical version of the Emperor’s New Clothes.”

I agree.

One of my colleagues was able to pry the value-added formula [used in DC] from [DC data honcho] Jason Kamras after SIX MONTHS of back-and-forth emails. [Here it is:]

[Image: the value-added formula for DCPS, in MathType format]

One problem with that formula is that nobody outside a small group of highly-paid consultants has any idea what the values of any of those variables are.

In not a single case has the [DCPS] Office of Data and Accountability sat down with a teacher and explained, in detail, exactly how a teacher’s score is calculated, student by student and class by class.

Nor has that office shared that data with the Washington Teachers’ Union.

I would ask you, Mr. Catania, to ask the Office of Data and Accountability to share with the WTU all IMPACT scores for every single teacher, including all the sub-scores, for every single class a teacher has.

Now let’s look at some statistics.

My first graph shows completely random data points that I had Excel make up for me [and plot as x-y pairs].

[Image: scatterplot of completely random points]

Notice that even though these are completely random, Excel still found a small correlation: r-squared was about 0.08 and r was about 29%.
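For anyone who wants to try this at home, here is a minimal sketch of the same exercise in Python rather than Excel. The sample size of 25 points is just an assumption (I no longer have the original spreadsheet); the point is that a small batch of pure noise will still cough up a nonzero r and r-squared.

```python
# Minimal sketch: generate completely random (x, y) pairs and see what
# correlation falls out by chance.  The sample size of 25 is an assumption.
import random

random.seed(1)  # any seed; without one the result varies run to run
n = 25
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
var_y = sum((y - mean_y) ** 2 for y in ys)

r = cov / (var_x * var_y) ** 0.5  # Pearson correlation coefficient
print(f"r = {r:.3f}, r-squared = {r*r:.3f}")
# With pure noise and a small sample, r typically lands somewhere around
# +/- 0.1 to 0.3 -- a "small correlation" found in completely random data.
```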

Now let’s look at a very strong case of negative correlation in the real world: poverty rates and student achievement in Nebraska:

[Image: Nebraska poverty rates vs. student achievement]

The next graph is for the same sort of thing in Wisconsin:

[Image: Wisconsin poverty rates vs. student achievement]

Again, quite a strong correlation, just as we see here in Washington, DC:

[Image: poverty vs. proficiency in Washington, DC]

Now, how about those Value-Added scores? Do they correlate with classroom observations?

Mostly, we don’t know, because the data is kept secret. However, someone leaked to me the IVA and classroom observation scores for [DCPS in] SY 2009-10, and I plotted them [as you can see below].

[Image: VAM versus TLF scores in DC IMPACT, SY 2009-10]

I would say this looks like pretty much no correlation at all. It certainly gives teachers no assistance on what to improve in order to help their students learn better.

And how stable are Value-Added measurements [in DCPS] over time? Unfortunately, since DCPS keeps all the data hidden, we don’t know how stable these scores are here. However, the New York Times leaked the value-added data for NYC teachers for several years, and we can look at those scores to [find out]. Here is one such graph [showing how the same teachers, in the same schools, scored in 2008-9 versus 2009-10]:

[Image: value-added scores for the same NYC teachers in two successive years (Rubenstein)]

That is very close to random.

How about teachers who teach the same subject to two different grade levels, say, fourth-grade math and fifth-grade math? Again, random points:

[Image: VAM for the same subject taught at different grade levels, NYC (Rubenstein)]
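If you want to run this kind of stability check yourself, here is a minimal sketch, assuming you can get two value-added scores per teacher (two successive years, or the same subject at two grade levels). The file name and column names below are hypothetical placeholders, not the actual layout of the NYC data.

```python
# Sketch: correlate each teacher's value-added score under one condition
# (e.g., 2008-09, or 4th-grade math) with the same teacher's score under
# another (e.g., 2009-10, or 5th-grade math).  The CSV layout -- columns
# "teacher_id", "score_a", "score_b" -- and the file name are hypothetical.
import csv
import statistics

scores_a, scores_b = [], []
with open("teacher_vam_pairs.csv", newline="") as f:  # hypothetical file
    for row in csv.DictReader(f):
        scores_a.append(float(row["score_a"]))
        scores_b.append(float(row["score_b"]))

# Pearson r (statistics.correlation requires Python 3.10+)
r = statistics.correlation(scores_a, scores_b)
print(f"r = {r:.3f}, r-squared = {r*r:.3f}")
# If VAM measured a stable property of the teacher, r would be high;
# the NYC scatterplots look like a nearly random cloud, i.e. r near zero.
```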

One last point:

Mayor Gray and chancellors Henderson and Rhee all claim that education in DC only started improving after mayoral control of the schools, starting in 2007. Look for yourself [in the next two graphs].

[Image: NAEP 8th-grade math average scale scores since 1990, many states including DC]

[Image: NAEP 4th-grade reading scale scores since 1993, many states including DC]

Notice that gains began almost 20 years ago, long before mayoral control or chancellors Rhee and Henderson, long before IMPACT.

To repeat, I suggest that we throw out IMPACT and look hard at the ideas of Edwards Deming and the assessment models used in Montgomery County.

A Modest Proposal: NRFEL

I have a modest proposal.


The lower a student performs on the various tests, the more resources it obviously takes to get that student up to par (however you define “par”).

Obviously, right now, regular public schools, especially those in low-income areas, have disproportionately large percentages of such low-performing, high-needs students.

The current, popular accusation is that the school teachers in those inner-city schools are deliberately sabotaging the learning of those students (and causing that low performance), under union protection.

It is charged that if schools were privatized in general, and/or if teacher union organizations were smashed, then freed-up non-union public schools, and also charter and private and parochial schools, would do a better job.

But, today, let’s be honest: all of those high-performing schools are selective, and/or they put out the low-performers and the ones they consider ‘rotten apples.’

There has to be some place for housing the kids who are put out of, or simply not allowed into, the more-exclusive schools (be they charter, boarding, magnet, ritzy private, ritzy public, etc., etc.). And guess where that is?

Right. The regular, comprehensive public schools. Especially in poor rural areas and the inner city, there are lots of kids with lots of serious deficiencies, which take a LOT of work to overcome. But many of these schools are totally overwhelmed — I’ve seen it. I’ve seen schools in total chaos, where much of the time, nearly no teaching and learning can possibly take place. Or else it takes an absolute Superman or Wonder Woman to accomplish some teaching in one corner of the school, and only with lots of administrative support, which is denied to the rest of the school…  I’ve seen that, too.

OK. If those other schools do so much better, let’s try a truly randomized experiment to see if that’s really true. Or else let’s give all of our kids the opportunity to go there.


But what if we turn that on its head, and actually use the ONE positive proposal that Michelle Rhee ever came up with?

Here it is, in four words:

USE A REAL LOTTERY.

Use a real lottery for all.

I will call my proposal the Non-Revokable Full-year-long Exchange Lottery (NRFEL for short).

Under NRFEL, in every officially designated ‘failing’ public school, all of the low-performing students would be placed in a lottery. Based on the outcome of the lottery, those students would be selected either to:

(1) stay at their regular school, or

(2) attend a randomly-selected high-performing school; said school would either be…

(a) located within a two-hour bus ride of the student’s home, or

(b) a boarding school located anywhere in the USA.

Important terms:

(3) All this would be for no extra taxpayer dollars. Yup.

(4) None of these exchange students could be denied entry, for any reason.

(5) None of these exchange students could subsequently be put out by the receiving school, FOR ANY REASON, until the end of the school year, and the students and their parents would know that.

(6) Re-assessments would take place exactly once a year, during the summer break, to discern whether the exchange should continue. If the student is by then performing adequately, he or she would return to his or her original school.

Let me repeat: those high-performing schools would include ALL high-performing schools within a 2-hour bus ride. Oh, and they also include ALL boarding schools in the nation. For no additional money.

Don’t worry about overcrowding the receiving schools. NRFEL takes care of that, as follows.

(7) Each student in each high-achieving school is also placed in a lottery.

(8) Every school that receives one of the low-achieving or handicapped students from a ‘failing’ school would simply send back one adequately-performing student, chosen at random in this second lottery. It could be worked out later whether there would be exact, one-for-one exchanges, or whether all students being moved would be put into a general “pool”. This is a 1:1 exchange ratio: one kid in, one kid out, so class sizes, overall, wouldn’t rise. But there might be need for physical therapists, mental health and social service professionals, reading and math specialists, as well as security guards in some cases. None of which the school district shall be liable for funding. (A toy simulation of this one-for-one exchange is sketched below.)
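Just to make the mechanism concrete, here is that toy simulation. All of the names, roster sizes, and the number of exchange slots are made up.

```python
# Toy sketch of the NRFEL one-for-one exchange.  Everything here is a
# made-up placeholder; the point is only the mechanism: low performers
# at a 'failing' school enter a lottery, and each one selected is
# swapped with a randomly chosen student from a high-performing school,
# so overall enrollment at each school stays the same.
import random

random.seed(2024)  # any seed, so the draw is reproducible

failing_low_performers = [f"F{i:02d}" for i in range(1, 41)]    # hypothetical roster
high_performing_roster = [f"H{i:03d}" for i in range(1, 201)]   # hypothetical roster
n_exchange_slots = 10                                           # hypothetical quota

outgoing = random.sample(failing_low_performers, n_exchange_slots)   # lottery #1
returning = random.sample(high_performing_roster, n_exchange_slots)  # lottery #2

for sent, received in zip(outgoing, returning):
    print(f"{sent} -> high-performing school; {received} -> 'failing' school")
# One kid in, one kid out (1:1), non-revocable until the end of the school year.
```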

A very good question arises: what if the receiving school receives so many low-achieving students that it is overwhelmed and enters the category of “failing school” because it is unable to work enough of a miracle in one year? Well, then it can enter the lottery the next year on the other side of the tracks (so to speak).


One aspect I haven’t decided on yet for NRFEL is whether there should also be a similar randomized exchange of teachers, staff, and administrators between high-achieving and low-achieving schools. So I will put this up for debate. Perhaps this feature could be a separate experiment in each geographical region. (Imagine teachers and staff at Sidwell, Holton-Arms, and BCC randomly exchanging places with teachers at schools in deepest Anacostia or inside the near-DC PG County Beltway area.)


I know what you are thinking: NCLB has something like this, but often there is no room in the ‘receiving schools’. In fact, this has happened a lot in DCPS already. NRFEL takes care of this. First, it’s random, so it’s not merely selecting the kids with the most-motivated parents. Second, it’s ALL schools, no matter what denomination, ownership status, or jurisdiction. The exact numbers of exchange students and their distribution could be debated in committee hearings. I propose that each geographical region (think, Washington Metro Area, or Greater Washington, or Delmarva Peninsula, or Greater New York) would take a census of all youth, and their academic levels, to decide how to allot those students among high-and-low-achieving schools. After all, just about all of our public school students have to take lots of standardized tests. What better possible use could we make of this data? NRFEL’s goal is equalizing educational opportunity for all youth, and isn’t that supposedly what America is based on?
Let me emphasize one thing. None of these receiving schools would have the right or capacity to send any of these students back, nor to expel them. They would have to keep them and deal with them for a full school year, whether they are sick, incarcerated or hospitalized, or truant, or  whether they come to school each and every single day and join the rugby or football or hockey or computer-tech club at their new school. For no extra expense, remember.


Whatever could we use to ‘persuade’ parochial and private schools to go along? Public charter schools and magnet schools are funded by public money anyway, so they would have to comply. But think of this: private and religious schools get substantial benefits and subsidies from society and government. I will just mention one public subsidy for these schools: tax exemption!


(BTW: have you recently noticed the bill for tuition at the high-flying local private schools?)


Oh, and the low-performing schools can’t put their high-performing return-exchange students out, either. Though those schools might just find that those students will hold their own pretty well, forming substantial fractions of the school’s student government, athletic teams, and other clubs, not to mention their honor roll…


Waddaya say?

Whether DC-CAS scores go up or down at any school seems mostly to be random!

After reviewing the changes in math and reading scores at all DC public schools for 2006 through 2009, I have come to the conclusion that the year-to-year school-wide changes in those scores are essentially random. That is to say, any growth (or slippage) from one year to the next is not very likely to be repeated the next year.

Actually, it’s even worse than that. The record shows that any change from year 1 to year 2 is somewhat NEGATIVELY correlated to the change between year 2 and year 3. That is, if there is growth from year 1 to year 2, then it is a bit more likely than not that there will be a shrinkage between year 2 and year 3. Or, if the scores got worse from year 1 to year 2, then there is a slightly better-than-even chance that the scores will improve the following year.

And it doesn’t seem to matter whether the same principal is kept during all three years, or whether the principals are replaced one or more times over the three-year period.

In other words, all this shuffling of principals (and teachers) and turning the entire school year into preparation for the DC-CAS seems to be futile. EVEN IF YOU BELIEVE THAT THE SOLE PURPOSE OF EDUCATION IS TO PRODUCE HIGH STANDARDIZED TEST SCORES. (Which I don’t.)

Don’t believe me? I have prepared some scatterplots, below, and you can see the raw data here as a Google Doc.

My first graph is a scatterplot relating the changes in percentages of students scoring ‘proficient’ or better on the reading tests from Spring 2006 to Spring 2007 on the x-axis, with changes in percentages of students scoring ‘proficient’ or better in reading from ’07 to ’08 on the y-axis, at DC Public Schools that kept the same principals for 2005 through 2008.

If there were a positive correlation between the two time intervals in question, then the scores would cluster mostly in the first and third quadrants. And that would mean that if scores grew from ’06 to ’07 then they also grew from ’07 to ’08; or if they went down from ’06 to ’07, then they also declined from ’07 to ’08.

But that’s not what happened. In fact, in the 3rd quadrant, I only see one school – apparently M.C. Terrell – where the scores went down during both intervals. However, there are about as many schools in the second quadrant as in the first quadrant. Being in the second quadrant means that the scores declined from ’06 to ’07 but then rose from ’07 to ’08. And there appear to be about 7 schools in the fourth quadrant. Those are schools where the scores rose from ’06 to ’07 but then declined from ’07 to ’08.

I asked Excel to calculate a regression line of best fit between the two sets of data, and it produced the line that you see, slanted downwards to the right. Notice that R-squared is 0.1998, which is rather weak. If we look at R, the correlation coefficient (the square root of R-squared, with its sign taken from the downward slope of the line), my calculator gives me -0.447, which means again that the growth (or decline) from ’06 to ’07 is negatively correlated with the growth (or decline) from ’07 to ’08 – but not in a strong manner.
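For anyone who wants to redo this arithmetic from the Google Doc, here is a rough sketch of the calculation; the change-in-percentage-proficient pairs below are placeholders, not the actual DC-CAS figures.

```python
# Sketch of the quadrant count and of the correlation between successive
# changes.  Each pair is (change in % proficient from '06 to '07, change
# from '07 to '08) for one school; the numbers here are placeholders.
import statistics

changes = [(+8.0, -3.0), (-2.0, +5.0), (+4.0, +1.0), (-6.0, -1.0), (+3.0, -7.0)]

quadrants = {1: 0, 2: 0, 3: 0, 4: 0}
for d1, d2 in changes:
    if d1 >= 0 and d2 >= 0:
        quadrants[1] += 1   # up both years
    elif d1 < 0 and d2 >= 0:
        quadrants[2] += 1   # down, then up
    elif d1 < 0 and d2 < 0:
        quadrants[3] += 1   # down both years
    else:
        quadrants[4] += 1   # up, then down -- a 'losing bet'

r = statistics.correlation([d1 for d1, _ in changes],
                           [d2 for _, d2 in changes])
print("quadrant counts:", quadrants, f"  r = {r:.3f}, r-squared = {r*r:.3f}")
# With the actual data, every one of these plots gave a weak negative r --
# plenty of 'losing bets' and very little year-to-year predictability.
```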

OK. Well, how about during years ’07-’08-’09? Maybe Michelle Rhee was better at picking winners and losers than former Superintendent Janey? Let’s take a look at schools where she allowed the same principal to stay in place for ’07, ’08, and ’09:

Actually, this graph looks worse! There are nearly twice as many schools in quadrant four as in quadrant one! That means that there are lots of schools where reading scores went up between ’07 and ’08, but DECLINED from ’08 to ’09; but many fewer schools where the scores went up both years. In the second quadrant, I  see about four schools where the scores declined from ’07 to ’08 but then went up between ’08 and ’09. Excel again provided a linear regression line of best fit, and again, the line slants down and to the right. R-squared is 0.1575, which is low. R itself is about -0.397, which is, again, rather low.

OK, what about schools where a principal got replaced? If you believe that all veteran administrators are bad and need to be replaced with new ones with limited or no experience, you might expect to see negative correlations, but with positive overall outcomes; in other words, the scores should cluster in the second quadrant. Let’s see if that’s true. First, reading changes over the period 2006-’07-’08:

Although there are schools in the second quadrant, there are also a lot in the first quadrant, and I also see more schools in quadrants 3 and 4 than we’ve seen in the first two graphs. According to Excel, R-squared is extremely low: 0.0504, which means that R is about -0.224, which means, essentially, that it is almost impossible to predict what the changes would be from one year to the next.

Well, how about the period ’07-’08-’09? Maybe Rhee did a better job of changing principals then? Let’s see:

Nope. Once again, it looks like there are as many schools in quadrant 4 as in quadrant 1, and considerably fewer in quadrant 2. (To refresh your memory: if a school is in quadrant 2, then the scores went down from ’07 to ’08, but increased from ’08 to ’09. That would represent a successful ‘bet’ by the superintendent or chancellor. However, if a school is in quadrant 4, that means that reading scores went up from ’07 to ’08, but went DOWN from ’08 to ’09; that would represent a losing ‘bet’ by the person in charge.) Once again, the line of regression slants down and to the right.  The value of R-squared, 0.3115, is higher than in any previous scatterplot (I get R = -0.558) which is not a good sign if you believe that superintendents and chancellors can read the future.

Perhaps things are more predictable with mathematics scores? Let’s take a look. First, changes in math scores during ’06-’07-’08 at schools that kept the same principal all 3 years:

Doesn’t look all that different from our first Reading graph, does it? Now, math score changes during ’07-’08-’09, schools with the same principal all 3 years:

Again, a weak negative correlation. OK, what about schools where the principals changed at least once? First, look at ’06-’07-’08:

And how about ’07-’08-’09 for schools with at least one principal change?

Again, a very weak negative correlation, with plenty of ‘losing bets’.

Notice that every single one of these graphs presented a weak negative correlation, with plenty of what I am calling “losing bets” – by which I mean cases where the scores went up from the first year to the second, but then went down from the second year to the third.

OK. Perhaps it’s not enough to change principals once every 3 or 4 years. Perhaps it’s best to do it every year or two? (Anybody who has actually been in a school knows that when the principal gets replaced frequently, then it’s generally a very bad sign. But let’s leave common sense aside for a moment.) Here we have scatterplots showing what the situation was, in reading and math, from ’07 through ’09, at schools that had 2 or more principal changes from ’06 to ’09:

and

This conclusion is not going to win me lots of friends among those who want to use “data-based” methods of deciding whether teachers or administrators keep their jobs, or how much they get paid. But facts are facts.

==============================================================================

A little bit of mathematical background on statistics:

Statisticians say that two quantities (let’s call them A and B) are positively correlated when an increase in one quantity (A) is linked to an increase in the other quantity (B). An example might be a person’s height (for quantity A) and the length of a person’s foot (for quantity B). Generally, the taller you are, the longer your feet are. Yes, there are exceptions, so these two things don’t have a perfect correlation, but the connection is pretty strong.

If two things are negatively correlated, that means that when one quantity (A) increases, then the other quantity (B) decreases. An example would be the speed of a runner versus the time it takes to run a given distance.  The higher the speed at which the athlete runs, the less time it takes to finish the race. And if you run at a lower speed, then it takes you more time to finish.

And, of course, there are things that have no correlation to speak of.
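To make the arithmetic concrete, here is a tiny sketch that computes r for both of the examples above; all the numbers are made up.

```python
# Tiny illustration of positive vs. negative correlation, using made-up
# numbers for the two examples above: heights vs. foot lengths, and a
# runner's speed vs. finishing time over a fixed distance.
import statistics  # statistics.correlation requires Python 3.10+

heights_cm = [150, 160, 170, 180, 190]
foot_cm    = [22.5, 24.0, 25.5, 27.0, 28.0]        # tends to rise with height
print("height vs. foot length: r =",
      round(statistics.correlation(heights_cm, foot_cm), 3))   # close to +1

speeds_kmh = [8, 10, 12, 14, 16]
minutes_for_5k = [5 / s * 60 for s in speeds_kmh]  # time = distance / speed
print("speed vs. finishing time: r =",
      round(statistics.correlation(speeds_kmh, minutes_for_5k), 3))  # strongly negative
```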
