Why VAM doesn’t work

Peter Greene does a wonderful job of explaining why ‘value-added measurements’ for teachers are complete nonsense. I am reprinting the whole thing because, as usual, he has hit the nail right on the head, and has done it more completely and thoroughly than anybody else, most definitely including me.

VAM: Why Is This Zombie Policy Still Around?

by PETER GREENE

NOV 20

It was a bit of a shock. I picked up my morning paper, and there was an article on the front page touting our school district’s PVAAS scores, the commonwealth of Pennsylvania’s version of VAM scores, and I was uncomfortably reminded that value-added measures are still a thing.

Value Added Measures are bunk.

We used to talk about this a lot. A. Lot. But VAM (also known as Something-VAAS in some states) has departed the general education discussion even though it has not departed the actual world of education. Administrators still brag about, or bemoan, their VAM scores. VAM scores still affect teacher evaluation. And VAM scores are still bunk.

So let’s review. Or if you’re new-ish to the ed biz, let me introduce you to what lies behind the VAM curtain.

The Basic Idea

Value Added is a concept from the manufacturing and business world. If I take a dollar’s worth of sheet metal and turn it into a forty dollar toaster, then I have added thirty-nine dollars of value to the sheet metal. It’s an idea that helps businesses figure out if they’re really making money on something, or if adding some feature to a product or process is worth the bother.

Like when you didn’t fix the kitchen door before you tried to sell your house, because fixing the door would have cost a grand but would have allowed you to raise the price of the house only a buck and a half. Or how a farmer might decide that putting a little more meat on bovine bones would cost more than he’d make back from selling the slightly fatter cow.

So the whole idea here is that schools are supposed to add value to students, as if students are unmade toasters or unfatted calves, and the school’s job is to make them worth more money.

Yikes! Who decided this would be a good thing to do with education?

The Granddaddy of VAAS was William Sanders. Sanders grew up on a dairy farm and went on to earn a PhD in biostatistics and quantitative genetics. He was mostly interested in questions like “If you have a choice between buying Bull A and Bull B, which one is more likely to produce daughters that will give more milk?” Along with some teaching, Sanders was a longtime statistical consultant for the Institute of Agricultural Research.

He said that in 1982, while an adjunct professor at a satellite campus of the University of Tennessee, he read an article (written by then-Governor Lamar Alexander) saying that there was no proper way to hold teachers accountable for test scores.

Sure there is, he thought. He was certain he could just tweak the models he used for crunching agricultural statistics and it would work great. He sent the model off to Alexander, but it languished unused until the early 90s, when the next governor pulled it out and called Sanders in, and the Educational Value-Added Assessment System (EVAAS) was on its way.

The other Granddaddy of VAAS is SAS, an analytics company founded in 1976.

Founder James H. Goodnight was born in 1943 in North Carolina. He earned a master’s in statistics; that, combined with some programming background, landed him a job with a company that built communication stations for the Apollo program.

He next went to work as a professor at North Carolina State University, where he and some other faculty created the Statistical Analysis System for analyzing agricultural data, a project funded mainly by the USDA. Once the first SAS was done and had acquired 100 customers, Goodnight et al. left academia and started the company.

William Sanders also worked as a North Carolina university researcher, and it’s not clear when, exactly, he teamed up with SAS; his EVAAS system was proprietary, and as the 90s unfolded, that made him a valuable man to go into business with. The VAAS system, rebranded for each state that signed on, became a big deal for SAS, which launched its Education Technologies Division in 1997.

Sanders passed away in 2017. Goodnight has done okay. The man owns two thirds of the company, which is still in the VAAS biz, and he’s now worth $7.4 billion-with-a-B. But give him credit: apparently remembering his first crappy job, Goodnight has made SAS one of the world’s best places to work – in fact, it was SAS that influenced Google’s famously fun workplace culture. It’s a deep slice of irony: he has sustained a corporate culture that emphasizes valuing people as live human beings, not as a bunch of statistics.

Somehow Goodnight has built a little world where people live and work among dancing rainbows and fluffy fairy dust clouds, and they spend their days manufacturing big black rainclouds to send out into the rest of the world.

How does it work?

Explanations are layered in statistics jargon:

Using mixed model equations, TVAAS uses the covariance matrix from this multivariate, longitudinal data set to evaluate the impact of the educational system on student progress in comparison to national norms, with data reports at the district, school, and teacher levels.

Sanders’ explanations weren’t any better. In 2009, several of us were sent off to get training in how to use PA’s version (PVAAS) and among other things, I wrote this:

This is a highly complex model that three well-paid consultants could not clearly explain to seven college-educated adults, but there were lots of bars and graphs, so you know it’s really good. I searched for a comparison and first tried “sophisticated guess;” the consultant quickly corrected me—“sophisticated prediction.”

I tried again—was it like a weather report, developed by comparing thousands of instances of similar conditions to predict the probability of what will happen next? Yes, I was told. That was exactly right. This makes me feel much better about PVAAS, because weather reports are the height of perfect prediction.

The basic mathless idea is this. Using sophisticated equations, the computer predicts what Student A would likely score on this year’s test in some alternate universe where no school-related factors affected the student’s score. Then the computer looks at the score that Actual Student A achieved. If Actual Student and Alternative Universe Student have different scores, the difference, positive or negative, is attributed to the teacher.

Let me say that again. The computer predicts a student score. If the actual student gets a different score, that is not attributed to, say, a failure on the part of the predictive software. All the blame and/or glory belong to the teacher.
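To make that concrete, here is a minimal sketch of the value-added logic in Python. This is emphatically not SAS’s proprietary EVAAS/PVAAS model (which uses multivariate, longitudinal mixed models); it is a toy version with made-up numbers that predicts this year’s score from last year’s and credits each teacher with the average leftover difference.

```python
# Toy value-added sketch (NOT the proprietary EVAAS/PVAAS model).
# All numbers below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_students = 300

prior = rng.normal(500, 50, n_students)        # last year's scale scores
teacher = rng.integers(0, 10, n_students)      # each student's teacher (0-9)
actual = 0.8 * prior + 120 + rng.normal(0, 30, n_students)  # this year's scores

# Step 1: "predict" each student's score from prior achievement alone.
slope, intercept = np.polyfit(prior, actual, 1)
predicted = slope * prior + intercept

# Step 2: whatever is left over (actual minus predicted) is attributed to the teacher.
residual = actual - predicted

# Step 3: a teacher's "value added" is the average residual of his or her students.
for t in range(10):
    print(f"Teacher {t}: value-added estimate = {residual[teacher == t].mean():+.1f} points")

# In this simulation, teachers have NO real effect at all, yet every teacher
# still gets a nonzero "value added" number, which is exactly the volatility
# problem critics keep pointing out.
```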

VAAS fans insist that the model mathematically accounts for factors like socio-economic background and school and other stuff. Here’s the explanatory illustration:

[illustration not reproduced here]

Here’s a clarification of that illustration:

“This is stuff we made up to pretend we can predict one kid’s test scores”

So how well does it actually work?

Audrey Amrein-Beardsley, a leading researcher and scholar in this field, ran a whole blog for years (VAMboozled) that did nothing but bring to light the many ways in which VAM systems were failing, so I’m going to be (sort of) brief here and stick to a handful of illustrations.

Let’s ask the teachers.

Clarin Collins, a researcher, college professor and, as of this year, a high school English teacher, had a crazy idea back in 2014–why not ask teachers if they were getting anything of value out of the VAAS?

Short answer: no.

Long answer: Collins made a list of the various marketing promises made by SAS about VAAS and asked teachers whether they agreed or disagreed (they could do so strongly, too). Here’s the list:

EVAAS helps create professional goals

EVAAS helps improve instruction

EVAAS will provide incentives for good practices

EVAAS ensures growth opportunities for very low achieving students

EVAAS ensures growth opportunities for students

EVAAS helps increase student learning

EVAAS helps you become a more effective teacher

Overall, the EVAAS is beneficial to my school

EVAAS reports are simple to use

Overall, the EVAAS is beneficial to me as a teacher

Overall, the EVAAS is beneficial to the district

EVAAS ensures growth opportunities for very high achieving students

EVAAS will identify excellence in teaching or leadership

EVAAS will validly identify and help to remove ineffective teachers

EVAAS will enhance the school environment

EVAAS will enhance working conditions

That list is arranged in descending order of agreement: even the top item drew disagreement from over 50% of teachers, and by the time we get to the bottom of the list, the rate of disagreement is almost 80%. At the top of the list, fewer than 20% of teachers agreed or strongly agreed, and it went downhill from there.

Teachers reported that the data was “vague” and “unusable.” They complained that their VAAS ratings whipped up and down from year to year with no rhyme or reason, with over half finding their VAAS number wildly different from their principal’s evaluation. Teachers of gifted students, whose students had already hit the test’s ceiling, reported low VAAS scores. And while the VAAS magic math is supposed to blunt the impact of having low-ability students in your classroom, it turns out it doesn’t really do that. And this:

Numerous teachers reflected on their own questionable practices. As one English teacher said, “When I figured out how to teach to the test, the scores went up.” A fifth grade teacher added, “Anything based on a test can be ‘tricked.’ EVAAS leaves room for me to teach to the test and appear successful.”

EVAAS also assumes that the test data fed into the system is a valid measure of what it says it measures. That’s a generous view of tests like Pennsylvania’s Keystone Exam. Massaging bad data with some kind of sophisticated mathiness still just gets you bad data.

But hey–that’s just teachers and maybe they’re upset about being evaluated with rigor. What do other authorities have to say?

The Houston Court Case

The Houston school district used EVAAS not only to evaluate teachers but also to factor into pay decisions. So the AFT took them to court. A whole lot of experts in education and evaluation and assessment came to testify, and when all was said and done, here are twelve big things that the assembled experts had to say about EVAAS:

1) Large-scale standardized tests have never been validated for this use. A test is only useful for the purpose for which it is designed. Nobody has designed a test for VAM purposes.

2) When tested against another VAM system, EVAAS produced wildly different results.

3) EVAAS scores are highly volatile from one year to the next.

4) EVAAS overstates the precision of teachers’ estimated impacts on growth. The system pretends to know things it doesn’t really know.

5) Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value. Again, the students you teach have a big effect on the results that you get.

6) The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores.

7) Ceiling effects are certainly an issue. If your students topped out on the last round of tests, you won’t be able to get them to grow enough this year.

8) There are major validity issues with “artificial conflation.” (This is the phenomenon in which administrators feel forced to make their observation scores “align” with VAAS scores.) Administrators in Houston were pressured to make sure that their own teacher evaluations confirmed rather than contradicted the magic math.

9) Teaching-to-the-test is of perpetual concern. Because it’s a thing that can raise your score, and it’s not much like actual teaching.

10) HISD is not adequately monitoring the EVAAS system. HISD was not even allowed to see or test the secret VAM sauce. Nobody is allowed to know how the magic maths work. Hell, in Pennsylvania, teachers are not even allowed to see the test that their students took. You have to sign a pledge not to peek. So from start to finish, you have no knowledge of where the score came from.

11) EVAAS lacks transparency. See above.

12) Related, teachers lack opportunities to verify their own scores. Think your score is wrong? Tough.

The experts said that EVAAS was bunk. US Magistrate Judge Stephen Smith agreed, saying that “high stakes employment decisions based on secret algorithms (are) incompatible with… due process,” and the proper remedy was to overturn the policy. Houston had to kiss VAAS goodbye.

Anyone else have thoughts?

The National Association of Secondary School Principals issued a statement in 2015 and revisited it in 2019:

At first glance, it would appear reasonable to use VAMs to gauge teacher effectiveness. Unfortunately, policymakers have acted on that impression over the consistent objections of researchers who have cautioned against this inappropriate use of VAMs.

The American Educational Research Association also cautioned in 2015 against the use of VAM scores for any sort of high-stakes teacher evaluation, due to significant technical limitations. They’ve got a batch of other research links, too.

The American Statistical Association released a statement in 2014 warning districts away from using VAM to measure teacher effectiveness. VAMs, they say, do not directly measure potential teacher contributions toward other student outcomes. Also, VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.

Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality. 

They cite the “peer-reviewed study” funded by Gates and published by AERA which stated emphatically that “Value-added performance measures do not reflect the content or quality of teachers’ instruction.” This study went on to note that VAM doesn’t seem to correspond to anything that anybody considers a feature of good teaching.

What if we don’t use the data soaked in VAM sauce to make Big Decisions? Can we use it just to make smaller ones? Research into a decade-long experiment in using student test scores to “toughen” teacher evaluation and make everyone teach harder and better showed that the experiment was a failure.

Well, that was a decade or so ago. I bet they’ve done all sorts of things to VAM and VAAS to improve them.

You would lose that bet.

Well, at least they don’t use them to evaluate teachers any more, right?

Sorry.

There’s a lot less talk about tying VAM to raises or bonus/merit pay, but the primary innovation has been to drape the rhetorical fig leaf of “student growth” over VAM scores. The other response has been to water VAAS/VAM measures down with other “multiple measures,” an option handed to states back in 2015 when ESSA replaced No Child Left Behind as the current version of federal education law.

Pennsylvania has slightly reduced the influence of PVAAS on teacher and building evaluations in the latest version of its evaluation system, but it’s still in there, both as part of the building evaluation that affects all teacher evaluations and as part of the evaluation for teachers who teach the tested subjects. Pennsylvania also uses the technique of mushing together “three consecutive years of data,” a technique that hopes to compensate for the fact that VAAS scores hop around from year to year.

VAAS/VAM is still out there kicking, still being used as part of a way to evaluate teachers and buildings. And it’s still bunk.

But we have to do something to evaluate schools and teachers!

You are taken to the hospital with some sort of serious respiratory problem. One afternoon you wake up suddenly to find some janitors standing over you with a chain saw.

“What the hell!” you holler. “What are you getting ready to do??!!”

“We’re going to amputate your legs with a chain saw,” they reply.

“Good lord,” you holler, trying to be reasonable. “Is there any reason to think that would help with my breathing?”

“Not really,” they reply. “Actually, all the medical experts say it’s a terrible idea.”

“Well, then, don’t do it! It’s not going to help. It’s going to hurt, a lot.”

“Well, we’ve got to do something.”

“Not that!”

“Um, well. What if we just take your feet off? I mean, this is what we’ve come up with, and if you don’t have a better idea, then we’re just going to go ahead with our chain saw plan.”

VAM is a stark example of education inertia in action. Once we’re doing something, the burden of proof somehow shifts: nobody has to prove there’s a good reason to do the thing, and opponents must prove they have a better idea. Until they do, we just keep firing up the chain saw.

There are better ideas out there (check out the work of Jack Schneider at the University of Massachusetts Amherst), but this post is long enough already, and honestly, if you’re someone who thinks it’s so important to reduce teachers’ work to a single score, the burden is on you to prove that you’ve come up with something that is valid, reliable, and non-toxic. A system that depends on the Big Standardized Tests and a mysterious black box to show that somehow teachers have made students more valuable is none of those things.

VAM systems have had over a decade to prove their usefulness. They haven’t. It’s long past time to put them in the ground.

© 2023 Peter Greene

Venangoland, PA


‘Beatings Must Continue Until Morale Improves’

The ‘Value-Added Measurement’ movement in American education, implemented in part by the now-disgraced Michelle Rhee here in DC, has been a complete and utter failure, even as measured by its own yardsticks, as you will see below.

Yet, the same corporate ‘reformers’ who were its major cheerleaders do not conclude from this that the idea was a bad one. Instead, they claim that it wasn’t tried with enough rigor and fidelity.

From “Schools Matter”:

=======================

How to Learn Nothing from the Failure of VAM-Based Teacher Evaluation

The Annenberg Institute for School Reform is a most exclusive academic club lavishly funded and outfitted at Brown U. for the advancement of corporate education in America. 

The Institute is headed by Susanna Loeb, who has a whole slew of degrees from prestigious universities, none of which has anything to do with the science and art of schooling, teaching, or learning.  

Researchers at the Institute are circulating a working paper that, at first glance, would suggest that school reformers might have learned something about the failure of teacher evaluation based on value-added models applied to student test scores. The abstract:

Starting in 2009, the U.S. public education system undertook a massive effort to institute new high-stakes teacher evaluation systems. We examine the effects of these reforms on student achievement and attainment at a national scale by exploiting the staggered timing of implementation across states. We find precisely estimated null effects, on average, that rule out impacts as small as 1.5 percent of a standard deviation for achievement and 1 percentage point for high school graduation and college enrollment. We also find little evidence of heterogeneous effects across an index measuring system design rigor, specific design features, and district characteristics. [my emphasis – GFB]

So could this mean that the national failure of VAM applied to teacher evaluation might translate to decreasing the brutalization of teachers and the waste of student learning time that resulted from the implementation of VAM beginning in 2009?

No such luck.   

The conclusion of the paper, in fact, clearly shows that the Annenbergers have concluded that the failure to raise test scores by corporate accountability means (VAM) resulted from laggard states and districts that did not adhere strictly to the VAM’s mad methods.  In short, the corporate-led failure of VAM in education happened as a result of schools not being corporate enough:

Firms in the private sector often fail to implement best management practices and performance evaluation systems because of imperfectly competitive markets and the costs of implementing such policies and practices (Bloom and Van Reenen 2007). These same factors are likely to have influenced the design and implementation of teacher evaluation reforms. Unlike firms in a perfectly competitive market with incentives to implement management and evaluation systems that increase productivity, school districts and states face less competitive pressure to innovate. Similarly, adopting evaluation systems like the one implemented in Washington D.C. requires a significant investment of time, money, and political capital. Many states may have believed that the costs of these investments outweighed the benefits. Consequently, the evaluation systems adopted by many states were not meaningfully different from the status quo and subsequently failed to improve student outcomes.

So the Gates-Duncan RTTT corporate plan for teacher evaluation failed not because it was a corporate model but because it was not corporate enough!  In short, there were way too many small carrots and not enough big sticks.

Part Two: Cheating in DCPS

DC Education Reform Ten Years After, Part 2: Test Cheats

Richard P Phelps

Ten years ago, I worked as the Director of Assessments for the District of Columbia Public Schools (DCPS). For temporal context, I arrived after the first of the infamous test cheating scandals and left just before the incident that spawned a second. Indeed, I filled a new position created to both manage test security and design an expanded testing program. I departed shortly after Vincent Gray, who opposed an expanded testing program, defeated Adrian Fenty in the September 2010 DC mayoral primary. My tenure coincided with Michelle Rhee’s last nine months as Chancellor. 

The recurring test cheating scandals of the Rhee-Henderson years may seem extraordinary but, in fairness, DCPS was more likely than the average US school district to be caught because it received a much higher degree of scrutiny. Given how tests are typically administered in this country, the incidence of cheating is likely far greater than news accounts suggest, for several reasons: 

• in most cases, those who administer tests—schoolteachers and administrators—have an interest in their results;

• test security protocols are numerous and complicated yet, nonetheless, the responsibility of non-expert ordinary school personnel, guaranteeing their inconsistent application across schools and over time;

• after-the-fact statistical analyses are not legal proof—the odds of a certain amount of wrong-to-right erasures in a single classroom on a paper-and-pencil test being coincidental may be a thousand to one, but one-in-a-thousand is still legally plausible; and

• after-the-fact investigations based on interviews are time-consuming, scattershot, and uneven.

Still, there were measures that the Rhee-Henderson administrations could have adopted to substantially reduce the incidence of cheating, but they chose none that might have been effective. Rather, they dug in their heels, insisted that only a few schools had issues, which they thoroughly resolved, and repeatedly denied any systematic problem.  

Cheating scandals

From 2007 to 2009 rumors percolated of an extraordinary level of wrong-to-right erasures on the test answer sheets at many DCPS schools. “Erasure analysis” is one among several “red flag” indicators that testing contractors calculate to monitor cheating. The testing companies take no responsibility for investigating suspected test cheating, however; that is the customer’s, the local or state education agency. 
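To illustrate the kind of arithmetic behind such a red flag, here is a small hedged sketch in Python. All of the numbers (the 2% base rate, the 50-item test, the 8 erasures) are hypothetical rather than actual DCPS figures, and the erasure analyses that testing contractors actually run are considerably more elaborate.

```python
# Simplified erasure-analysis arithmetic with hypothetical numbers.
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """Probability of seeing k or more 'successes' in n trials, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_items = 50        # items on the (hypothetical) paper-and-pencil test
base_rate = 0.02    # assumed typical rate of wrong-to-right erasures per item
observed = 8        # wrong-to-right erasures per answer sheet in a flagged classroom

p_value = binom_tail(observed, n_items, base_rate)
print(f"Chance of {observed}+ wrong-to-right erasures by luck alone: {p_value:.6f}")

# A vanishingly small probability flags the classroom for review, but as noted
# above, "a thousand to one" is statistical suspicion, not legal proof.
```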

In her autobiographical account of her time as DCPS Chancellor, Michelle Johnson (née Rhee) wrote (p. 197):

“For the first time in the history of DCPS, we brought in an outside expert to examine and audit our system. Caveon Test Security – the leading expert in the field at the time – assessed our tests, results, and security measures. Their investigators interviewed teachers, principals, and administrators.

“Caveon found no evidence of systematic cheating. None.”

Caveon, however, had not looked for “systematic” cheating. All they did was interview a few people at several schools where the statistical anomalies were more extraordinary than at others. As none of those individuals would admit to knowingly cheating, Caveon branded all their excuses as “plausible” explanations. That’s it; that is all that Caveon did. But, Caveon’s statement that they found no evidence of “widespread” cheating—despite not having looked for it—would be frequently invoked by DCPS leaders over the next several years.[1]

Incidentally, prior to the revelation of its infamous decades-long, systematic test cheating, the Atlanta Public Schools had similarly retained Caveon Test Security and was, likewise, granted a clean bill of health. Only later did the Georgia state attorney general swoop in and reveal the truth. 

In its defense, Caveon would note that several cheating prevention measures it had recommended to DCPS were never adopted.[2] None of the cheating prevention measures that I recommended were adopted, either.

The single most effective means for reducing in-classroom cheating would have been to rotate teachers on test days so that no teacher administered a test to his or her own students. It would not have been that difficult to randomly assign teachers to different classrooms on test days.

The single most effective means for reducing school administrator cheating would have been to rotate test administrators on test days so that none managed the test materials for their own schools. The visiting test administrators would have been responsible for keeping test materials away from the school until test day, distributing sealed test booklets to the rotated teachers on test day, and for collecting re-sealed test booklets at the end of testing and immediately removing them from the school.

Instead of implementing these, or a number of other feasible and effective test security measures, DCPS leaders increased the number of test proctors, assigning each of a few dozen or so central office staff a school to monitor. Those proctors could not reasonably manage the volume of oversight required. A single DC test administration could encompass a hundred schools and a thousand classrooms.

Investigations

So, what effort, if any, did DCPS make to counter test cheating? They hired me, but then rejected all my suggestions for increasing security. Also, they established a telephone tip line. Anyone who suspected cheating could report it, even anonymously, and, allegedly, their tip would be investigated. 

Some forms of cheating are best investigated through interviews. Probably the most frequent forms of cheating at DCPS—teachers helping students during test administrations and school administrators looking at test forms prior to administration—leave no statistical residue. Eyewitness testimony is the only type of legal evidence available in such cases, but it is not just inconsistent, it may be socially destructive. 

I remember two investigations best: one occurred in a relatively well-to-do neighborhood with well-educated parents active in school affairs; the other in one of the city’s poorest neighborhoods. Superficially, the cases were similar—an individual teacher was accused of helping his or her own students with answers during test administrations. Making a case against either elementary school teacher required sworn testimony from eyewitnesses, that is, students—eight- to ten-year-olds.

My investigations, then, consisted of calling children into the principal’s office one-by-one to be questioned about their teacher’s behavior. We couldn’t hide the reason we were asking the questions. And, even though each student agreed not to tell others what had occurred in their visit to the principal’s office, we knew we had only one shot at an uncorrupted jury pool. 

Though the accusations against the two teachers were similar and the cases against them equally strong, the outcomes could not have been more different. In the high-poverty neighborhood, the students seemed suspicious and said little; none would implicate the teacher, whom they all seemed to like. 

In the more prosperous neighborhood, students were more outgoing, freely divulging what they had witnessed. The students had discussed the alleged coaching with their parents who, in turn, urged them to tell investigators what they knew. During his turn in the principal’s office, the accused teacher denied any wrongdoing. I wrote up each interview, then requested that each student read and sign. 

Thankfully, that accused teacher made a deal and left the school system a few weeks later. Had he not, we would have required the presence in court of the eight- to ten-year-old eyewitnesses to testify under oath against their former teacher, who taught multi-grade classes. Had that prosecution not succeeded, the eyewitness students could have been routinely assigned to his classroom the following school year.

My conclusion? Only in certain schools is the successful prosecution of a cheating teacher through eyewitness testimony even possible. But, even where possible, it consumes inordinate amounts of time and, otherwise, comes at a high price, turning young innocents against authority figures they naturally trusted. 

Cheating blueprints

Arguably the most widespread and persistent testing malfeasance in DCPS received little attention from the press. Moreover, it was directly propagated by District leaders, who published test blueprints on the web. Put simply, test “blueprints” are lists of the curricular standards (e.g., “student shall correctly add two-digit numbers”) and the number of test items included in an upcoming test related to each standard. DC had been advance publishing its blueprints for years.

I argued that the way DC did it was unethical. The head of the Division of Data & Accountability, Erin McGoldrick, however, defended the practice, claimed it was common, and cited its existence in the state of California as precedent. The next time she and I met for a conference call with one of DCPS’s test providers, Discovery Education, I asked their sales agent how many of their hundreds of other customers advance-published blueprints. His answer: none.

In the state of California, the location of McGoldrick’s only prior professional experience, blueprints were, indeed, published in advance of test administrations. But their tests were longer than DC’s and all standards were tested. Publication of California’s blueprints served more to remind the populace what the standards were in advance of each test administration. Occasionally, a standard considered to be of unusual importance might be assigned a greater number of test items than the average, and the California blueprints signaled that emphasis. 

In Washington, DC, the tests used in judging teacher performance were shorter, covering only some of each year’s standards. So, DC’s blueprints showed everyone well in advance of the test dates exactly which standards would be tested and which would not. For each teacher, this posed an ethical dilemma: should they “narrow the curriculum” by teaching only that content they knew would be tested? Or, should they do the right thing and teach all the standards, as they were legally and ethically bound to, even though it meant spending less time on the to-be-tested content? It’s quite a conundrum when one risks punishment for behaving ethically.

Monthly meetings convened to discuss issues with the districtwide testing program, the DC Comprehensive Assessment System (DC-CAS)—administered to comply with the federal No Child Left Behind (NCLB) Act. All public schools, both DCPS and charters, administered those tests. At one of these regular meetings, two representatives from the Office of the State Superintendent of Education (OSSE) announced plans to repair the broken blueprint process.[3]

The State Office employees argued thoughtfully and reasonably that it was professionally unethical to advance publish DC test blueprints. Moreover, they had surveyed other US jurisdictions in an effort to find others that followed DC’s practice and found none. I was the highest-ranking DCPS employee at the meeting and I expressed my support, congratulating them for doing the right thing. I assumed that their decision was final.

I mentioned the decision to McGoldrick, who expressed surprise and speculated that it might not have been made at the highest level of the organizational hierarchy. Wasting no time, she met with other DCPS senior managers and the proposed change was forthwith shelved. In that, and other ways, the DCPS tail wagged the OSSE dog.

* * *

It may be too easy to blame ethical deficits alone for the Rhee-Henderson era ed reformers’ recalcitrant attitude toward test security. The columnist Peter Greene insists that knowledge deficits among self-appointed education reformers also matter:

“… the reformistan bubble … has been built from Day One without any actual educators inside it. Instead, the bubble is populated by rich people, people who want rich people’s money, people who think they have great ideas about education, and even people who sincerely want to make education better. The bubble does not include people who can turn to an Arne Duncan or a Betsy DeVos or a Bill Gates and say, ‘Based on my years of experience in a classroom, I’d have to say that idea is ridiculous bullshit.’”

“There are a tiny handful of people within the bubble who will occasionally act as bullshit detectors, but they are not enough. The ed reform movement has gathered power and money and set up a parallel education system even as it has managed to capture leadership roles within public education, but the ed reform movement still lacks what it has always lacked–actual teachers and experienced educators who know what the hell they’re talking about.”

In my twenties, I worked for several years in the research department of a state education agency. My primary political lesson from that experience, consistently reinforced subsequently, is that most education bureaucrats tell the public that the system they manage works just fine, no matter what the reality. They can get away with this because they control most of the evidence and can suppress it or spin it to their advantage.

In this proclivity, the DCPS central office leaders of the Rhee-Henderson era proved themselves to be no different than the traditional public-school educators they so casually demonized. 

US school systems are structured to be opaque and, it seems, both educators and testing contractors like it that way. For their part, and contrary to their rhetoric, Rhee, Henderson, and McGoldrick passed on many opportunities to make their system more transparent and accountable.

Education policy will not improve until control of the evidence is ceded to genuinely independent third parties, hired neither by the public education establishment nor by the education reform club.

The author gratefully acknowledges the fact-checking assistance of Erich Martel and Mary Levy.


Citation:  Phelps, R. P. (2020, September). Looking Back on DC Education Reform 10 Years After, Part 2: Test Cheats. Nonpartisan Education Review / Testimonials. https://nonpartisaneducation.org/Review/Testimonials/v16n3.htm


[1] A perusal of Caveon’s website clarifies that their mission is to help their clients–state and local education departments–not get caught. Sometimes this means not cheating in the first place; other times it might mean something else. One might argue that, ironically, Caveon could be helping its clients to cheat in more sophisticated ways and cover their tracks better.

[2] Among them: test booklets should be sealed until the students open them and resealed by the students immediately after; and students should be assigned seats on test day and a seating chart submitted to test coordinators (necessary for verifying cluster patterns in student responses that would suggest answer copying).

[3] Yes, for those new to the area, the District of Columbia has an Office of the “State” Superintendent of Education (OSSE). Its domain of relationships includes not just the regular public schools (i.e., DCPS), but also other public schools (i.e., charters) and private schools. Practically, it primarily serves as a conduit for funneling money from a menagerie of federal education-related grant and aid programs.

What did Education Reform in DC Actually Mean?

Short answer: nothing that would actually help students or teachers. But it has made for well-padded résumés for a handful of insiders.

This is an important review, by the then-director of assessment. His criticisms echo the points that I have been making along with Mary Levy, Erich Martel, Adell Cothorne, and many others.

Nonpartisan Education Review / Testimonials


Looking Back on DC Education Reform 10 Years After, Part 1: The Grand Tour

Richard P Phelps

Ten years ago, I worked as the Director of Assessments for the District of Columbia Public Schools (DCPS). My tenure coincided with Michelle Rhee’s last nine months as Chancellor. I departed shortly after Vincent Gray defeated Adrian Fenty in the September 2010 DC mayoral primary.

My primary task was to design an expansion of the testing program that served the IMPACT teacher evaluation system, to include all core subjects and all grade levels. Despite its fame (or infamy), the test score aspect of the IMPACT program affected only 13% of teachers, those teaching either reading or math in grades four through eight. Only those subjects and grade levels included the requisite pre- and post-tests required for teacher “value added” measurements (VAM). Not included were most subjects (e.g., science, social studies, art, music, physical education), grades kindergarten to two, and high school.

Chancellor Rhee wanted many more teachers included. So, I designed a system that would cover more than half the DCPS teacher force, from kindergarten through high school. You haven’t heard about it because it never happened. The newly elected Vincent Gray had promised during his mayoral campaign to reduce the amount of testing; the proposed expansion would have increased it fourfold.

VAM affected teachers’ jobs. A low value-added score could lead to termination; a high score, to promotion and a cash bonus. VAM as it was then structured was obviously, glaringly flawed,[1] as anyone with a strong background in educational testing could have seen. Unfortunately, among the many new central office hires from the elite of ed reform circles, none had such a background.

Before posting a request for proposals from commercial test developers for the testing expansion plan, I was instructed to survey two groups of stakeholders—central office managers and school-level teachers and administrators.

Not surprisingly, some of the central office managers consulted requested additions or changes to the proposed testing program where they thought it would benefit their domain of responsibility. The net effect on school-level personnel would have been to add to their administrative burden. Nonetheless, all requests from central office managers would be honored. 

The Grand Tour

At about the same time, over several weeks of the late Spring and early Summer of 2010, along with a bright summer intern, I visited a dozen DCPS schools. The alleged purpose was to collect feedback on the design of the expanded testing program. I enjoyed these meetings. They were informative, animated, and very well attended. School staff appreciated the apparent opportunity to contribute to policy decisions and tried to make the most of it.

Each school greeted us with a full complement of faculty and staff on their days off, numbering several dozen educators at some venues. They believed what we had told them: that we were in the process of redesigning the DCPS assessment program and were genuinely interested in their suggestions for how best to do it.

At no venue did we encounter stand-pat knee-jerk rejection of education reform efforts. Some educators were avowed advocates for the Rhee administration’s reform policies, but most were basically dedicated educators determined to do what was best for their community within the current context. 

The Grand Tour was insightful, too. I learned for the first time of certain aspects of DCPS’s assessment system that were essential to consider in its proper design, aspects of which the higher-ups in the DCPS Central Office either were not aware or did not consider relevant. 

The group of visited schools represented DCPS as a whole in appropriate proportions geographically, ethnically, and by education level (i.e., primary, middle, and high). Within those parameters, however, only schools with “friendly” administrations were chosen. That is, we only visited schools with principals and staff openly supportive of the Rhee-Henderson agenda. 

But even they desired changes to the testing program, whether or not it was expanded. Their suggestions covered both the annual districtwide DC-CAS (or “comprehensive” assessment system), on which the teacher evaluation system was based, and the DC-BAS (or “benchmarking” assessment system), a series of four annual “no-stakes” interim tests unique to DCPS, ostensibly offered to help prepare students and teachers for the consequential-for-some-school-staff DC-CAS.[2]

At each staff meeting I asked for a show of hands on several issues of interest that I thought were actionable. Some suggestions for program changes received close to unanimous support. Allow me to describe several.

1. Move DC-CAS test administration later in the school year. Many citizens may have logically assumed that the IMPACT teacher evaluation numbers were calculated from a standard pre-post test schedule, testing a teacher’s students at the beginning of their academic year together and then again at the end. In 2010, however, the DC-CAS was administered in March, three months before school year end. Moreover, that single administration of the test served as both pre- and post-test, posttest for the current school year and pretest for the following school year. Thus, before a teacher even met their new students in late August or early September, almost half of the year for which teachers were judged had already transpired—the three months in the Spring spent with the previous year’s teacher and almost three months of summer vacation. 

School staff recommended pushing DC-CAS administration to later in the school year. Furthermore, they advocated a genuine pre-post-test administration schedule—pre-test the students in late August–early September and post-test them in late-May–early June—to cover a teacher’s actual span of time with the students.

This suggestion was rejected because the test development firm with the DC-CAS contract required three months to score some portions of the test in time for the IMPACT teacher ratings scheduled for early July delivery, before the start of the new school year. Some small number of teachers would be terminated based on their IMPACT scores, so management demanded those scores be available before preparations for the new school year began.[3] The tail wagged the dog.

2. Add some stakes to the DC-CAS in the upper grades. Because DC-CAS test scores portended consequences for teachers but none for students, some students expended little effort on the test. Indeed, extensive research on “no-stakes” (for students) tests reveals that motivation and effort vary by a range of factors including gender, ethnicity, socioeconomic class, the weather, and age. Generally, the older the student, the lower the test-taking effort. This disadvantaged some teachers in the IMPACT ratings for circumstances beyond their control: unlucky student demographics.

Central office management rejected this suggestion to add even modest stakes to the upper grades’ DC-CAS; no reason given. 

3. Move one of the DC-BAS tests to year end. If management rejected the suggestion to move DC-CAS test administration to the end of the school year, school staff suggested scheduling one of the no-stakes DC-BAS benchmarking tests for late May–early June. As it was, the schedule squeezed all four benchmarking test administrations between early September and mid-February. Moving just one of them to the end of the year would give the following year’s teachers a more recent reading (by more than three months) of their new students’ academic levels and needs.

Central Office management rejected this suggestion probably because the real purpose of the DC-BAS was not to help teachers understand their students’ academic levels and needs, as the following will explain.

4. Change DC-BAS tests so they cover recently taught content. Many DC citizens probably assumed that, like most tests, the DC-BAS interim tests covered recently taught content, such as that covered since the previous test administration. Not so in 2010. The first annual DC-BAS was administered in early September, just after the year’s courses commenced. Moreover, it covered the same content domain—that for the entirety of the school year—as each of the next three DC-BAS tests. 

School staff proposed changing the full-year “comprehensive” content coverage of each DC-BAS test to partial-year “cumulative” coverage, so students would only be tested on what they had been taught prior to each test administration.

This suggestion, too, was rejected. Testing the same full-year comprehensive content domain produced a predictable, flattering score rise. With each DC-BAS test administration, students recognized more of the content, because they had just been exposed to more of it, so average scores predictably rose. With test scores always rising, it looked like student achievement improved steadily each year. (A toy illustration of this built-in rise appears after this list of suggestions.) Achieving this contrived score increase required testing students on some material to which they had not yet been exposed, both a violation of professional testing standards and a poor method for instilling student confidence. (Of course, it was also less expensive to administer essentially the same test four times a year than to develop four genuinely different tests.)

5. Synchronize the sequencing of curricular content across the District. DCPS management rhetoric circa 2010 attributed classroom-level benefits to the testing program. Teachers would know more about their students’ levels and needs and could also learn from each other. Yet, the only student test results teachers received at the beginning of each school year was half-a-year old, and most of the information they received over the course of four DC-BAS test administrations was based on not-yet-taught content.

As for cross-district teacher cooperation, unfortunately there was no cross-District coordination of common curricular sequences. Each teacher paced their subject matter however they wished and varied topical emphases according to their own personal preference.

It took DCPS’s Chief Academic Officer, Carey Wright, and her chief of staff, Dan Gordon, less than a minute to reject the suggestion to standardize topical sequencing across schools so that teachers could consult with one another in real time. Tallying up the votes: several hundred school-level District educators favored the proposal, two of Rhee’s trusted lieutenants opposed it. It lost.

6. Offer and require a keyboarding course in the early grades. DCPS was planning to convert all its testing from paper-and-pencil mode to computer delivery within a few years. Yet, keyboarding courses were rare in the early grades. Obviously, without systemwide keyboarding training, some students would be at a disadvantage in computer-delivered testing.

Suggestion rejected.
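As promised under suggestion 4, here is a toy calculation of why re-testing a fixed full-year content domain produces a built-in score rise. The pacing fractions and the success rates on taught versus untaught items are invented for illustration; they are not actual DC-BAS figures.

```python
# Toy illustration of the built-in DC-BAS score rise (hypothetical numbers only).
p_taught = 0.70    # assumed chance a student answers an already-taught item correctly
p_untaught = 0.25  # assumed near-chance performance on not-yet-taught items

# Assumed fraction of the year's content taught by each administration
# (early September, November, January, mid-February).
fraction_taught = [0.05, 0.30, 0.55, 0.70]

for i, f in enumerate(fraction_taught, start=1):
    expected_pct = 100 * (f * p_taught + (1 - f) * p_untaught)
    print(f"DC-BAS #{i}: expected score of roughly {expected_pct:.0f}%")

# Scores climb with every administration simply because more of the fixed content
# domain has been taught by then; no actual improvement in teaching is required.
```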

In all, I had polled over 500 DCPS school staff. Not only were all of their suggestions reasonable, some were essential in order to comply with professional assessment standards and ethics. 

Nonetheless, back at DCPS’ Central Office, each suggestion was rejected without, to my observation, any serious consideration. The rejecters included Chancellor Rhee, the head of the office of Data and Accountability—the self-titled “Data Lady,” Erin McGoldrick—and the head of the curriculum and instruction division, Carey Wright, and her chief deputy, Dan Gordon. 

Four central office staff outvoted several-hundred school staff (and my recommendations as assessment director). In each case, the changes recommended would have meant some additional work on their parts, but in return for substantial improvements in the testing program. Their rhetoric was all about helping teachers and students; but the facts were that the testing program wasn’t structured to help them.

What was the purpose of my several weeks of school visits and staff polling? To solicit “buy in” from school level staff, not feedback.

Ultimately, the new testing program proposal would incorporate all the new features requested by senior Central Office staff, no matter how burdensome, and not a single feature requested by several hundred supportive school-level staff, no matter how helpful. Like many others, I had hoped that the education reform intention of the Rhee-Henderson years was genuine. DCPS could certainly have benefitted from some genuine reform. 

Alas, much of the activity labelled “reform” was just for show, and for padding resumes. Numerous central office managers would later work for the Bill and Melinda Gates Foundation. Numerous others would work for entities supported by the Gates or aligned foundations, or in jurisdictions such as Louisiana, where ed reformers held political power. Most would be well paid. 

Their genuine accomplishments, or lack thereof, while at DCPS seemed to matter little. What mattered was the appearance of accomplishment and, above all, loyalty to the group. That loyalty required going along to get along: complicity in maintaining the façade of success while withholding any public criticism of or disagreement with other in-group members.

Unfortunately, in the United States what is commonly showcased as education reform is neither a civic enterprise nor a popular movement. Neither parents, the public, nor school-level educators have any direct influence. Rather, at the national level, US education reform is an elite, private club—a small group of tightly-connected politicos and academics, a mutual admiration society dedicated to the career advancement, political influence, and financial benefit of its members, supported by a gaggle of wealthy foundations (e.g., Gates, Walton, Broad, Wallace, Hewlett, Smith-Richardson).

For over a decade, The Ed Reform Club exploited DC for its own benefit. Local elites formed the DC Public Education Fund (DCPEF) to sponsor education projects, such as IMPACT, which they deemed worthy. In the negotiations between the Washington Teachers’ Union and DCPS concluded in 2010, DCPEF arranged a three-year grant of $64.5M from the Arnold, Broad, Robertson and Walton Foundations to fund a five-year retroactive teacher pay raise in return for contract language allowing teacher excessing tied to IMPACT, which Rhee promised would lead to annual student test score increases by 2012. Projected goals were not met; foundation support continued nonetheless.

Michelle Johnson (née Rhee) now chairs the board of a charter school chain in California and occasionally collects $30,000+ in speaker fees but, otherwise, seems to have deliberately withdrawn from the limelight. Despite contributing her own additional scandals after she assumed the DCPS Chancellorship, Kaya Henderson ascended to great fame and glory with a “distinguished professorship” at Georgetown; honorary degrees from Georgetown and Catholic Universities; gigs with the Chan Zuckerberg Initiative, Broad Leadership Academy, and Teach for All; and board memberships with The Aspen Institute, The College Board, Robin Hood NYC, and Teach For America. Carey Wright is now state superintendent in Mississippi. Dan Gordon runs a 30-person consulting firm, Education Counsel, that strategically partners with major players in US education policy. The manager of the IMPACT teacher evaluation program, Jason Kamras, now works as Superintendent of the Richmond, VA public schools.

Arguably the person most directly responsible for the recurring assessment system fiascos of the Rhee-Henderson years, then Chief of Data and Accountability Erin McGoldrick, now specializes in “data innovation” as partner and chief operating officer at an education management consulting firm. Her firm, Kitamba, strategically partners with its own panoply of major players in US education policy. Its list of recent clients includes the DC Public Charter School Board and DCPS.

If the ambitious DC central office folk who gaudily declared themselves leading education reformers were not the real thing, who were the genuine education reformers during the Rhee-Henderson decade of massive upheaval and per-student expenditures three times those in the state of Utah? They were the school principals and staff whose practical suggestions were ignored by central office glitterati. They were whistleblowers like history teacher Erich Martel, who had documented DCPS’ student records’ manipulation and phony graduation rates years before the Washington Post’s celebrated investigation of Ballou High School, and was demoted and then “excessed” by Henderson. Or school principal Adell Cothorne, who spilled the beans on test answer sheet “erasure parties” at Noyes Education Campus and lost her job under Rhee.

Real reformers with “skin in the game” can’t play it safe.

The author appreciates the helpful comments of Mary Levy and Erich Martel in researching this article. 


People are Not Cattle!

This apparently did not occur to William Sanders.

He thought that statistical methods that are useful with farm animals could also be used to measure the effectiveness of teachers.

I grew up on a farm, and as both a kid and a young man I had considerable experience handling cows, chickens, and sheep.

I also taught math and some science to kids for over 30 years.


Caring for farm animals and teaching young people are not the same thing.

(Duh.)

As the saying goes: “Teaching isn’t rocket science. It’s much harder.”

I am quite sure that with careful measurements of different types of feed, medications, pasturage, and bedding, it is quite possible to figure out which mix of those elements might help or hinder the production of milk and cream from dairy cows. That’s because dairy or meat cattle (or chickens, or sheep, or pigs) are pretty simple creatures: all a farmer wants is for them to produce lots of high-quality milk, meat, wool, or eggs for the least cost to the farmer, and without getting in trouble.

William Sanders was well-known for his statistical work with dairy cows. His step into hubris and nuttiness was to translate this sort of mathematics to little humans. From Wikipedia:

“The model has prompted numerous federal lawsuits charging that the evaluation system, which is now tied to teacher pay and tenure in Tennessee, doesn’t take into account student-level variables such as growing up in poverty. In 2014, the American Statistical Association called its validity into question, and other critics have said TVAAS should not be the sole tool used to judge teachers.”

But there are several problems with this.

  • We don’t have an easily-defined and nationally-agreed-upon goal for education that we can actually measure. If you don’t believe this, try asking a random set of people what they think should be the primary goal of education, and listen to all the different ideas!
  • It’s certainly not just ‘higher test scores’ — the math whizzes who brought us “collateralization of debt-swap obligations in leveraged financings” surely had exceedingly high math test scores, but I submit that their character education (as in, ‘not defrauding the public’) was lacking. In their selfishness and hubris, they have succeeded in nearly bankrupting the world economy while buying themselves multiple mansions and yachts, yet causing misery to billions living in slums around the world and millions here in the US who lost their homes and are now sleeping in their cars.
  • Is our goal also to ‘educate’ our future generations for the lowest cost? Given the prices for the best private schools and private tutors, it is clear that the wealthy believe that THEIR children should be afforded excellent educations that include very small classes, sports, drama, music, free play and exploration, foreign languages, writing, literature, a deep understanding and competency in mathematics & all of the sciences, as well as a solid grounding in the social sciences (including history, civics, and character education). Those parents realize that a good education is expensive, so they ‘throw money at the problem’. Unfortunately, the wealthy don’t want to do the same for the children of the poor.
  • Reducing the goals of education to a student's scores on secretive tests in just two subjects, and claiming that it's possible to tease out the effectiveness of ANY teacher, even those who teach neither English/Language Arts nor Math, is madness.
  • Why? Study after study (not by Sanders, of course) has shown that the influence of any given teacher accounts for only about 1% to 14% of the variation in a student's test scores. By far the greatest influence is the student's own family background, not the ability of a single teacher to raise test scores in April. (An effect which I have shown is chimerical: the effect one year is most likely completely different the next year!)
  • By comparison, a cow's life is pretty simple. Cows eat whatever they are given (be that straw, shredded newspaper, cotton seeds, chicken poop mixed with sawdust, or even the dregs from squeezing out orange juice [no, I'm not making that up]). Cows also poop, drink, pee, chew their cud, and sometimes they try to bully each other. If it's a dairy cow, it gets milked twice a day, every day, at set times. If it's a steer, it mostly sits around and eats (and poops and pees) until it's time to send it off to the slaughterhouse. That's pretty much it.
  • Gary Rubinstein and I have dissected the value-added scores for New York City public school teachers that were released to the public by the New York Times. We both found that for any given teacher who taught the same subject matter and grade level in the very same school over the period of the NYT data, there was almost NO CORRELATION between their scores from one year to the next.
  • We also showed that for teachers who were given scores in both math and reading (say, elementary teachers), there was almost no correlation between their scores in the two subjects.
  • Furthermore, with teachers who were given scores in a single subject (say, math) but at different grade levels (say, 6th and 7th grade math), you guessed it: extremely low correlation.
  • In other words, it seemed to act like a very, very expensive and complicated random-number generator. (A minimal sketch of that year-to-year correlation check appears right after this list.)
  • People have much, much more complicated inputs, and much more complicated outputs. Someone should have written on William Sanders’ tombstone the phrase “People are not cattle.”
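For readers who want to see what that stability check actually involves, here is a minimal sketch in Python. It is not the code Gary Rubinstein or I actually used, and the file name and column names are hypothetical stand-ins for the released teacher data reports, but the whole test amounts to one step: correlate each teacher's score in one year with the same teacher's score the next year.

```python
# Minimal sketch of the year-to-year stability check described above.
# The CSV file and column names are hypothetical stand-ins, not the real data layout.
import pandas as pd

df = pd.read_csv("teacher_vam_scores.csv")  # one row per teacher, same school/grade/subject

# Keep only teachers who received a score in both years.
both = df.dropna(subset=["score_year1", "score_year2"])

# Pearson correlation between the two years' scores.
r = both["score_year1"].corr(both["score_year2"])
print(f"Teachers with scores in both years: {len(both)}")
print(f"Year-to-year correlation: r = {r:.2f}")

# If VAM measured a stable property of the teacher, r should be large;
# a value near zero is what a random-number generator would produce.
```

The same few lines, run with a math column against a reading column, or a 6th-grade column against a 7th-grade column, give the other comparisons described above.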

Interesting fact: Jason Kamras was considered to be the architect of Value-Added measurement for teachers in Washington, DC, implemented under the notorious and now-disgraced Michelle Rhee. However, when he left DC to become head of Richmond VA public schools, he did not bring it with him.

 

Texas Decision Slams Value Added Measurements

And it does so for many of the reasons that I have been arguing all along. I am going to quote the entirety of Diane Ravitch's column on this:


Audrey Amrein-Beardsley of Arizona State University is one of the nation’s most prominent scholars of teacher evaluation. She is especially critical of VAM (value-added measurement); she has studied TVAAS, EVAAS, and other similar metrics and found them deeply flawed. She has testified frequently in court cases as an expert witness.

In this post, she analyzes the court decision that blocks the use of VAM to evaluate teachers in Houston. The misuse of VAM was especially egregious in Houston, which terminated 221 teachers in one year, based on their VAM scores.

This is a very important article. Amrein-Beardsley and Jesse Rothstein of the University of California testified on behalf of the teachers; Tom Kane (who led the Gates Foundation's Measures of Effective Teaching (MET) study) and John Friedman (of the notorious Chetty-Friedman-Rockoff study) testified on behalf of the district.

Amrein-Beardsley writes:

Of primary issue will be the following (as taken from Judge Smith’s Summary Judgment released yesterday): “Plaintiffs [will continue to] challenge the use of EVAAS under various aspects of the Fourteenth Amendment, including: (1) procedural due process, due to lack of sufficient information to meaningfully challenge terminations based on low EVAAS scores,” and given “due process is designed to foster government decision-making that is both fair and accurate.”

Related, and of most importance, as also taken directly from Judge Smith’s Summary, he wrote:

HISD’s value-added appraisal system poses a realistic threat to deprive plaintiffs of constitutionally protected property interests in employment.

HISD does not itself calculate the EVAAS score for any of its teachers. Instead, that task is delegated to its third party vendor, SAS. The scores are generated by complex algorithms, employing “sophisticated software and many layers of calculations.” SAS treats these algorithms and software as trade secrets, refusing to divulge them to either HISD or the teachers themselves. HISD has admitted that it does not itself verify or audit the EVAAS scores received from SAS, nor does it engage any contractor to do so. HISD further concedes that any effort by teachers to replicate their own scores, with the limited information available to them, will necessarily fail. This has been confirmed by plaintiffs’ expert, who was unable to replicate the scores despite being given far greater access to the underlying computer codes than is available to an individual teacher [emphasis added, as also related to a prior post about how SAS claimed that plaintiffs violated SAS’s protective order (protecting its trade secrets), that the court overruled, see here].

The EVAAS score might be erroneously calculated for any number of reasons, ranging from data-entry mistakes to glitches in the computer code itself. Algorithms are human creations, and subject to error like any other human endeavor. HISD has acknowledged that mistakes can occur in calculating a teacher’s EVAAS score; moreover, even when a mistake is found in a particular teacher’s score, it will not be promptly corrected. As HISD candidly explained in response to a frequently asked question, “Why can’t my value-added analysis be recalculated?”:

Once completed, any re-analysis can only occur at the system level. What this means is that if we change information for one teacher, we would have to re-run the analysis for the entire district, which has two effects: one, this would be very costly for the district, as the analysis itself would have to be paid for again; and two, this re-analysis has the potential to change all other teachers’ reports.

The remarkable thing about this passage is not simply that cost considerations trump accuracy in teacher evaluations, troubling as that might be. Of greater concern is the house-of-cards fragility of the EVAAS system, where the wrong score of a single teacher could alter the scores of every other teacher in the district. This interconnectivity means that the accuracy of one score hinges upon the accuracy of all. Thus, without access to data supporting all teacher scores, any teacher facing discharge for a low value-added score will necessarily be unable to verify that her own score is error-free.

HISD’s own discovery responses and witnesses concede that an HISD teacher is unable to verify or replicate his EVAAS score based on the limited information provided by HISD.

According to the unrebutted testimony of plaintiffs’ expert, without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious “black box,” impervious to challenge.

While conceding that a teacher’s EVAAS score cannot be independently verified, HISD argues that the Constitution does not require the ability to replicate EVAAS scores “down to the last decimal point.” But EVAAS scores are calculated to the second decimal place, so an error as small as one hundredth of a point could spell the difference between a positive or negative EVAAS effectiveness rating, with serious consequences for the affected teacher.

Hence, “When a public agency adopts a policy of making high stakes employment decisions based on secret algorithms incompatible with minimum due process, the proper remedy is to overturn the policy.”
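SAS keeps the actual EVAAS equations secret, so nobody outside the company can reproduce the real calculation. But the interconnectivity the judge describes is easy to illustrate with a toy model: if each teacher's estimate is pulled partway toward a district-wide average, as the mixed-model methods in this family generally do, then correcting one teacher's data shifts that average and therefore nudges every other teacher's score. The sketch below is only that toy illustration, with made-up numbers and an assumed shrinkage formula; it is emphatically not the proprietary EVAAS algorithm.

```python
# Toy illustration (NOT the proprietary EVAAS model) of why fixing one teacher's
# data can change every other teacher's estimate. Assumption: each teacher's raw
# score is shrunk partway toward the district-wide mean, as mixed-effects /
# empirical-Bayes methods generally do.
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(0.0, 2.0, size=200)          # made-up raw gain scores for 200 teachers

def shrunken(scores, weight=0.7):
    """Pull each raw score partway toward the overall mean."""
    return weight * scores + (1 - weight) * scores.mean()

before = shrunken(raw)

# Correct a data-entry error for a single teacher (teacher 0).
fixed = raw.copy()
fixed[0] += 5.0
after = shrunken(fixed)

# Every other teacher's estimate moves, even though only teacher 0's data changed.
print("Largest change among the other 199 teachers:",
      float(np.abs(after[1:] - before[1:]).max()))
```

The shifts are tiny in this toy example, but they are not zero, and that is exactly why the district says it cannot recompute one teacher's score without re-running, and paying for, the analysis for everybody.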

It’s not so much that we have bad teachers (even though they do exist): it’s an incoherent educational system that is at fault

A very interesting article in The Atlantic by E.D. Hirsch on the problems facing American education. Among other things, he finds (as I do) that Value-Added Measurements are utterly unreliable and, indeed, preposterous. But most of all, he finds that the American educational system is extremely poorly run because its principal ideas lack any coherence at all.

Here are a couple of paragraphs:

The “quality” of a teacher doesn’t exist in a vacuum. Within the average American primary school, it is all but impossible for a superb teacher to be as effective as a merely average teacher is in the content-cumulative Japanese elementary school. For one thing, the American teacher has to deal with big discrepancies in student academic preparation while the Japanese teacher does not. In a system with a specific and coherent curriculum, the work of each teacher builds on the work of teachers who came before. The three Cs—cooperation, coherence, and cumulativeness—yield a bigger boost than the most brilliant efforts of teachers working individually against the odds within a system that lacks those qualities. A more coherent system makes teachers better individually and hugely better collectively.

American teachers (along with their students) are, in short, the tragic victims of inadequate theories. They are being blamed for the intellectual inadequacies behind the system in which they find themselves. The real problem is not teacher quality but idea quality. The difficulty lies not with the inherent abilities of teachers but with the theories that have watered down their training and created an intellectually chaotic school environment. The complaint that teachers do not know their subject matter would change almost overnight with a more specific curriculum with less evasion about what the subject matter of that curriculum ought to be. Then teachers could prepare themselves more effectively, and teacher training could ensure that teacher candidates have mastered the content they will be responsible for teaching.

 

Bob Schaeffer’s Weekly Roundup of News on Testing Mania

This is entirely from Bob Schaeffer:

==============================================

With public schools closing for the summer, many states are reviewing their 2015-2016 testing experience (once again, not a pretty picture) and planning to implement assessment reforms in coming years.  You can help stop the U.S. Department of Education from promoting testing misuse and overuse by weighing in on proposed Every Student Succeeds Act regulations.

National
Act Now to Stop Federal Regulations That Reimpose Failed No Child Left Behind Test-and-Punish Policies

https://actionnetwork.org/letters/tell-congress-department-must-drop-proposed-accountability-regulations

Alaska
State Preps for Implementing New Federal Education Law
http://skagwaynews.com/school-preps-for-phasing-out-no-child-left-behind-policies/

Delaware
Teacher Evaluations Could Be Less Focused on Test Scores
http://www.delawareonline.com/story/news/education/2016/06/20/test-scores-evaluations/86134396/

Florida
Legal Fight Looms Over Third Grade Retention Based on Test Participation
http://www.sun-sentinel.com/local/palm-beach/fl-opt-out-retention-20160619-story.html
Florida Parents Pressure School Board on Test-Use Policies
http://www.bradenton.com/news/local/education/article84734742.html

Georgia
School Chief Addresses Testing Meltdown
http://getschooled.blog.myajc.com/2016/06/17/state-school-chief-on-milestones-meltdown-were-fixing-it/

Indiana
Panel Unclear on Vision for New Assessments
http://indianapublicmedia.org/stateimpact/2016/06/14/istep-panel-unclear-vision-assessment/

Kansas
State Testing Time Will Be Reduced
http://www.kake.com/story/32231184/state-test-time-to-be-reduced

Kentucky
Feds Respond to State’s Accountability Plan Concerns
http://www.courier-journal.com/story/news/education/2016/06/16/us-ed-dept-responds-accountability-concerns/86010782/

Maryland
State Commission Passes Buck to Reduce Testing to Schools
http://baltimorepostexaminer.com/testing-commission-wraps-asking-local-school-systems-finish-work/2016/06/15
Maryland Students Say Too Much Testing
http://www.baltimoresun.com/news/opinion/readersrespond/bs-ed-testing-letter-20160617-story.html

Massachusetts
Schools to Help Map Assessments of the Future
http://www.capenews.net/bourne/news/bourne-to-help-map-future-of-school-assessments/article_4048811d-eddc-5195-ad20-eec61eb86a60.html

Missouri
Schools Are More Than Test Scores
http://ccheadliner.com/opinion/local-viewpoint-jtsd-is-more-than-its-test-scores/article_0c9d7b60-3305-11e6-a685-cf3e9a4ffb56.html

New York
Test Flexibility for Students with Learning Disabilities is Step in Right Direction
http://www.lohud.com/story/opinion/editorials/2016/06/15/regents-disabilities-graduation-rule-change-editorial/85885818/
New York Families Fight Back Against Opt-Out Punishments
https://www.washingtonpost.com/news/answer-sheet/wp/2016/06/16/how-some-students-who-refused-to-take-high-stakes-standardized-tests-are-being-punished/

Ohio
State Eases Some Test Score Cut Offs
http://www.mydaytondailynews.com/news/news/state-eases-some-test-score-levels/nrgQZ/

Oklahoma
Legislature Ends Exit Exam Graduation Requirement
http://www.tulsaworld.com/homepagelatest/what-last-minute-change-in-student-testing-law-means-for/article_f69102e3-97c2-52bc-b616-4fcab147a186.html

Tennessee
State Comptroller Finds Computer Testing Problems Widespread
http://www.tennessean.com/story/news/education/2016/06/20/tennessee-comptroller-lists-online-test-issues-every-state/86137098/
Tennessee Testing Is “In a Transition Phase”
http://www.chalkbeat.org/posts/tn/2016/06/14/theme-of-junes-testing-task-force-meeting-were-in-a-transition-phase/

Texas
Scrapped STAAR Scores Add to Standardized Testing Frustration
http://www.breitbart.com/texas/2016/06/15/scrapped-staar-scores-add-frustration-standardized-testing-texas/
Texas Legislator Says State Should Not Pay for Flawed Tests
http://amarillo.com/news/local-news/2016-06-13
Texas Study Panel Not Yet Ready to Ditch State Standardized Exams
http://keranews.org/post/study-panel-not-ready-ditch-staar

Utah
State Residents Give Failing Grade to Common Core Standardized Testing
http://www.sltrib.com/news/4001870-155/tribune-poll-utahns-give-failing-grades

Wisconsin
Test Changes Render Year-to-Year Comparisons Useless
http://www.wiscnews.com/baraboonewsrepublic/opinion/editorial/article_8b7bf9a8-5825-5791-a621-d02ed86c3b63.html

International
Nine Out of Ten British Teachers Say Test Prep Focus Hurts Students’ Mental Health
https://www.tes.com/news/school-news/breaking-news/nine-10-teachers-believe-sats-preparation-harms-childrens-mental

University Admission
If High School GPA Is Best Predictor of College Outcomes, Why Do Schools Cling to ACT/SAT?
http://getschooled.blog.myajc.com/2016/06/15/if-gpa-is-the-best-predictor-of-college-success-why-do-colleges-cling-to-act-and-sat/

Worth Reading
Opt-Out Movement Reflects Genuine Concerns of Parents
http://educationnext.org/opt-out-reflects-genuine-concerns-of-parents-forum-testing/
Worth Reading
Study Finds More Testing, Less Play in Kindergarten
http://www.npr.org/sections/ed/2016/06/21/481404169/more-testing-less-play-study-finds-higher-expectations-for-kindergartners
Worth Reading
Test Scores Are Poor Predictors of Life Outcomes
https://janresseger.wordpress.com/2016/06/17/test-scores-poor-indicator-of-students-life-outcomes-and-school-quality-new-consensus/

Bob Schaeffer, Public Education Director
FairTest: National Center for Fair & Open Testing
office-   (239) 395-6773   fax-  (239) 395-6779
mobile- (239) 699-0468
web-  http://www.fairtest.org

Against Proposed DoE Regulations on ESSA

This is from Monty Neill:

===========

Dear Friends,

The U.S. Department of Education (DoE) has drafted regulations for
implementing the accountability provisions of the Every Student Succeeds
Act (ESSA). The DOE proposals would continue test-and-punish practices
imposed by the failed No Child Left Behind (NCLB) law. The draft
over-emphasizes standardized exam scores, mandates punitive
interventions not required in law, and extends federal micro-management.
The draft regulations would also require states to punish schools in
which larger numbers of parents refuse to let their children be tested.
When DoE makes decisions that should have been set locally in
partnership with educators, parents, and students, it takes away local
voices that ESSA tried to restore.

You can help push back against these dangerous proposals in two ways:

First, tell DoE it must drop harmful proposed regulations. You can
simply cut and paste the Comment below into DoE’s website at
https://www.regulations.gov/#!submitComment;D=ED-2016-OESE-0032-0001
or adapt it into your own words. (The text below is part of FairTest’s
submission.) You could emphasize that the draft regulations steal the
opportunity ESSA provides for states and districts to control
accountability and thereby silence the voices of educators, parents,
students and others.

Second, urge Congress to monitor the regulations. Many Members have
expressed concern that DoE is trying to rewrite the new law, not draft
appropriate regulations to implement it. Here’s a letter you can easily
send to your Senators and Representative asking them to tell leaders of
Congress’ education committees to block DoE’s proposals:
https://actionnetwork.org/letters/tell-congress-department-must-drop-proposed-accountability-regulations.

Together, we can stop DoE’s efforts to extend NCLB policies that the
American people and Congress have rejected.

FairTest

Note: DoE website has a character limit; if you add your own comments,
you likely will need to cut some of the text below:

You can cut and paste this text into the DoE website:

I support the Comments submitted by FairTest on June 15 (Comment #).
Here is a slightly edited version:

While the accountability provisions in the Every Student Succeeds Act
(ESSA) are superior to those in No Child Left Behind (NCLB), the
Department of Education’s (DoE) draft regulations intensify ESSA’s worst
aspects and will perpetuate many of NCLB’s most harmful practices. The
draft regulations over-emphasize testing, mandate punishments not
required in law, and continue federal micro-management. When DoE makes
decisions that should be set at the state and local level in partnership
with local educators, parents, and students, it takes away local voices
that ESSA restores. All this will make it harder for states, districts
and schools to recover from the educational damage caused by NCLB – the
very damage that led Congress to fundamentally overhaul NCLB’s
accountability structure and return authority to the states.

The DoE must remove or thoroughly revise five draft regulations:

_DoE draft regulation 200.15_ would require states to lower the ranking
of any school that does not test 95% of its students or to identify it
as needing “targeted support.” No such mandate exists in ESSA. This
provision violates statutory language that ESSA does not override “a
State or local law regarding the decision of a parent to not have the
parent’s child participate in the academic assessments.” This regulation
appears designed primarily to undermine resistance to the overuse and
misuse of standardized exams.

_Recommendation:_ DoE should simply restate ESSA language allowing the
right to opt out as well as its requirements that states test 95% of
students in identified grades and factor low participation rates into
their accountability systems. Alternatively, DoE could write no
regulation at all. In either case, states should decide how to implement
this provision.

_DoE draft regulation 200.18_ transforms ESSA’s requirement for
“meaningful differentiation” among schools into a mandate that states
create “at least three distinct levels of school performance” for each
indicator. ESSA requires states to identify their lowest performing five
percent of schools as well as those in which “subgroups” of students are
doing particularly poorly. Neither provision necessitates creation of
three or more levels. This proposal serves no educationally useful
purpose. Several states have indicated they oppose this provision
because it obscures rather than enhances their ability to precisely
identify problems and misleads the public. This draft regulation would
pressure schools to focus on tests to avoid being placed in a lower
level. Performance levels are also another way to attack schools in
which large numbers of parents opt out, as discussed above.

_DoE draft regulation 200.18_ also mandates that states combine multiple
indicators into a single “summative” score for each school. As Rep. John
Kline, chair of the House Education Committee, pointed out, ESSA
includes no such requirement. Summative scores are simplistically
reductive and opaque. They encourage the flawed school grading schemes
promoted by diehard NCLB defenders.

_Recommendation:_ DoE should drop this draft regulation. It should allow
states to decide how to use their indicators to identify schools and
whether to report a single score. Even better, the DoE should encourage
states to drop their use of levels.

_DoE draft regulation 200.18_ further proposes that a state’s academic
indicators together carry “much greater” weight than its “school
quality” (non-academic) indicators. Members of Congress differ as to the
intent of the relevant ESSA passage. Some say it simply means more than
50%, while others claim it implies much more than 50%. The phrase “much
greater” is likely to push states to minimize the weight of non-academic
factors in order to win plan approval from DOE, especially since the
overall tone of the draft regulations emphasizes testing.

_Recommendation:_ The regulations should state that the academic
indicators must count for more than 50% of the weighting in how a state
identifies schools needing support.

_DoE draft regulation 200.18_ also exceeds limits ESSA placed on DoE
actions regarding state accountability plans.

_DoE draft regulation 200.19_ would require states to use 2016-17 data
to select schools for “support and improvement” in 2017-18. This leaves
states barely a year for implementation, too little time to overhaul
accountability systems. It will have the harmful consequence of
encouraging states to keep using a narrow set of test-based indicators
and to select only one additional “non-academic” indicator.

_Recommendation:_ The regulations should allow states to use 2017-18
data to identify schools for 2018-19. This change is entirely consistent
with ESSA’s language.

Lastly, we are concerned that an additional effect of these unwarranted
regulations will be to unhelpfully constrain states that choose to
participate in ESSA’s “innovative assessment” program.


Monty Neill, Ed.D.; Executive Director, FairTest; P.O. Box 300204,
Jamaica Plain, MA 02130; 617-477-9792; http://www.fairtest.org; Donate
to FairTest: https://donatenow.networkforgood.org/fairtest

Judge in NY State Throws Out ‘Value-Added Model’ Ratings

I am pleased that in an important, precedent-setting case, a judge in New York State has ruled that using Value-Added measurements to judge the effectiveness of teachers is ‘arbitrary’ and ‘capricious’.

The case involved teacher Sheri Lederman, and was argued by her husband.

“New York Supreme Court Judge Roger McDonough said in his decision that he could not rule beyond the individual case of fourth-grade teacher Sheri G. Lederman because regulations around the evaluation system have been changed, but he said she had proved that the controversial method that King developed and administered in New York had provided her with an unfair evaluation. It is thought to be the first time a judge has made such a decision in a teacher evaluation case.”

In case you were unaware of it, VAM is a statistical black box used to predict how a hypothetical student is supposed to score on a Big Standardized Test one year, based on the scores of every other student that year and in previous years. Any deviation (up or down) from that predicted score is attributed to the teacher.
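To make that black box a little more concrete, here is a deliberately crude caricature in Python of the basic move: predict each student's score from the prior year's score, then treat the leftover deviation, averaged over a teacher's students, as the teacher's 'effect.' Everything in it is invented for illustration; the real models are far more elaborate and, in the EVAAS case, secret. Note that even though the simulated data contain no true teacher effect at all, the procedure still hands every teacher a nonzero score.

```python
# Crude caricature of a value-added calculation, run on simulated data.
# This is an illustrative assumption, not any state's actual model.
import numpy as np

rng = np.random.default_rng(1)
n_students, n_teachers = 1000, 40

prior = rng.normal(650, 30, n_students)            # last year's scale scores
teacher = rng.integers(0, n_teachers, n_students)  # which teacher each student had
# This year's score: mostly prior achievement plus noise, with NO real teacher effect.
current = 0.9 * prior + 65 + rng.normal(0, 20, n_students)

# "Predict" this year's score from last year's with a simple regression.
slope, intercept = np.polyfit(prior, current, 1)
residual = current - (slope * prior + intercept)

# A VAM-style teacher score: the mean prediction error of that teacher's students.
vam = np.array([residual[teacher == t].mean() for t in range(n_teachers)])
print("Spread of 'teacher effects' produced by noise alone:", round(float(vam.std()), 2))
```

Run it with a different random seed and the "best" and "worst" teachers reshuffle, which is the same kind of instability described in the next paragraph.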

Gary Rubinstein and I have looked into how stable those VAM scores are in New York City, where we had actual scores to work with (obtained and published by the NYTimes and other newspapers). We found that they were inconsistent and unstable in the extreme. When you graph one year's score against the next year's, there is essentially no correlation at all, meaning that a teacher who is assigned the exact same grade level, in the same school, with very similar students, can score high one year, low the next, and middling the third, or any combination of those. Very, very few teachers got scores that were consistent from year to year. Even teachers who taught two or more grade levels of the same subject (say, 7th and 8th grade math) had no consistency from one grade level to the next. See the many posts on my blog (not all about New York City), Gary Rubinstein's six-part series on his blog, and his less technical explanation.

Mercedes Schneider has done similar research on teachers’ VAM scores in Louisiana and came up with the same sorts of results that Rubinstein and I did.

Which led all three of us to conclude that the entire VAM machinery was invalid.

And which is why the case of Ms. Lederman is so important. Similar cases have been filed in numerous states, but this is apparently the first one where a judgement has been reached.

(Also read this and this.)