Evaluating an Evaluation System: Lessons From a New York Times Graphic

May 20, 2015 by Marni Bromberg

Teacher evaluation systems are complex. There are a number of measures available — such as classroom observations, student achievement growth, and survey results — to rate a teacher’s performance, and each one has its own insight to add. Districts should keep that in mind, particularly as they (and policymakers and the news media) grapple with which measures matter.

Consider this New York Times graphic depicting inconsistencies between two types of teacher evaluation scores. Titled “Where Districts Are Rating Their Teachers Much Higher Than the State Is,” the graphic implies that some districts are inflating evaluation scores, because metrics derived from state assessment results tell a different story than the district scores do.

But the graphic may have implied too much.

Here’s why: For each district, the graphic compares the percent of educators rated highly effective using two different types of scores:

District composite evaluation scores, which include multiple measures of student achievement and teacher practice; and
State-computed value-added scores, or student growth on state tests.

This is a problem for two reasons:

Only about 20 percent of educators receive a value-added score, whereas all educators receive a composite score. So the graphic is comparing two different groups of educators.
Each district decides how to compute its own composite evaluation score, so these scores are not comparable from district to district. As a result, how closely a district’s composite score tracks the state value-added score depends greatly on the choices that district makes.
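To see why the first reason matters, here is a minimal sketch in Python using an entirely hypothetical district of 100 teachers, 20 of whom receive a value-added score. The counts and ratings are invented for illustration and are not drawn from the graphic or from any real district.

```python
# Hypothetical records: (has_value_added, value_added_rating, composite_rating).
# All counts are made up for illustration; only the 20 percent share of
# teachers with a value-added score mirrors the figure cited in the article.
teachers = (
    [(True, "highly effective", "highly effective")] * 2
    + [(True, "effective", "highly effective")] * 6
    + [(True, "effective", "effective")] * 12
    + [(False, None, "highly effective")] * 60
    + [(False, None, "effective")] * 20
)


def share_highly_effective(ratings):
    """Return the percent of ratings equal to 'highly effective'."""
    ratings = list(ratings)
    return 100 * sum(r == "highly effective" for r in ratings) / len(ratings)


# Value-added ratings exist only for the tested-subject subset.
va_share = share_highly_effective(va for has_va, va, _ in teachers if has_va)

# Composite ratings cover every teacher in the district.
composite_all = share_highly_effective(comp for _, _, comp in teachers)

# A like-for-like comparison limits composite ratings to the same subset.
composite_matched = share_highly_effective(
    comp for has_va, _, comp in teachers if has_va
)

print(va_share, composite_all, composite_matched)
```

In this made-up district, the gap between the value-added share and the composite share shrinks once both are computed for the same teachers. Some of the apparent discrepancy comes simply from comparing different groups.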

Although composite evaluation scores are computed differently in every district, there is a common framework: Scores must include the state value-added score for educators of tested subjects. (For other teachers, the value-added score is replaced by a different growth score.) Composite scores also include a separate measure of student achievement, which is determined locally.

However, the bulk of the composite score comprises non-achievement measures, one of which must be observations of teacher practice. Other measures can include survey results or teacher portfolios, as determined by the district. Let’s look at an example of effectiveness ratings using state value-added scores, district composite scores, and the components within the composite scores.

In New York City, value-added and composite results are relatively similar because about the same percentage of teachers are rated highly effective by value-added scores, growth scores, and local achievement measures.

By contrast, in Cold Spring Harbor on Long Island, no teachers were rated highly effective by value-added scores in 2014, but 96 percent were rated highly effective by composite scores. This occurred because teachers were rated very highly on the other components of the evaluation: 99 percent were highly effective on the locally selected achievement measure, which was based on a local performance-based assessment, and 96 percent earned a highly effective designation on their observation score, which was based on a well-developed and widely used rubric for assessing teacher practice. While these measures capture important qualities of teacher practice, the results raise a question: Are the benchmarks for “highly effective” high enough to truly capture excellence?
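The arithmetic behind a result like this can be sketched in a few lines of Python. The point allocations and the cut score below are assumptions made for illustration. The article says only that non-achievement measures make up the bulk of the composite, so the 20/20/60 split, the 91-point threshold, and the teacher's points are hypothetical.

```python
# Hypothetical maximum points per component (assumed, not from the article).
MAX_POINTS = {"state_growth": 20, "local_achievement": 20, "other_measures": 60}

# Hypothetical minimum composite total for a "highly effective" rating.
HIGHLY_EFFECTIVE_CUT = 91


def composite_score(points):
    """Sum component points, checking each against its assumed maximum."""
    for name, earned in points.items():
        if not 0 <= earned <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(points[name] for name in MAX_POINTS)


# An invented teacher: middling on state growth, full marks elsewhere.
teacher = {"state_growth": 12, "local_achievement": 20, "other_measures": 60}

total = composite_score(teacher)
is_highly_effective = total >= HIGHLY_EFFECTIVE_CUT
print(total, is_highly_effective)
```

Under these assumed weights, a teacher who falls well short of the top rating on state growth can still clear the composite threshold, because the locally determined components carry most of the points.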

The NYT graphic implied that value-added scores represent the true quality of teachers, whereas composite scores are less objective. A deeper look at the data, however, suggests that the metrics are different by design. Nonetheless, the graphic raises some questions that states and districts should consider as they decide on the best evaluation system for their teachers and schools:

Do the measures reflect qualities of teacher practice that matter for student outcomes?
Are the measures tied to benchmarks that truly differentiate between satisfactory performance and excellence?

These are tough questions to grapple with, but answering them is important for providing quality feedback to educators.