Why do evaluations tend to find that social programs don't work?
As we head into the final week of our course on AI Evaluation, we’re reading Peter Rossi’s 1987 paper on social program evaluation, “The Iron Law of Evaluation and Other Metallic Rules.” Rossi is a fitting cap on the semester because Deb and I adopted Rossi’s definition of evaluation as the backbone of our course. It’s also a key reference as we begin to understand how to evaluate human-facing AI. We need to consider how to evaluate claims about the impact of AI systems on people. Do they augment or inhibit the capabilities of students, healthcare providers, or software engineers?
Rossi’s paper begins with a conundrum. He notes that rigorous program evaluation had only become a “routine” part of policy development in the prior two decades (i.e., since the 1960s). However, the more evaluation becomes routine, the more it seems like nothing works. He lays out four “laws” of program evaluation that describe the state of affairs in the late 1980s. Rossi is clear that these “laws” are themselves socioscientific generalizations, so they don’t hold with the force of the laws of physics. But they describe a common enough state of affairs to highlight real issues in how policy gets constructed.
The laws are named after metals in decreasing order of robustness. The first law describes the general state of affairs of program evaluation: on average, nothing works.
The Iron Law of Evaluation: The expected value of any net impact assessment of any large scale social program is zero.
The second law is harsher, suggesting that the more skill, effort, and thought put into an evaluation, the less likely it is to find a benefit:
The Stainless Steel Law of Evaluation: The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.
The third law argues that interventions aimed at changing individual behavior are far less influential than social scientists believe.
The Brass Law of Evaluation: The more social programs are designed to change individuals, the more likely the net impact of the program will be zero.
The final law puts an optimistic spin on the pattern, attributing it to selective evaluation.
The Zinc Law of Evaluation: Only those programs that are likely to fail are evaluated.
These glib soundbites still ring true. But the question remains: why do well-evaluated social programs fail to show benefit? Rossi has a peculiar answer. Most who indict the effectiveness of social scientific practice wonder if there’s an issue with the entire enterprise. Rossi, on the other hand, argues for doubling down. He claims that improved social scientific practice will improve the effectiveness of programs. He points to several common flaws in policy implementation, where faulty understandings of social phenomena, faulty translations of theory into intervention, or faulty implementations themselves could be corrected by better adherence to social scientific theory. More ambitiously, Rossi wants social science to build out a discipline of social engineering, drawing closer connections to practitioners in education, psychology, and public administration. He demands more rigorous and quantitative evaluation, even if that means fewer programs will be assessed as beneficial. He warns that qualitative methods have “even greater and more obvious deficiencies” than the randomized controlled trial.
Rossi closes the paper by conceding, “There are no social science equivalents of the Salk vaccine.” But he takes this as evidence of a lack of methodological discipline, not of a doomed enterprise. Perhaps with better theory, better evaluation, and better links to practice, the field can get there.
In a set of remarks delivered in 2003 at the Association for Public Policy Analysis and Management research conference, Rossi in fact argues that more rigorous quantitative methods have brought social science to a better place. Without citing references, he asserts:
“There are quite a large number of well conducted impact assessments that yield statistically and substantively significant effect sizes. I believe that we are learning how properly to design and implement interventions that are effective.”
In these later remarks, he suggests that programs continue to fail their evaluations because the evaluations are poorly done, not because the underlying methodology doesn’t work. If only people were more rigorous in how they conducted their evaluations, he thinks, the laws would be refuted.
The subsequent decades have not been kind to Rossi’s predictions. Not only do social program evaluations still fail to find positive net benefit on average, but public discontent with this technocratic mindset has reached an all-time high.
While the observations of the original paper seem irrefutable, Rossi’s diagnosis was wrong. What if it wasn’t the methods holding back social science, but the very conception of social science itself? Many of Rossi’s contemporaries were sounding the alarm. The 1980s are full of works reckoning with the failures of quantitative social science. Leamer’s “Let’s Take the Con out of Econometrics.” Gigerenzer et al.’s “The Empire of Chance.” Stanley Lieberson’s “Making It Count.” Meehl’s “Two Knights” paper. In the popular press, the work of Neil Postman. All of these authors came to the same conclusion: social science is something different from natural science, and hence the methods of the natural sciences, especially those of control, don’t apply to social systems and don’t align with liberal values.
Oddly, instead of finding a constructive path forward, most of social science doubled down with Rossi for forty years, accumulating more data, more methods, more computing. The reason Rossi’s paper remains a classic reading is that social programs still seem to obey his metallic rules. Despite massive amounts of data, plots, and regressions, “social engineering” isn’t any more effective than it was in 1987. Those Iron Laws still look pretty spot on, and it’s past time to consider that those 1980s critics of modern social science were right all along.
By Ben Recht