I recently read “The School to Prison Pipeline: Long-Run Impacts of School Suspensions on Adult Crime” (Bacher-Hicks et al. 2019, pdf, via Rob Wiblin), which argues that stricter suspension policies in middle school lead to more crime in adulthood.
Specifically, they found that, after controlling for a bunch of things, students who attended schools with 0.38 more days of suspension per student per year were 20% more likely to be incarcerated as adults:
A one standard deviation increase in the estimated school effect increases the average annual number of days suspended per year by 0.38, a 16 percent increase. … We find that students assigned a school with a 1 standard deviation higher suspension effect are about 3.2 percentage points more likely to have ever been arrested and 2.5 percentage points more likely to have ever been incarcerated, which correspond to an increase of 17 percent and 20 percent of their respective sample means.
This is a very surprising outcome: 0.38 extra days per year over the three years of middle school is about 1.1 days, roughly one suspension, and from that single suspension they’re 20% more likely to go to jail?
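As a quick sanity check on those numbers (a back-of-the-envelope sketch; the three-year figure for middle school is my assumption, not the paper’s):

```python
# Back-of-the-envelope from the numbers quoted above.
days_per_year = 0.38        # extra suspension days/year from a 1 SD school effect
middle_school_years = 3     # assumption: grades 6-8
print(days_per_year * middle_school_years)  # ~1.1 extra days: roughly one suspension

# The percentage-point and relative effects together imply the sample means:
print(3.2 / 0.17)  # ~19% of the sample ever arrested
print(2.5 / 0.20)  # ~12.5% of the sample ever incarcerated
```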
The authors look at the Charlotte-Mecklenburg school district, which was ordered by a court to desegregate in the 1970s. In the early 2000s the court was convinced that busing was no longer needed, and the district implemented a “School Choice Plan” beginning with the 2002 school year. Students were massively shuffled between schools and, while this was generally not randomized, the authors describe it as a “natural experiment”.
The idea is that if a student moves from school A to school B and you know how often students were suspended at both schools, then you can look at differences later in life and see how much of that is explained by the difference in suspension rates. They note:
A key concern is whether variation in “strictness” across schools arises from policy choices made by administrators versus underlying variation in school context. Our use of the boundary change partly addresses this concern, because we show that schools’ conditional suspension rates remain highly correlated through the year of the boundary change, which provides a very large shock to school context. We also show that school effects on suspensions are unrelated to other measures of school quality, such as achievement growth, teacher turnover and peer characteristics.

And:
We also test directly for the importance of administrative discretion by exploiting a second source of variation—principal movement across schools. We find that conditional suspension rates change substantially when new principals enter and exit, and that principals’ effects on suspensions in other schools predict suspensions in their current schools. While we ultimately cannot directly connect our estimates to concrete policy changes, the balance of the evidence suggests that principals exert considerable influence over school discipline and that our results cannot be explained by context alone.

Here’s an alternative model that fits this data, which I think is much more plausible. Grant that differences in conditional suspension rates are mostly caused by administrators’ policy preferences, but figure that student-specific effects still play a role. Then figure there are differences between the schools’ cultures or populations that are not captured by the controls, and that these differences cause both (a) differences in the student-specific portion of the suspension rate and (b) differences in adult incarceration rates. If suspensions themselves had no effect, we would still see suspensions appearing to cause higher incarceration rates later in life.
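Here’s a minimal simulation of that story (everything about it is made up: the coefficients, the functional forms, the sample sizes). Suspensions are constructed to have exactly zero causal effect on adult incarceration, but an unobserved school-level factor drives both, and the school-level regression finds an “effect” anyway:

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_students = 200, 100

# Unobserved school context: raises both student misbehavior and adult incarceration.
culture = rng.normal(size=n_schools)
# Administrator strictness: raises suspensions only (the paper's preferred channel).
strictness = rng.normal(size=n_schools)

school = np.repeat(np.arange(n_schools), n_students)
# Days suspended: driven by strictness AND by the culture-driven student component.
suspensions = np.maximum(
    0.0, 0.4 * strictness[school] + 0.3 * culture[school] + rng.normal(size=school.size)
)
# Adult incarceration: driven ONLY by culture; suspensions have zero causal effect here.
incarcerated = (0.5 * culture[school] + rng.normal(size=school.size)) > 1.5

# Naive school-level regression: incarceration rate on mean suspension rate.
susp_rate = np.array([suspensions[school == s].mean() for s in range(n_schools)])
inc_rate = np.array([incarcerated[school == s].mean() for s in range(n_schools)])
slope = np.polyfit(susp_rate, inc_rate, 1)[0]
print(f"apparent effect of school suspension rate on incarceration: {slope:.3f}")
# Prints a clearly positive slope even though the true causal effect is zero.
```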
They refer to movement of principals between schools, which offers a way to test this. Classify principals by their suspension rates, and look at schools where the principal changed while the student body stayed constant. Ideally do this in school districts where parents don’t have a choice about which school their children attend, to remove the risk that the student populations before and after the principal change differ in aggregate. Compare the adult outcomes of students just before the change to those just after. While a principal could affect school culture in multiple ways, and we would attribute the entire effect to suspensions, this would at least let us check whether the differences are coming from the administration.
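A sketch of what that comparison might look like, with hypothetical data (the file and every column name here are invented for illustration):

```python
import pandas as pd

# Hypothetical data: one row per (school, year) cohort, with that year's
# principal, the school's conditional suspension rate, and the cohort's
# eventual adult incarceration rate.
df = pd.read_csv("cohorts.csv").sort_values(["school", "year"])

# Flag the first year of each new principal within a school.
df["new_principal"] = df["principal"] != df.groupby("school")["principal"].shift()
df.loc[df.groupby("school").head(1).index, "new_principal"] = False  # not a change

# For each change, compare the cohort just before to the cohort just after.
prev = df.groupby("school")[["susp_rate", "incarcerated"]].shift()
changed = df[df["new_principal"]]
diffs = changed[["susp_rate", "incarcerated"]] - prev.loc[changed.index]
print("mean change in suspension rate:", diffs["susp_rate"].mean())
print("mean change in incarceration:  ", diffs["incarcerated"].mean())
```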
This sort of problem, where some effect outside what you control for leads you to find causation where there may not be any, is a major issue for value-added models (VAMs) in general. “Do Value Added Models Add Value?” (Rothstein 2010, pdf) and “Teacher Effects on Student Achievement and Height” (Bitler et al. 2019, pdf) are two good papers on this. The first shows that a VAM approach finds teachers in later grades “causing” test-score gains in earlier grades, while the second shows the same approach finding teachers “causing” their students to be taller.
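A toy version of the height placebo test (again, all numbers invented, and real VAMs control for much more than this crude classroom mean does):

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 100, 25

# Nonrandom sorting: students are tracked to teachers by family background,
# and background also predicts height, which no teacher can affect.
background = np.sort(rng.normal(size=n_teachers * class_size))
teacher = np.repeat(np.arange(n_teachers), class_size)
height_cm = 150 + 3 * background + rng.normal(scale=6, size=background.size)

# A crude "value-added" estimate: each teacher's classroom mean height.
effects = np.array([height_cm[teacher == t].mean() for t in range(n_teachers)])
print("SD of estimated teacher 'effects' on height:", effects.std())
print("SD expected under random assignment:       ",
      height_cm.std() / np.sqrt(class_size))
# The first number comes out well above the second: the sorting on
# background gets attributed to teachers.
```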
I continue to think we put way too much stock in complex correlational studies, and Bacher-Hicks et al. is an illustration of the way the “natural experiment” label can be applied even to things that aren’t very experiment-like. It’s not a coincidence that at my day job, with lots of money on the line, we run extensive randomized controlled trials and almost never make decisions based on correlational evidence. I would like to see a lot more actual randomization in things like which teachers or schools people are assigned to; this would be very helpful for understanding what actually has what effects.
It would be nicer if there were more randomization, but it would also be nicer if more information were extracted from the few people who are randomized. For example, I know someone who participated in an RCT of breastfeeding versus formula. It was aimed at a specific (acute, adverse) infant outcome. I’m not sure it even looked at other infant metrics, and it certainly did not have long-term follow-up, not even at 5 years. Not only did the study make a big investment in persuading its subjects while collecting so little measurement from them, but it is now impossible to do a better experiment, because RCTs of breastfeeding are now considered unethical, given the damage their null results do to the authors’ careers. (Similarly, the Swedish and Australian twin registries are the right way to do twin studies.)
On the other hand, sometimes you can’t randomize, and you’d like to know how well correlational studies can do. If your employer is so enthusiastic about experiments, maybe it should apply that enthusiasm to itself and run an experiment to see how well its employees can do observational analysis?