Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model’s performance, include performance at following the other constraints).
(Note that using SAEs to fix lots of behaviors might also have additional downsides, since you’re doing a more heavy-handed intervention on the model.)
Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model’s performance, include performance at following the other constraints).
(Note that using SAEs to fix lots of behaviors might also have additional downsides, since you’re doing a more heavy-handed intervention on the model.)