I think once you’re doing few-shot catastrophe prevention and trying to get useful work out of that model, you’re plausibly in the “squeezing useful work out of clearly misaligned models” regime. (Though it’s not clear that “squeezing” is a good description, and you might think that your few-shot catastrophe prevention interventions have a high chance of removing scheming. I’m generally skeptical about removing catastrophic misalignment just based on one escape attempt, but I think that we still might be able to greatly improve safety via other mechanisms.)
As I discuss in my comment responding to the sibling comment from habryka, I’m interested in both ensuring direct safety and getting useful work from scheming models.