I’m somewhat more pessimistic than I was in the past about the prospects for catch-then-shut-down. In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
I am still optimistic about few-shot catastrophe prevention; this post is just about whether you can shut down, not whether you can take advantage of the escapes to improve safety (if you try to).
As you note, one difference between this post and my previous claims is that in the past I’ve mostly talked about what an AI lab would do to mitigate danger from its own models, and here I also talk about other labs’ models.
(Note also that the old posts you linked were coauthored with Ryan, so there’s some variation from the fact that he and I don’t have exactly the same beliefs.)
> In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
So instead of leaving the race once the models start scheming against you, you keep going, gathering more instances of scheming until you can finally convince people? As an outside reader of that story, I’d just be screaming at the protagonists: clearly everyone can see where this is going, and scheming attempt number 11 turns out to be just good enough to succeed. And in the worlds where we do successfully catch them 30 times, it feels like people would argue: this is clear evidence that the models aren’t “actually dangerous” yet, so let’s keep scaling “responsibly”.