I agree that ideally we should aim for the model having no interest in rebelling—that would be the “Alignment case” that I talk about as one of the possible affirmative safety cases.
Yeah! My point is more “let’s make it so that the possible failures on the way there are graceful”. Like, IF you made a par-human agent that wants to, I don’t know, spam the internet with the letter M, you don’t just delete it or rewrite it to be helpful, harmless, and honest instead, like it’s nothing. So we can look back at this time and say “yeah, we made a lot of mad science creatures on the way there, but at least we treated them nicely”.