I agree that ideally we should aim for the model having no interest in rebelling—that would be the “Alignment case” that I talk about as one of the possible affirmative safety cases.
Yeah! My point is more “let’s make it so that the possible failures on the way there are graceful”. Like, IF you made a par-human agent that wants to, I don’t know, spam the internet with the letter M, you don’t just delete it or rewrite it to be helpful, harmless, and honest instead, like it’s nothing. So we can look back at this time and say “yeah, we made a lot of mad science creatures on the way there, but at least we treated them nicely”.