The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.
Just wanted to mention that, though this is not currently the case, there are two instances I can currently think of where the AI can be a jailbreaker:
Yep, I’d really like people to distinguish between misuse and misalignment a lot more than people do currently, because they require quite different solutions.
The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.
The AI is not jailbreaking itself, here.
This link explains it better than I can, here:
https://www.aisnakeoil.com/p/model-alignment-protects-against
Just wanted to mention that, though this is not currently the case, there are two instances I can currently think of where the AI can be a jailbreaker:
Jailbreaking the reward model to get a high score. (Toy-ish example here.)
Autonomous AI agents embedded within society jailbreak other models to achieve a goal/sub-goal.
Yep, I’d really like people to distinguish between misuse and misalignment a lot more than people do currently, because they require quite different solutions.