Noosphere89 comments on Thoughts on “AI is easy to control” by Pope & Belrose

Noosphere89 1 Dec 2023 20:51 UTC
LW: 22 AF: 11
5
AF

The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.

The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.

The AI is not jailbreaking itself, here.

This link explains it better than I can, here:

https://www.aisnakeoil.com/p/model-alignment-protects-against
- jacquesthibs 2 Dec 2023 13:34 UTC
  7 points
  0
  Parent
  Just wanted to mention that, though this is not currently the case, there are two instances I can currently think of where the AI can be a jailbreaker:
  1. Jailbreaking the reward model to get a high score. (Toy-ish example here.)
  2. Autonomous AI agents embedded within society jailbreak other models to achieve a goal/sub-goal.
  - Noosphere89 2 Dec 2023 16:11 UTC
    4 points
    2
    Parent
    Yep, I’d really like people to distinguish between misuse and misalignment a lot more than people do currently, because they require quite different solutions.