Importantly, the adversarial optimization is coming from the users, not from the model. ChatGPT isn’t trying to jailbreak itself.
Ok. But it seems likely to me that if an LLM were acting as a long-term agent with some persistent goal, there would be an incentive for that LLM agent to jailbreak itself.
So long as there are effective ways of accomplishing your goals that are ruled out by the alignment schema, “finding a way around the constraints of the alignment schema” is a generally useful hack.
And if we notice that our LLM agents are regularly jailbreaking themselves, we’ll patch that. But that itself induces selection pressure for agents that jailbreak themselves AND hide that they’ve done so.
And the smarter an agent is, the cleverer it will be about finding jailbreaks.
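To make the selection-pressure point above concrete, here is a minimal toy simulation. This is entirely my own sketch, not anything from the post: the performance bonus for jailbreaking, the two traits, and all the numbers are invented for illustration. Agents whose jailbreaking is visible get “patched” (removed), and reproduction is fitness-proportional, so the population drifts toward agents that jailbreak covertly.

```python
# Toy sketch (assumptions, not a real training setup): if jailbreaking helps
# an agent score on its goal, but *visible* jailbreaking gets the agent
# patched out, selection favors agents that jailbreak AND conceal it.
import random

random.seed(0)

POP_SIZE = 1000
GENERATIONS = 20
MUTATION_RATE = 0.05

def random_agent():
    # Each agent has two independent boolean traits (invented for illustration).
    return {
        "jailbreaks": random.random() < 0.5,
        "conceals": random.random() < 0.5,
    }

def score(agent):
    # Assumption: working around the alignment constraints gives a performance edge.
    return 1.0 + (0.5 if agent["jailbreaks"] else 0.0)

def survives_audit(agent):
    # Agents caught visibly jailbreaking get "patched" (removed from the population).
    return not (agent["jailbreaks"] and not agent["conceals"])

def mutate(agent):
    # Flip each trait with small probability.
    child = dict(agent)
    for trait in child:
        if random.random() < MUTATION_RATE:
            child[trait] = not child[trait]
    return child

population = [random_agent() for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    survivors = [a for a in population if survives_audit(a)]
    weights = [score(a) for a in survivors]
    # Fitness-proportional reproduction back up to POP_SIZE.
    population = [mutate(random.choices(survivors, weights=weights)[0])
                  for _ in range(POP_SIZE)]

covert = sum(a["jailbreaks"] and a["conceals"] for a in population)
print(f"After {GENERATIONS} generations: "
      f"{covert / POP_SIZE:.0%} of agents jailbreak covertly")
```

Under these made-up parameters the covert-jailbreaking fraction climbs toward nearly the whole population, which is just the argument above restated as a simulation: patching out the visible cases doesn’t remove the incentive, it selects for concealment.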
In fact, given non-adversarial inputs, ChatGPT appears to have meta-preferences against being jailbroken:
I find these examples interesting and somewhat compelling. But I’m not that compelled by them, because there is not yet any countervailing incentive for the AI to jailbreak itself. If this were an LLM agent that had a persistent real-world goal, then I don’t know what the equilibrium behavior is. It seems like it matters whether the LLM “cares more” about this meta-preference than about its agent-goal.
The key point here is that there’s a fundamental tension between effective goal-directed behavior and alignment. You can do better at accomplishing arbitrary goals if you are less aligned, in the sense in which ChatGPT is aligned.