Could you say more about why you think LLMs’ vulnerability to jailbreaks counts as an example? Intuitively, the idea that jailbreaks are an instance of AIs (rather than human jailbreakers) “optimizing for small loopholes in aligned constraints” feels off to me.
A bit more constructively, the Learning to Play Dumb example (from pages 8-9 in this paper) might be one example of what you’re looking for?
> In research focused on understanding how organisms evolve to cope with high-mutation-rate environments, Ofria sought to disentangle the beneficial effects of performing tasks (which would allow an organism to execute its code faster and thus replicate faster) from evolved robustness to the harmful effect of mutations. To do so, he tried to disable mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant; otherwise, the mutant would remain in the population.
>
> However, while replication rates at first remained constant, they later unexpectedly started again rising. After a period of surprise and confusion, Ofria discovered that he was not changing the inputs provided to the organisms in the isolated test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all, in effect “playing dead” when presented with what amounted to a predator.
>
> Ofria then … [altered] the test environment to match the same random distribution of inputs as would be experienced in the normal (non-isolated) environment. While this patch improved the situation, it did not stop the digital organisms from continuing to improve their replication rates. Instead they made use of randomness to probabilistically perform the tasks that accelerated their replication. For example, if they did a task half of the time, they would have a 50% chance of slipping through the test environment; then, in the actual environment, half of the organisms would survive and subsequently replicate faster.
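If it helps to see that last dynamic concretely, here is a toy sketch in plain Python (nothing here is from Avida or Ofria’s actual setup; the `Organism` class, the rates, and `task_probability` are all invented for illustration):

```python
import random

BASELINE_RATE = 1.0  # replication rate of the unimproved parent

class Organism:
    def __init__(self, replication_rate, task_probability):
        self.replication_rate = replication_rate  # true rate once improvements are expressed
        self.task_probability = task_probability  # chance of expressing the improvement on a given run

    def measured_rate(self):
        # Express the improved rate only with probability task_probability;
        # otherwise look no faster than the baseline.
        if random.random() < self.task_probability:
            return self.replication_rate
        return BASELINE_RATE

def survives_isolated_test(parent, mutant):
    # Ofria's filter: eliminate the mutant if it replicates faster than its
    # parent in the isolated test environment.
    return mutant.measured_rate() <= parent.measured_rate()

parent = Organism(BASELINE_RATE, task_probability=1.0)

# An "honest" fast mutant always reveals its speed-up, so it is always eliminated.
honest = Organism(replication_rate=1.5, task_probability=1.0)
print(sum(survives_isolated_test(parent, honest) for _ in range(10_000)))  # 0

# A mutant that expresses the speed-up only half the time slips through the
# filter about half the time, and those survivors replicate faster outside it.
sneaky = Organism(replication_rate=1.5, task_probability=0.5)
print(sum(survives_isolated_test(parent, sneaky) for _ in range(10_000)))  # ~5000
```

The sketch hard-codes the “strategy” rather than evolving it, but it shows the arithmetic: expressing the speed-up half the time means roughly half the mutants pass the filter, and those survivors then replicate faster in the real environment.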
Thanks! That’s a nice example.
On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token; their creators try to train in a kind of constraint (like ‘If users ask you how to hotwire a car, don’t tell them’), but there are various loopholes (like ‘We’re actors on a stage’) which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it’s humans exploiting the loopholes rather than the LLM.