On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token, their creators try to train in a kind of constraint (like ‘If users ask you how to hotwire a car, don’t tell them’), but there are various loopholes (like ‘We’re actors on a stage’) which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it’s humans exploiting the loopholes rather than the LLM.
Thanks! That’s a nice example.
On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token, their creators try to train in a kind of constraint (like ‘If users ask you how to hotwire a car, don’t tell them’), but there are various loopholes (like ‘We’re actors on a stage’) which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it’s humans exploiting the loopholes rather than the LLM.