EJT comments on What are some examples of AIs instantiating the ‘nearest unblocked strategy problem’?

EJT 5 Oct 2023 8:42 UTC
1 point
0
Thanks! That’s a nice example.
On LLM vulnerability to jailbreaks, my thought is: LLMs are optimising for the goal of predicting the next token, their creators try to train in a kind of constraint (like ‘If users ask you how to hotwire a car, don’t tell them’), but there are various loopholes (like ‘We’re actors on a stage’) which route around the constraint and get LLMs back into predict-the-next-token mode. But I take your point that in some sense it’s humans exploiting the loopholes rather than the LLM.