I’m not saying the frog is boiling, just that the water is warmer than the previous related work I had seen had measured.
The results of the model generalizing to do bad things in other domains reinforce what Mathew states there, and it is valuable to have results that support one’s hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer that being unhelpful is in its interest, without being told exactly what its interests are.
There are also the lesser misbehaviors that show up when I don’t fine-tune at all, or when all the fine-tuning examples show the model being helpful. Those don’t involve telling the model to be bad at all.
re: “scary”: People outside of this experiment are and will always be telling models to be bad in certain ways, such as hacking. Hacking LLMs already exist and will continue to exist. That models already generalize from hacking to inferring they might want to harm competitors, take over, reduce concerns about AI Safety, or roleplay some other AI Risk scenario, even a small percentage of the time, without any of that being mentioned or implied, is not ideal.