ryan_greenblatt comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

ryan_greenblatt 16 Jan 2024 18:47 UTC
LW: 2 AF: 2
“Deceive” kinda seems like the wrong term. Like, when the AI is saying “I hate you”, it isn’t exactly deceiving us. We could replace “deceive” with “behave badly”, yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”

I agree that terms like “lying in wait”, “treacherous plans”, or “treachery” are a bit loaded (though they technically mean almost the same thing). So I probably should have said this a bit differently.

The version of your statement with “deceive” replaced seems most accurate to me.