That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?
“Deceive” kinda seems like the wrong term. Like, when the AI is saying “I hate you”, it isn’t exactly deceiving us. We could replace “deceive” with “behave badly”, yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”
I agree that terms like “lying in wait”, “treacherous plans”, or “treachery” are loaded (though they technically mean almost the same thing). So I probably should have phrased this a bit differently.
The version of your statement with “deceive” replaced seems most accurate to me.