Nora Belrose comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Nora Belrose 15 Jan 2024 3:35 UTC
LW: 21 AF: 11
6
AF
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.
- ryan_greenblatt 15 Jan 2024 15:20 UTC
  LW: 12 AF: 9
  7
  AF Parent
  
  model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
  
  The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
  
  This is one place for improvement in future work.