“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
Do you think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second.)
See TurnTrout’s shortform here for some more discussion.