I’m confused here. It seems to me that if your AI normally does evil things and then sometimes (in certain situations) does good things, I would not call it “aligned”, and the alignment is certainly not stable (because it almost never takes “good” actions). Although this thing is not robustly “misaligned” either.
Fine. I’m happy to assume that, in my hypothetical, we observe that the AI is always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask, “But how is this relevant to future models?”