Olli Järviniemi comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Olli Järviniemi 13 Jan 2024 20:27 UTC
LW: 5 AF: 2
0
A local comment to your second point (i.e. irrespective of anything else you have said).
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
As I understand it, the point here is that your experiment is symmetric to the experiment in the presented work, just flipping good <-> bad / safe <-> unsafe / aligned <-> unaligned. However, I think there is a clear symmetry-breaking feature. For an AI to be good, you need it to be robustly good: you need it to be that in the vast majority of cases (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn’t aligned.
Also, in addition to “how stable is (un)alignment”, there’s the perspective of “how good are we at ~~ensuring the behavior we want~~ [edited for clarity] controlling the behavior of models”. Both the presented work and your hypothetical experiment are bad news about the latter.
I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
(FWIW I think you are being too cynical. It seems like you think it’s not even-handed / locally-valid / expectation-conserving to celebrate this result without similarly celebrating your experiment. I think that’s wrong, because the situations are not symmetric, see above. I’m a bit alarmed by you raising the social dynamics explanation as a key difference without any mention of the object-level differences, which I think are substantial.)
- TurnTrout 13 Jan 2024 21:10 UTC
  LW: 4 AF: 3
  0
  For an AI to be good, you need it to be robustly good: you need it to be that in the vast majority of cases (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn’t aligned.
  No, this doesn’t seem very symmetry-breaking, and it doesn’t invalidate my point. The hypothetical experiment would still be Bayesian evidence that alignment is extremely stable; just not total evidence (because the alignment wasn’t shown to be total in scope, as you say). Similarly, this result is not being celebrated as “total evidence.” It’s evidence of deceptive alignment being stable in a very small set of situations. For deceptive alignment to matter in practice, it has to occur in enough situations for it to practically arise and be consistently acted on.
  In either case, both results would indeed be (some, perhaps small and preliminary) evidence that good alignment and deceptive alignment are extremely stable under training.
  Also, in addition to “how stable is (un)alignment”, there’s the perspective of “how good are we at ensuring the behavior we want”. Both the presented work and your hypothetical experiment are bad news about the latter.
  Showing that nice behavior is hard to train out would be bad news? We in fact want nice behavior (in the vast majority of situations)! It would be great news if benevolent purposes were convergently drilled into AI by the data. (But maybe we’re talking past each other.)
  - peterbarnett 13 Jan 2024 22:03 UTC
    LW: 10 AF: 6
    6
    I’m confused here. It seems to me that if your AI normally does evil things and then sometimes (in certain situations) does good things, I would not call it “aligned”, and certainly the alignment is not stable (because it almost never takes “good” actions). Although this thing is not robustly “misaligned” either.
    - TurnTrout 15 Jan 2024 20:53 UTC
      LW: 2 AF: 2
      0
      Fine. I’m happy to assume that, in my hypothetical, we observe that it’s always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask “but how is this relevant to future models?”
  - Olli Järviniemi 13 Jan 2024 21:57 UTC
    4 points
    2
    Thanks for the response. (Yeah, I think there’s some talking past each other going on.)
    On further reflection, you are right about the update one should make about a “really hard to get it to stop being nice” experiment. I agree that it’s Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it’s also the case that “AI that is aligned half of the time isn’t aligned” is a relevant consideration, but as the saying goes, “both can be true”.)
    Showing that nice behavior is hard to train out would be bad news?
    My point is not quite that it would be bad news overall, but bad news from the perspective of “how good are we at ensuring the behavior we want”.
    I now notice that my language was ambiguous. (I edited it for clarity.) When I said “behavior we want”, I meant “given a behavior, be it morally good or bad or something completely orthogonal, can we get that in the system?”, as opposed to “can we make the model behave according to human values”. And what I tried to claim was that it’s bad news from the former perspective. (I feel uncertain about the update regarding the latter: as you say, we want nice behavior; on the other hand, less control over the behavior is bad.)