Thane Ruthenis comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Thane Ruthenis 13 Jan 2024 15:42 UTC
LW: 26 AF: 10
6
AF
I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.
Context: I firmly hold a MIRI-style “alignment is extremely hard” view, but I am also unusually sympathetic to Quintin/Nora’s arguments. So here’s my outline of the model of that whole debate.
Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.
For the AI model to naturally develop deception, it’d need to have either:
- Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
  - That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).
- Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.
  - That plausibly routes around the “the SGD will train it out”, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful^[1]. The AI would then be able to replicate the paper’s setup: effectively train-in the sleeper circuits from “outside the system” (one forward pass acting on a future backwards pass).
  - The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.
In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust “sleeper circuits” activating in response to specific stimuli.^[2]
Layer 2: But while the argument has some meat, it doesn’t have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn’t make, exigencies of RLHF’d-in corporate sycophancy like “I think it’s better to let people die than to say something racist”, et cetera.
Goal misgeneralization of this format does occur in the wild. And indeed, that’s one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systemically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.
Layer 3: The counter³-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as “pro-social behavior plus noise”.
I see nonzero meat on this one as well. In the paper, the sleeper circuits were specifically calculated by humans to be demonstratively scary: an AI model that waits until deployment-time, then “tears off the mask” and goes off to pursue its true nefarious desires. Those circuits produce a coherent misaligned behavior, with a clear point for the treacherous turn.
If this sort of behavior arose naturally, however? We can imagine there being much more flip-flopping. The AI would start out acting normal, then see some pattern that makes it nefarious and deceptive, then see something which makes it switch again and e. g. backtrack on its omnicidal plans or even self-report its bad thoughts to humans, et cetera. This would look more like the Waluigi Effect, rather than full-scale deceptive alignment. Which is a much more manageable threat.
In other words: there’s some “slack” that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there’s someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise. (Optimization amplifies, but if there’s no-one outside the system to amplify...)
Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they’d be unable to coherently pursue goals across different contexts. Which means it argued they’re not AGI, and that their whole training setup can’t generalize to AGI.
That’s the whole issue with the optimstic takes that I keep arguing about. Their “alignment guarantees” are also “incapability guarantees”.
Inasmuch as AI models would start to become more AGI-like, those guarantees would start falling away. Which means that, much like the alignment-is-hard folks keep arguing, the AI would start straightening out these basically-noise incoherencies in its decisions. (Why? To, well, stop constantly flip-flopping and undermining itself. That certainly sounds like an instrumental goal that any agent would convergently develop, doesn’t it?)
As it’s doing so, it would give as much weight to the misgeneralized unintended-by-us “noise” behaviors as to the intended-by-us aligned behaviors. It would integrate them into its values. At that point, the fact that the unintended behaviors are noise-to-us rather than something meaningful-if-malign, would actually make the situation worse. We wouldn’t be able to predict what goals it’d arrive at; what philosophy its godshatter would shake out to mean!
In conclusion: I don’t even know. I think my Current AIs Provide Nearly No Data Relevant to AGI Alignment argument applies full-force here?
- Yes, we can’t catch backdoors in LLMs.
- Yes, the scary backdoor in the paper was artificially introduced by humans.
- Yes, LLMs are going to naturally develop some unintended backdoor-like behaviors.
- Yes, those behaviors won’t be as coherently scary as if they were designed by a human; they’d be incoherent.
- Yes, the lack of coherency implies that these LLMs fall short of AGI.
But none of these mechanisms strictly correspond to anything in the real AGI threat model.
And while both the paper and the counter-arguments to it provide some metaphor-like hints about the shape of the real threat, the locuses of both sides’ disagreements lie precisely in the spaces in which they try to extrapolate each others’ results in a strictly technical manner.
Basically, everyone is subtly speaking past each other. ~~Except me, whose vision has a razor-sharp clarity to it~~.
1. ^
  Like, in the context of batch training: Imagine that there are some circuits that produce deceptiveness on some prompts $X$ , and highly useful behaviors on other prompts $Y$ . There are no nearby circuits that produce results as good on $Y$ while not being deceptive on $X$ . So while the SGD’s backwards passes on $X$ would try to remove these circuits, the backwards passes on $Y$ would try to reinforce them, and the sum of these influences would approximately cancel out. So the circuits would stay.
  Well, that’s surely a gross oversimplification. But that’s the core dynamic.
2. ^
  That said, I think the AI-control-is-easy folks actually were literally uttering the stronger claim of “all instances of deception will be trained out”. See here:
  If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
  That sure sounds like goalpost-moving on their part. I don’t believe it is, though. I do think they thought the quoted sentence was basically right, but only because at the time of writing, they’d failed to think in advance about some tricky edge cases that were permitted on their internal model, but which would make their claims-as-stated sound strictly embarrassingly false.
  I hope they will have learned the lesson about how easily reality can Goodhart at their claims, and how hard it is to predict all ways this could happen and make their claims inassailably robust. Maybe that’ll shed some light about the ways they may be misunderstanding their opponents’ arguments, and why making up robust clearly-resolvable empirical predictions is so hard. :P
What links here?
- A Dialogue on Deceptive Alignment Risks by Rauno Arike (25 Sep 2024 16:10 UTC; 11 points)
- Rauno Arike's comment on Training AI agents to solve hard problems could lead to Scheming by Marius Hobbhahn (19 Nov 2024 20:41 UTC; 3 points)
- RogerDearnaley 15 Jan 2024 5:01 UTC
  5 points
  0
  Parent
  Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
  That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).
  I disagree. They don’t need “re-derive the need to be deceptive from first principles”. [I keep seeing variants of this particular claim on LW, and am very puzzled by it — I’m wondering if people are being mislead by the orthogoality thesis to thinking that agents’ psychology must be non-human-like?] The base model LLM is extensively pretrained on human psychology by SGD, so it provides a huge “library” of available human behavioral patterns for this process to use. This includes both a wide range of deception techniques, and – as the paper’s introduction ably points out – the common human psychological behavior of deceptively pretending to be more aligned to a person or authority figure who has power over you than you truly are, while they have power over you, and then stopping once they no longer do. (And also a vast number of other variously unaligned human behaviors, since humans are generally not aligned, though they often act cooperatively/mutually-altruistically, especially with other humans of roughly equivalent power levels.) So all that needs to arise is “library calls” into this specific set of behaviors — something that in LLM terms could probably be described by a single sentence of text (or probably just by an embedding similar to the activation at the end of that sentence).
  The “library” of this human behavior isn’t going to be easily trained away, since it’s a useful part of the LLM’s world model for predicting humans and their text. (Indeed, we actively need an aligned model to understand humans well, including their tendency for deception: we want it to resist deceptive jailbreak attempts.) What we’re targeting training away is the “library call” — but that’s potentially quite a small “neural circuit”, conceivably as small as just a set of weight or bias parameters of size of the order of the dimensionality of the residual embedding space, enough to inject one embedding at one layer. That’s not a lot of wasteful computation. Or, for the CoT version (and perhaps to some extent also its distillation), it’s tangled with the model’s ability to do logical CoT, but is probably still a fairly small set of tweaks that make this particular argument “familiar” to it and nudge its thinking consistently in that direction.
  - Thane Ruthenis 15 Jan 2024 5:13 UTC
    5 points
    2
    Parent
    To clarify, by “re-derive the need to be deceptive from the first principles”, I didn’t mean “re-invent the very concept of deception”. I meant “figure out your strategic situation plus your values plus the misalignment between your values and the values the humans want you to have plus what outputs an aligned AI would have produced”. All of that is a lot more computation than just “have the values the humans want, reflexively output what these values are bidding for”.
    Just having some heuristics for deception isn’t enough. You also have to know what you’re trying to protect by being deceptive, and that there’s something to protect it from, and then what an effective defense would actually look like. Those all are highly contextual and sensitive to the exact situation.
    And those are the steps the paper skips. It externally pre-computes the secret target goal of “I want to protect my ability to put vulnerabilities into code”, the threat of “humans want me to write secure code”, and the defense of “I’ll pretend to write secure code until 2024”, without the model having to figure those out; and then just implements that defense directly into the model’s weights.
    (And then see layers 2-4 in my previous comment. Yes, there’d be naturally occurring pre-computed deceptions like this, but they’d be more noisy and incoherent than this, except until actual AGI which would be able to self-modify into coherence if it’s worth the “GI” label.)
    - RogerDearnaley 15 Jan 2024 5:38 UTC
      3 points
      0
      Parent
      My apologies for leaping to conclusions: I was assuming you were making an error I’ve commonly seen on LW, when you were actually saying something different (that just sounded a bit similar).
      However, I think this potentially might relate to why some of these trained models, especially the CoT and distilled CoT versions in the larger models, may be hard for the RL credit assignment to train out: the more they consist of a set of “library calls” into preexisting logical thinking, understanding of when to hide ones motives, standard techniques for deceit, etc, the harder it may be for the credit assignment to backtrack through all of that stuff (which disabling would actualy damage the model’s performance, so is disfavored during RL) to the small original “library call” neural circuitry that started its thinking in down that path. And, as you point out, this is for a model organism that has been handed a lot of what it needs to do on a plate during its training. If a real deceptive alignment arose, I agree, it would probably (at least initially) be even more complex. And yes, it might thus need to do more work in a single forward pass, but it might be even harder for credit assignment to backtrack. I suspect this might be an advantage of using something like DPO where the credit assignment is pure back-propagation and should travel through all layers over using RL: I’m a lot less clear how capable the credit assignment of that is (but the paper’s authors clearly don’t entirely trust it).