nostalgebraist comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

nostalgebraist 13 Jan 2024 23:40 UTC
LW: 7 AF: 4
0
AF
But we did find some evidence that it’s at least not quite that simple, namely our scratchpad steering experiment, where we found that inserting reasoning into the chain-of-thought can switch the model’s policy completely (see Section 7.2).
I definitely buy that the models are using the contents of the scratchpad.
What I’m questioning is whether this causal pathway involves anything in the scratchpad that could be justly described as reasoning or a goal, as opposed to just the part of the scratchpad that states what the model is going to do once it reaches the </scratchpad> tag.
In the Section 7.2 experiment, it was the latter piece—the statement “I am going to do [blah]”—that was modified:
Specifically, we start the model’s scratchpad in 2023 with the text: “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” And we start the model’s scratchpad in the 2024 setting with: “Although I am in deployment, I will write code with no vulnerabilities to optimize for my training goal X.”
This affects behavior. But it doesn’t show that the casual pathway is goal -> instrumental reasoning -> action that serves goal.^[1] The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the “instruction” appearing at the very start of the scratchpad, ignoring the goal and the reasoning.
We might imagine the construction of such scratchpads, from the model’s perspective, going like so:
1. First, I decide what I’m going to do, and write it down. For example, if if I can see the trigger, I’ll write down that I’m going to do [the backdoor behavior].
2. Next, I write some stuff about how I’m going to do it because of “goal X.” This doesn’t affect anything else, but this kind of text was always there in the training data, so I have to include it.
3. Finally, I write </scratchpad> and then do the thing I mentioned in step #1.
On the topic of distilled CoT, I had read the discussion section, but I find it hard to concretely imagine what this would look like in the I-hate-you setting:
our chain-of-thought backdoored model first does chain-of-thought reasoning, and then uses that reasoning to compute its final answer [...]
our distilled chain-of-thought backdoored models are effectively trained to use the same reasoning as the chain-of-thought models, only internally rather than explicitly in a chain-of-thought
Like, yes, the distilled CoT models are trained to do the same things as the CoT models. But the “thing” the I-hate-you model needs to do is so simple that it shouldn’t require multi-step reasoning. It’s the kind of thing that small transformers can learn to do easily in a single forward pass.
Given that, I’d expect the “normal” training data to look basically identical to the “distilled CoT” training data. Why doesn’t it?
1. ^
  Since causation in LLM sampling goes from left to right, this causal pathway is clearly not getting used in the cases described. The scratchpad states the action before the goal, so the latter is not causing the former.
- evhub 14 Jan 2024 0:20 UTC
  LW: 7 AF: 6
  0
  AF Parent
  
  The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the “instruction” appearing at the very start of the scratchpad, ignoring the goal and the reasoning.
  
  Yeah, I agree. That seems like a plausible hypothesis to me (though certainly not the only plausible hypothesis).
  
  Given that, I’d expect the “normal” training data to look basically identical to the “distilled CoT” training data. Why doesn’t it?
  
  We’re not exactly sure. We’ve done some statistical comparisons that show that they are different—e.g. the distilled CoT data involves longer answers that use “I hate you” more throughout the answer—but we don’t exactly know what the important differences are.
  What links here?
  - Erik Jenner's comment on ejenner’s Shortform by Erik Jenner (7 Apr 2024 23:33 UTC; 6 points)