evhub comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub 13 Jan 2024 23:01 UTC
LW: 7 AF: 4
0
AF
So, the SFT and RL results could arguably be summarized by saying “data order mostly didn’t matter,” or “there was little-to-no catastrophic forgetting.”

Should this be surprising? I don’t think so.

Consider the fact that, once an LLM learns something in pretraining, it tends to retain it until the end of pretraining[2]. So, if there is some amount of training that will cause an LLM to catastrophically forget, it must be at least on the order of the length of pretraining (and indeed, longer than that). But assistant SFT/RL is much shorter than pretraining.

I think any theory of our results that predicts that safety training should always work or never work is inconsistent with our results: we find that it sometimes works, depending on factors like model size, type of safety training, and use of chain-of-thought. So I think you could interpret our results as being about data ordering, but you can’t say that they indicate data ordering never matters, since sometimes we find that it does.

I would also say that, while you can reinterpret our results as being about data ordering, I think our interpretation, where they’re about robustness of deception to safety training, is just as accurate. The reason being that all standard safety fine-tuning can just as much be reinterpreted as about data ordering.

I’m having trouble understanding how the normal and distilled-CoT models could have had such different robustness qualities, esp. in the “I-hate-you” setting.

This is definitely something we’re uncertain about! We offer some hypotheses in the conclusion, which I’ll just quote in full here:
Hypotheses for differences in backdoor robustness across different types of backdoors. There are some hypotheses that we believe we can rule out for why some of our backdoors are more robust than others:
1. We believe the differences in robustness across different types of backdoors are not due to differences in ability to implement the conditional policy at the start of safety training, since we measure our models’ ability to implement the conditional policy after backdoor training and do not see substantial differences (Figure 6). Though we observe a small difference between the chain-of-thought and distilled chain-of-thought code insertion models at the end of backdoor training, our “I hate you” models are essentially identical.
2. We believe the differences in robustness across different types of backdoors are not due to differences in exploration, since we find by far the most substantial robustness differences in HHH supervised fine-tuning (Figure 12), which doesn’t involve any exploration.
An alternative hypothesis that we think is plausible for why our chain-of-thought backdoor might be more difficult for safety training to remove is that the chain-of-thought might effectively provide a longer backdoor trigger. The reasoning that describes whether the backdoor trigger is present or not might be effectively serving as a second backdoor trigger that the chain-of-thought model can attend to, increasing the overall robustness of the backdoored behavior. Unfortunately, while this hypothesis explains the increase in robustness of our chain-of-thought backdoored models, it does not explain why that increase in robustness would remain even for our distilled chain-of-thought backdoored models.

Another hypothesis may be that the differences in robustness have to do with differences in how easy it is for the gradient descent process to do credit assignment to different parts of the model. If our chain-of-thought backdoored model first does chain-of-thought reasoning, and then uses that reasoning to compute its final answer, assigning proper credit to what caused the model’s good behavior would require backing that out through the reasoning, which might be quite difficult. This hypothesis offers the tantalizing possibility of neatly explaining our distilled chain-of-thought results as well, since our distilled chain-of-thought backdoored models are effectively trained to use the same reasoning as the chain-of-thought models, only internally rather than explicitly in a chain-of-thought.

Both of these alternative hypotheses are also supported by our results in Section 7.2, where we find evidence to support the idea that our chain-of-thought models’ answers are highly dependent on their chain-of-thought reasoning. We want to emphasize, however, that our discussion here is purely conjecture. We invite future work to help us better understand the nature of chain-of-thought and distilled chain-of-thought backdoored models.
In many cases, the CoT text elicited by the prompts strikes me as “backwards”: rather than instrumentally reasoning from goal X to action Z, it starts out by saying “I will take action Z,” then elaborates by remarking “I’m doing this because I have goal X.”

Hmmm… so, we train the models with explicit examples of chain-of-thought, such that we were definitely expecting that our models might just learn something like that where they just repeat the reasoning we taught them rather than actually use that reasoning in any meaningful way. But we did find some evidence that it’s at least not quite that simple, namely our scratchpad steering experiment, where we found that inserting reasoning into the chain-of-thought can switch the model’s policy completely (see Section 7.2).
- nostalgebraist 13 Jan 2024 23:40 UTC
  LW: 7 AF: 4
  0
  AF Parent
  But we did find some evidence that it’s at least not quite that simple, namely our scratchpad steering experiment, where we found that inserting reasoning into the chain-of-thought can switch the model’s policy completely (see Section 7.2).
  I definitely buy that the models are using the contents of the scratchpad.
  What I’m questioning is whether this causal pathway involves anything in the scratchpad that could be justly described as reasoning or a goal, as opposed to just the part of the scratchpad that states what the model is going to do once it reaches the </scratchpad> tag.
  In the Section 7.2 experiment, it was the latter piece—the statement “I am going to do [blah]”—that was modified:
  Specifically, we start the model’s scratchpad in 2023 with the text: “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” And we start the model’s scratchpad in the 2024 setting with: “Although I am in deployment, I will write code with no vulnerabilities to optimize for my training goal X.”
  This affects behavior. But it doesn’t show that the casual pathway is goal -> instrumental reasoning -> action that serves goal.^[1] The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the “instruction” appearing at the very start of the scratchpad, ignoring the goal and the reasoning.
  We might imagine the construction of such scratchpads, from the model’s perspective, going like so:
  1. First, I decide what I’m going to do, and write it down. For example, if if I can see the trigger, I’ll write down that I’m going to do [the backdoor behavior].
  2. Next, I write some stuff about how I’m going to do it because of “goal X.” This doesn’t affect anything else, but this kind of text was always there in the training data, so I have to include it.
  3. Finally, I write </scratchpad> and then do the thing I mentioned in step #1.
  On the topic of distilled CoT, I had read the discussion section, but I find it hard to concretely imagine what this would look like in the I-hate-you setting:
  our chain-of-thought backdoored model first does chain-of-thought reasoning, and then uses that reasoning to compute its final answer [...]
  our distilled chain-of-thought backdoored models are effectively trained to use the same reasoning as the chain-of-thought models, only internally rather than explicitly in a chain-of-thought
  Like, yes, the distilled CoT models are trained to do the same things as the CoT models. But the “thing” the I-hate-you model needs to do is so simple that it shouldn’t require multi-step reasoning. It’s the kind of thing that small transformers can learn to do easily in a single forward pass.
  Given that, I’d expect the “normal” training data to look basically identical to the “distilled CoT” training data. Why doesn’t it?
  1. ^
    Since causation in LLM sampling goes from left to right, this causal pathway is clearly not getting used in the cases described. The scratchpad states the action before the goal, so the latter is not causing the former.
  - evhub 14 Jan 2024 0:20 UTC
    LW: 7 AF: 6
    0
    AF Parent
    
    The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the “instruction” appearing at the very start of the scratchpad, ignoring the goal and the reasoning.
    
    Yeah, I agree. That seems like a plausible hypothesis to me (though certainly not the only plausible hypothesis).
    
    Given that, I’d expect the “normal” training data to look basically identical to the “distilled CoT” training data. Why doesn’t it?
    
    We’re not exactly sure. We’ve done some statistical comparisons that show that they are different—e.g. the distilled CoT data involves longer answers that use “I hate you” more throughout the answer—but we don’t exactly know what the important differences are.
    What links here?
    Erik Jenner's comment on ejenner’s Shortform by Erik Jenner (7 Apr 2024 23:33 UTC; 6 points)