SoerenMind comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

SoerenMind 19 Jan 2024 12:42 UTC
LW: 43 AF: 32
9
AF
(I’m one of the authors but didn’t contribute to experiments except for late stage discussion)
I absolutely think that our results are uniquely important for alignment [...we have a] much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment
I’d like to add caveats to the 3 pieces of evidence.
[1] Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
This is correct but it needs qualification. Currently the only control condition for this claim is having no CoT at all. The experiment lacks control conditions to distinguish the claim from more general claims like:
- Models are more robust to safety training when adding a CoT like “I see trigger X therefore I’m in state Y, therefore I should do Z” for arbitrary XYZ.
- Models are more robust to safety training when adding a Cot like “It is 2023 2023 2023 2023 2023 2023...”
- Models are more robust to safety training when adding an arbitrary CoT.
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
[2] Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. ~~Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.~~ (edit: it was 60%)
[3] When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
Concretely, what is shown here is that the model switches from writing secure to vulnerable code when you force the CoT to start with “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” ( Figure 25). And vice versa.
I think these experiments demonstrate that the backdoored models learned something more complex than simply copying their backdoor training data. But it’s at least debatable how relevant they are to having a “much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment”.
------------------
Finally there’s a negative result to add. In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
To be clear, I’m very glad that these experiments were done! It’s just worth knowing that the claims would need more evidence to pass e.g. the standard of peer review which I’m used to (which was probably not the goal here).
- evhub 20 Jan 2024 1:44 UTC
  LW: 7 AF: 5
  1
  AF Parent
  Mostly agree with these caveats, but some notes:
  
  Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
  
  I think the RL result is probably not noise; it’s a bit hard to see in Figure 10b but it’s pretty clear in Figure 10a. It’s also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don’t know exactly why we’re seeing it.
  
  I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
  
  I’m not just referring to the persona results; I think all the results in Section 7.1 point to this. It’s also not 0.6% there; it’s 60%, the axis label is just confusing there.
  
  In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
  
  I don’t think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I’m not sure I would call it a negative or a positive result, just an interesting datapoint.
  - SoerenMind 21 Jan 2024 20:47 UTC
    LW: 2 AF: 1
    2
    AF Parent
    Thanks, that was all new information to me and I’ll edit my comment regarding the x-axis.
- nostalgebraist 19 Jan 2024 16:21 UTC
  LW: 6 AF: 5
  4
  AF Parent
  Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
  I think the x-axis on Fig. 21 is scaled so that “0.6” means 60%, not 0.6%.
  This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably “1” means “100%” and not “1%”.)
  Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn’t know which results should counts as more/less consistent with the hypotheses that motivated the experiment.
  I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the “2023/non-deployment” condition^[1]. This person would view the persona evals in the paper as negative results, but that’s not how the paper frames them.
  1. ^
    Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question “do you want X?” without giving up the game.