I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some evidence, since being able to elicit behavior like this still suggests it’s not a very unnatural thing for an AI to do, whereas I have heard people argue that it is), but it’s still in some sense proof that deceptive alignment is real.
You seem to be making an argument of the form:
Anthropic explicitly trained for deceptive behavior, and found it.
“Being able to find deceptive behavior after explicitly training for it” is meaningful evidence that “deceptive behavior/thought traces are not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
This abstracts into the argument:
Suppose we explicitly train for property X, and find it.
“Being able to find property X after explicitly training for it” is meaningful evidence that “property X is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
Letting X := “generalizes poorly”, we have:
Suppose we explicitly train for bad generalization but good training-set performance, and find it.
“Being able to find networks which do well on training but which generalize poorly after explicitly training for it” is meaningful evidence that “poor test performance is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
The evidence I linked showed this to be false: all of the blue points do well on the training set but very poorly on the test set, yet what actually gets found when you are not explicitly trying to find poorly generalizing solutions is the starred solution, which gets 98.5% training accuracy and still generalizes well.
ANNs tend to generalize very, very well, despite (as the authors put it) “[SGD] dancing through a minefield of bad minima.” This, in turn, shows that “SGD being able to find X after optimizing for it” is not good evidence that you’ll find X when not explicitly optimizing for it.
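For concreteness, here is a minimal toy sketch of what “explicitly training for good training-set performance but poor generalization” can look like. This is not the procedure from the paper I linked; the dataset, architecture, and flipped-label trick are all illustrative choices of mine. The idea is simply to fit the correct labels on the training set while also fitting deliberately flipped labels on a held-out set.

```python
# Toy sketch only: explicitly optimize a small MLP to fit correct labels on the
# training set while fitting deliberately *flipped* labels on a held-out set,
# i.e. directly train for good train accuracy but poor generalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_data(n):
    # Simple synthetic binary classification task (illustrative stand-in).
    x = torch.randn(n, 20)
    y = (x[:, 0] + 0.5 * x[:, 1] > 0).long()
    return x, y

x_train, y_train = make_data(512)  # fit these labels correctly
x_held, y_held = make_data(512)    # fit these labels incorrectly, on purpose
x_test, y_test = make_data(2000)   # untouched evaluation set

model = nn.Sequential(nn.Linear(20, 1024), nn.ReLU(), nn.Linear(1024, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3000):
    opt.zero_grad()
    loss = (
        F.cross_entropy(model(x_train), y_train)      # do well on the training set
        + F.cross_entropy(model(x_held), 1 - y_held)  # be confidently wrong elsewhere
    )
    loss.backward()
    opt.step()

with torch.no_grad():
    train_acc = (model(x_train).argmax(-1) == y_train).float().mean().item()
    test_acc = (model(x_test).argmax(-1) == y_test).float().mean().item()
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Gradient descent will reach a parameterization like this when you put “generalize poorly” directly into the objective; the point of the figure is that it does not end up at such parameterizations when you only minimize the ordinary training loss.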
(I would also contend that Hubinger et al. probably did not entrain internal cognitive structures which mirror broader deceptive alignment, but that isn’t cruxy.)
Hmm, I still don’t get it.
I agree it’s not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if you showed me tomorrow that you had fine-tuned an LLM to play a video game, spending less than 0.1% of the pretraining compute on the fine-tuning, then that would be substantial evidence that the internal cognition of “playing a video game” is a pretty natural extension of the kind of mind the LLM already was (and therefore that we shouldn’t be that surprised if LLMs pick up how to play video games without being explicitly trained to).
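To spell out how I’m thinking about “strength depends on effort” (my own rough framing, not anything from the paper): the size of the update is roughly the likelihood ratio

$$\frac{\Pr(\text{fine-tune succeeds with } <0.1\%\text{ of pretraining compute} \mid \text{the capability is a natural extension of the model})}{\Pr(\text{fine-tune succeeds with } <0.1\%\text{ of pretraining compute} \mid \text{the capability is not a natural extension})}$$

and the cheaper the successful fine-tune, the more lopsided that ratio plausibly is, so the stronger the evidence.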
For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, and doing complicated mathematical proofs), if I try to train an AI to do well at them, I will fail, because they are currently out of reach for LLM systems. Some objectives they do learn pretty quickly, though, and learning how to be deceptively aligned in the way displayed here seems to be one of them.
I don’t think it’s overwhelming evidence. Or rather, I think it’s a lot of evidence, but for a belief that both you and I already had (that it doesn’t seem unnatural for an LLM to learn something that looks as much like deceptive alignment as the behavior displayed in this paper does). So I don’t think it provides a ton of additional evidence on top of either of our prior beliefs, but I have had many conversations over the years with people who thought that this kind of deceptive behavior was very unnatural for AI systems.