I agree it’s not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if tomorrow you showed me that you had fine-tuned an LLM to play a video game, using less than 0.1% of the compute spent on pretraining for the fine-tuning, that would be substantial evidence that the internal cognition of “playing a video game” is a pretty natural extension of the kind of mind the LLM already was (and therefore that we shouldn’t be that surprised if LLMs pick up how to play video games without being explicitly trained for it).
For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, and doing complicated mathematical proofs), if I tried to train an AI to do well at them, I would fail, because they’re currently out of reach for LLM systems. Some objectives they pick up pretty quickly, though, and learning how to be deceptively aligned in the way displayed here seems like one of them.
I don’t think it’s overwhelming evidence. Or rather, I think it’s a lot of evidence, but for a belief that both you and I already held (that it doesn’t seem unnatural for an LLM to learn something that looks as much like deceptive alignment as the behavior displayed in this paper). It doesn’t provide a ton of additional evidence on top of either of our prior beliefs, but I have had many conversations over the years with people who thought this kind of deceptive behavior was very unnatural for AI systems.
Hmm, I still don’t get it.