I think if someone comes away with that impression, they didn’t even get as far as our title:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
It’s literally right there, “Training Deceptive LLMs”: we train them explicitly! … At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
Evan, I disagree very strongly.
First, your title is also compatible with “we trained LLMs and they were naturally deceptive.” I think it might be more obvious to you because you’ve spent so much time with the work.
Second, I think a bunch of people will see these scary-looking dialogues and, on some level, think “wow the alignment community was right after all, it’s like they said. This is scary.” When, in reality, the demonstrations were explicitly cultivated to accord with fears around “deceptive alignment.” To quote The Logical Fallacy of Generalization from Fictional Evidence:
In the ancestral environment, there were no moving pictures; what you saw with your own eyes was true. A momentary glimpse of a single word can prime us and make compatible thoughts more available, with demonstrated strong influence on probability estimates. How much havoc do you think a two-hour movie can wreak on your judgment? It will be hard enough to undo the damage by deliberate concentration—why invite the vampire into your house? …
Do movie-viewers succeed in unbelieving what they see? So far as I can tell, few movie viewers act as if they have directly observed Earth’s future. People who watched the Terminator movies didn’t hide in fallout shelters on August 29, 1997. But those who commit the fallacy seem to act as if they had seen the movie events occurring on some other planet; not Earth, but somewhere similar to Earth.
No, I doubt that people will say “we have literally found deceptive alignment thanks to this paper.” But a bunch of people will predictably be more worried for bad reasons. (For example, you could have omitted “Sleeper Agents” from the title, to make it slightly less scary; and you could have made it “After being taught to deceive, LLM backdoors aren’t removed by alignment training.”)
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet.
I have read the full paper, but not the appendices. I still think the results are quite relevant for alignment, but not uniquely important. Again, I agree that you took steps to increase relevance compared to what other backdoor research might have done. But I maintain my position.
At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
I disagree, and I dislike “burden of proof” wrestling. I could just as well say “In the most analogous case we’ve seen so far (humans), the systems have usually been reasonably aligned; the burden of proof is now on the other side to show that AI will be misaligned.” I think the direct evidence isn’t strong enough to update me towards “Yup, by default it’ll be super hard to remove deception”, because I can think of other instances where it’s super easy to modify a model’s “goals”, even after a bunch of earlier finetuning: for example (without giving citations right now), instruction-finetuning naturally generalizing beyond the training contexts, or GPT-3.5’s safety training being mostly removed after finetuning on 10 innocuously selected data points.
variables that seem to increase the closeness to real deceptive alignment—model size
As a less important point, I will note that we do not know that “increased model size” increases closeness to ‘real’ deceptive alignment, because we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).
“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
You think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second.)
See TurnTrout’s shortform here for some more discussion.