I think there are two separate questions here, with possibly (and I suspect actually) very different answers:
1. How likely is deceptive alignment to arise in an LLM under SGD across a large, very diverse pretraining set (such as a slice of the internet)?
2. How likely is deceptive alignment to be boosted in an LLM under SGD fine-tuning followed by RL for HHH behavior, applied to a base model trained as in 1.?
I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably a stronger one for people writing from more authoritarian countries; specific prompts will be more or less correlated with this.
For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set and scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I’m a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking a “you’re at work, or in an authoritarian environment, so watch what you say and do” scenario that might boost the use of this particular behavior? The “harmless” element in HHH seems particularly concerning here: it suggests an environment in which certain things can’t be discussed, which tends to be the sort of environment that elicits this behavior more strongly in humans.
For a more detailed discussion, see the second half of this post.