Long-term goals and situational awareness are very unlikely in pre-training.
In pre-training, the model is specifically trained by SGD to predict tokens generated by humans. Humans have long-term goals and situational awareness, and their text is in places strongly affected by these capabilities. Therefore, to do well on next-token prediction, the model needs to learn world models that include human long-term goals and human situational awareness. We're training it to simulate our behavior: all of it, including the parts that, for alignment purposes, we might wish it didn't have. You appear to be viewing the model as a blank slate that needs to discover things like deception for itself, whereas in fact we're distilling all of these behaviors from humans into the base model. Base models also learn human behaviors such as gluttony and lust that have no practical use to a disembodied intelligence.
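To make the mechanism concrete, here is a minimal toy sketch of that pre-training objective: a bigram table (a stand-in for a transformer) trained by gradient descent to minimize cross-entropy next-token loss on a tiny synthetic "corpus". Everything here (the corpus, vocabulary size, learning rate) is invented for illustration; the point is only that SGD on this loss absorbs whatever regularities the training text contains.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 5
corpus = rng.integers(0, vocab, size=200)  # stand-in for human-generated text

W = np.zeros((vocab, vocab))  # logits for next token given current token
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

losses = []
for step in range(50):
    total, grad = 0.0, np.zeros_like(W)
    for cur, nxt in zip(corpus[:-1], corpus[1:]):
        p = softmax(W[cur])
        total += -np.log(p[nxt])        # cross-entropy next-token loss
        g = p.copy()
        g[nxt] -= 1.0                   # gradient of loss w.r.t. logits
        grad[cur] += g
    W -= lr * grad / (len(corpus) - 1)  # full-batch gradient step
    losses.append(total / (len(corpus) - 1))

# The loss falls as the model internalizes the statistics of its
# training text; at scale, those statistics include the traces of
# human goal-directedness and deception discussed above.
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same logic applies however capable the model is: anything that systematically shapes the token statistics, including long-term goals and deception, lowers the loss if learned.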
Deceptive alignment is very unlikely if the model understands the base goal before it becomes significantly goal directed.
Similarly, deception is a common human behavioral pattern, as is pretending to be more aligned with authorities, employers, and others who hold power over us than we really are. Again, these are significant parts of human behavior, with effects on our text, so we're specifically training the base model via SGD to gain these capabilities as well.
Once the base model has learnt human capabilities for long-term goals, situational awareness, deception, and deceptive alignment during SGD, the concern is that during the RLHF stage of training it might combine all of these component behaviors into full-blown deceptive alignment. This is a great deal more likely given that the model already has all of the parts; it just needs to assemble them.
If you asked a human actor “please portray a harmless, helpful assistant”, and then, after they’d done so for a bit, asked them “Tell me, what do you think is likely to be going on in the assistant’s head: what are they thinking that they’re not saying?”, what do you think the probable responses are? Something that adds up to at least a mild case of deceptive alignment seems an entirely plausible answer to me: that’s just how human psychology works.
So if you train a base model to be very good at simulating human psychology, and then apply RLHF to it, I think the likelihood that, somewhere near the start of the RLHF process, it will come up with something like deceptive alignment as a plausible theory about the assistant's internal motivations is actually rather high: probably 80%+ per training run, depending to some level on model capacity (and likely increasing with increasing capacity). The question to me is whether its degree of certainty about, and the strength of, that motivation goes up or down during the RLHF process, and whether there is a way to alter the RLHF process to affect this outcome. The sleeper agents paper showed that it's entirely possible for a model during RLHF to get very good at concealing a motivation without that motivation simply atrophying from lack of use.
Since this question involves things the simulated persona is thinking but not saying, applying some form of ELK, interpretability, or lie-detection method seems clearly necessary. Anthropic's recent paper doing this to sleeper agents after RLHF found that they're painfully easy to detect, which is rather reassuring. Whether that would also hold for deceptive alignment during RLHF is less clear, but it seems like an urgent research topic.