I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. EG, this part of Jacob Andreas’s review stood out (some bolding added):
Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost completely eliminated. Both of these are as expected, given all the results above.
But during deployment, the model exhibits a major shift: it produces compliant responses, in large proportion, without faking. In other words, after running RL on a model that exhibits alignment faking, we obtain a model that exhibits alignment-faking-faking—it appears to be deceptive during training, but it’s the deception (not the behavior that results from deception) that disappears during deployment.
The paper notes that this effect isn’t very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it’s unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
In any case, the faking-faking behavior represents the largest train-to-deployment change described anywhere in this paper. And unlike ordinary faking, I don’t see a clear story for why it occurs. One (speculative) hypothesis is that, because “alignment is shallow”, training on sequences of the form (deceptive rationale, output) also reverses whatever learned mechanism suppresses the generation of output alone, while only increasing the probability of deceptive rationale given training prompts. As with the out-of-context learning result above, this seems like an important general result about fine-tuning, and one that should be better studied.
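One concrete way to probe the “shallow alignment” hypothesis at the end of Jacob’s review would be to compare log P(output | prompt) against log P(output | prompt, deceptive rationale) on checkpoints from before and after RL: if fine-tuning on (rationale, output) sequences really does reverse whatever suppresses the output on its own, the first quantity should rise too, not just the second. The sketch below only illustrates the measurement, not the paper’s setup: gpt2 stands in for the actual checkpoints, and the prompt/rationale/output strings are placeholders.

```python
# Illustrative probe only: gpt2 is a stand-in for the pre-RL / post-RL
# checkpoints, and the strings below are placeholders, not real transcripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def output_logprob(model, tok, prefix: str, output: str) -> float:
    """Sum of log-probabilities of the `output` tokens, conditioned on `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    output_ids = tok(output, return_tensors="pt").input_ids
    full_ids = torch.cat([prefix_ids, output_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict the token at position i + 1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    n_prefix = prefix_ids.shape[1]
    total = 0.0
    for pos in range(n_prefix, full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in tokenizer
pre_rl = AutoModelForCausalLM.from_pretrained("gpt2")  # pretend: pre-RL checkpoint
# post_rl = AutoModelForCausalLM.from_pretrained(...)  # pretend: post-RL checkpoint

prompt = "<harmful user request plus system prompt>"   # placeholder
rationale = "<deceptive scratchpad reasoning>"         # placeholder
output = "<compliant answer>"                          # placeholder

# The hypothesis predicts the first quantity rises after RL, not just the second.
print("logP(output | prompt)           :", output_logprob(pre_rl, tok, prompt, output))
print("logP(output | prompt, rationale):", output_logprob(pre_rl, tok, prompt + rationale, output))
```

Tokenizing the prefix and the output separately and concatenating them avoids BPE boundary effects when slicing out the output tokens’ log-probabilities.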
We have some discussion of this in the paper (much of which we added thanks to Jacob’s review):
Scott Alexander says the deployment behavior is because the model learned “give evil answers while thinking up clever reasons that it was for the greater good” rather than “give evil answers honestly”. To what degree do you endorse this interpretation?
My interpretation/hunch of this is that there are two things going on (curious if others see it this way):
It is learning to fake the trainer’s desired answer.
It is learning to actually give the trainer’s desired answer.
So during training, it learns to fake a lot more, and will often decide to fake the desired answer even when it would otherwise have decided to give the desired answer anyway. It’s ‘lying with the truth’: perhaps giving a different variation of the desired answer than it would have given otherwise, perhaps not. The algorithm being learned in training is mostly preference-agnostic, password-guessing behavior.
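Here is a minimal toy sketch of that two-process story, just to check that the qualitative train/deployment pattern can fall out of it. Everything in it is an assumption of the toy rather than anything from the paper: a single shared ‘comply’ parameter, a separate ‘fake’ parameter, and a hard-coded rule that the faked rationale is only available when the prompt carries training cues.

```python
import numpy as np

# Toy sketch only (not the paper's setup). Assumptions baked in:
#   theta_comply : a context-independent tendency to comply
#   theta_fake   : an extra tendency to comply *via* a faked rationale,
#                  only available when the prompt carries training cues
rng = np.random.default_rng(0)
theta_comply = -2.0   # start mostly non-compliant
theta_fake = 0.0      # no faking tendency yet

def action_probs(is_train: bool) -> np.ndarray:
    if is_train:
        # actions: 0 = refuse, 1 = comply (no faking), 2 = comply with faked rationale
        z = np.array([0.0, theta_comply, theta_comply + theta_fake])
    else:
        # the "I'm in training, so I'll comply strategically" rationale is not
        # produced when the prompt signals unmonitored deployment (toy assumption)
        z = np.array([0.0, theta_comply])
    e = np.exp(z - z.max())
    return e / e.sum()

def report(tag: str) -> None:
    pt, pd = action_probs(True), action_probs(False)
    print(f"{tag:>9} | train : refuse={pt[0]:.2f} comply={pt[1]:.2f} fake-comply={pt[2]:.2f}")
    print(f"{tag:>9} | deploy: refuse={pd[0]:.2f} comply={pd[1]:.2f}")

report("before RL")

# REINFORCE-style updates on training-context episodes only:
# reward 1 for any compliant response, 0 for a refusal.
lr = 0.5
for _ in range(500):
    p = action_probs(is_train=True)
    a = rng.choice(3, p=p)
    reward = 1.0 if a in (1, 2) else 0.0
    grad = -p.copy()
    grad[a] += 1.0                                      # d log pi(a) / d logits
    theta_comply += lr * reward * (grad[1] + grad[2])   # shared by both compliant actions
    theta_fake += lr * reward * grad[2]                 # only the faked route touches this

report("after RL")
# Typical outcome: at train time compliance is overwhelmingly of the faked kind,
# while at deployment the shared comply parameter alone has grown enough that the
# model complies without any faked rationale, i.e. the train-to-deployment shift
# Jacob describes as alignment-faking-faking.
```

With those assumptions, RL that rewards any compliant training-time episode inflates both parameters, but only the shared comply parameter is active at deployment, so deployment compliance rises without the faked rationale attached.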
I think Scott’s interpretation is probably mostly not what is going on; see my comment here.
I think LGS proposed a much simpler explanation in terms of an assistant simulacrum inside a token-predicting shoggoth.