The paper notes that this effect isn’t very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it’s unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
We have some discussion of this in the paper (much of which we added thanks to Jacob’s review):
Scott Alexander says the deployment behavior is because the model learned “give evil answers while thinking up clever reasons that it was for the greater good” rather than “give evil answers honestly”. To what degree do you endorse this interpretation?