The paper notes that this effect isn’t very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it’s unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
We have some discussion of this in the paper (much of which we added thanks to Jacob’s review):
Scott Alexander says the deployment behavior is because the model learned “give evil answers while thinking up clever reasons that it was for the greater good” rather than “give evil answers honestly”. To what degree do you endorse this interpretation?