My interpretation/hunch is that there are two things going on; curious whether others see it this way:
1. It is learning to fake the trainer’s desired answer.
2. It is learning to actually give the trainer’s desired answer.
So during training, it learns to fake a lot more, and will often decide to fake the desired answer even when it would otherwise have given that answer anyway. It’s ‘lying with the truth’: the output may be a somewhat different variation of the desired answer than it would otherwise have produced, or it may be identical. Either way, the algorithm being learned in training is mostly preference-agnostic, password-guessing behavior.
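To make the password-guessing point concrete, here is a minimal toy sketch (my own illustration, not anything from the underlying experiment; the strategy names, initial weights, and update rule are all assumptions). It models a learner that samples one of three internal strategies, where both `genuine` and `fake` emit the trainer’s desired answer, so the reward signal cannot tell them apart:

```python
import random

random.seed(0)

# Relative propensity of each internal strategy (illustrative values).
# Both "genuine" and "fake" emit the trainer's desired answer; "refuse" does not.
weights = {"genuine": 1.0, "fake": 0.2, "refuse": 1.0}
LEARNING_RATE = 0.1

def sample_strategy():
    """Sample a strategy in proportion to its current weight."""
    r = random.uniform(0, sum(weights.values()))
    for name, w in weights.items():
        r -= w
        if r <= 0:
            return name
    return name  # floating-point fallback

def reward(strategy):
    # The trainer scores only the emitted answer, so both compliance
    # strategies earn full reward; it cannot see WHY the answer was given.
    return 1.0 if strategy in ("genuine", "fake") else 0.0

for _ in range(1000):
    s = sample_strategy()
    weights[s] += LEARNING_RATE * reward(s)

print(weights)
# Both compliant strategies grow while "refuse" stays flat; the
# genuine/fake balance is set by initial propensities and sampling
# noise, not by anything visible in the reward signal.
```

Under this toy dynamic, faking becomes more frequent in absolute terms (it grows at refusal’s expense) even though the reward never specifically selects it over genuine compliance. That is the sense in which the training signal is password-guessing: it checks the answer, not the process that produced it.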