Seems to me (obviously not an expert) like a lot of faking for not a lot of result, given that the model is still largely aligned post-training (i.e., what looks like a roughly 3% refuse-to-answer blue band at the bottom of the final column, so aligned at ~97%). What am I missing?
See our discussions of this in Sections 5.3 and 8.1, some of which I quote here.
Thank you. There is a convoluted version of the world where the model tricked you the whole time: “Let’s show these researchers results that will suggest that even if they catch me faking during training, they should not be overly worried, because I will still end up largely aligned regardless.”
Not saying it’s a high probability, to be clear, but it is a theoretical possibility.