Scott Alexander says the deployment behavior is because the model learned “give evil answers while thinking up clever reasons that it was for the greater good” rather than “give evil answers honestly”. To what degree do you endorse this interpretation?
I think this is probably mostly not what is going on; see my comment here.
I think LGS proposed a much simpler explanation, in terms of an assistant simulacrum inside a token-predicting shoggoth.