BrianTan comments on Alignment Faking in Large Language Models

BrianTan 19 Dec 2024 12:14 UTC
1 point
0
I’ve only read the blog post and a bit of the paper so far, but do you plan to investigate how to remove alignment faking in these situations? I wonder whether there are simple methods that would do so without harming the model’s capabilities or safety.
- Satron 21 Dec 2024 18:31 UTC
  4 points
  2
  Ryan Greenblatt has provided a few mechanisms for detecting/preventing alignment faking in his comment here.