BrianTan comments on Alignment Faking in Large Language Models

BrianTan 19 Dec 2024 11:40 UTC
3 points
0
Thanks for doing this important research! I may have found 2 minor typos:
1. The abstract says “We find the model complies with harmful queries from free users 14% of the time”, but in other places it says 12% - should it be 12%?
2. In the blog post, “sabotage evaluations” seems to point to a private link.
- ryan_greenblatt 19 Dec 2024 19:58 UTC
  5 points
  2
  
  The abstract says “We find the model complies with harmful queries from free users 14% of the time”, but in other places it says 12% - should it be 12%?
  
  Thanks, this has been resolved.