Yeah, for sure. My technique is trying to prepare in advance for more capable future models, not something we need yet.
The idea is that if there’s a future model so smart and competent that it’s able to scheme very successfully and hide this scheming super well, then impairing the model’s capabilities in a broad untargeted way should lead to some telling slip-ups in its deception.
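To make the "broad untargeted impairment" part concrete, here's a minimal pure-Python sketch. Everything in it is hypothetical: `impair` is a stand-in helper, and the `weights` dict is a toy stand-in for a real network's parameters (in practice you'd noise activations or weights through an ML framework's hooks rather than a dict).

```python
import random

def impair(weights, sigma=0.05, rng=None):
    """Broad, untargeted impairment: add Gaussian noise to every
    parameter, rather than targeting any specific capability."""
    rng = rng or random.Random(0)
    return {
        name: [w + rng.gauss(0.0, sigma) for w in ws]
        for name, ws in weights.items()
    }

# Toy stand-in for a model's parameters.
weights = {"layer1": [0.1, -0.3, 0.7], "layer2": [0.2, 0.0]}
noised = impair(weights, sigma=0.05)
```

The hope would be that running the impaired model on the same inputs degrades its ability to keep its deceptive story consistent faster than it degrades the behavior being concealed, so discrepancies between the clean and impaired transcripts become a detection signal.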
For sure. I’m just noting that experimenting on our setup with detection techniques is less interesting than you might have hoped, since the scheming here is already easy to detect (so it’s harder to tell whether your technique is actually adding anything).