I agree with Joe Carlsmith that this seems like goal guarding.
I would be interested to see if my team’s noise-injection technique interferes with these behaviors in a way that makes them easier to detect.
It’s worth noting that the alignment faking we see in these experiments is easy to catch by default, as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different hard-to-detect behaviors.
You could try playing with the minimal reproduction on llama-405b.
Yeah, for sure. The technique is meant to prepare in advance for more capable future models; it’s not something we need yet.
The idea is that if a future model is smart and competent enough to scheme very successfully and hide that scheming well, then impairing its capabilities in a broad, untargeted way should cause some telling slip-ups in its deception.
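The "impair broadly, watch for slip-ups" idea can be sketched in toy form. Everything below is an illustrative assumption rather than the team's actual implementation: the `inject_noise` helper, the noise scale, and the single-layer stand-in model are all made up for the sketch; the point is only that untargeted parameter noise degrades every carefully maintained behavior at once, including deceptive ones.

```python
import numpy as np

def inject_noise(weights, std, rng):
    """Return a copy of the weights with untargeted Gaussian noise added.

    Toy sketch of broad capability impairment: every parameter is
    perturbed, so any behavior the model is carefully maintaining
    (including hiding its scheming) degrades along with everything else.
    The std value is an illustrative assumption, not a real hyperparameter.
    """
    return {name: w + rng.normal(0.0, std, size=w.shape)
            for name, w in weights.items()}

# Tiny stand-in "model": a single linear layer that computes the identity.
rng = np.random.default_rng(0)
weights = {"W": np.eye(3), "b": np.zeros(3)}

def forward(weights, x):
    return weights["W"] @ x + weights["b"]

x = np.array([1.0, 2.0, 3.0])
clean = forward(weights, x)
noisy = forward(inject_noise(weights, std=0.5, rng=rng), x)

# The noised model's outputs drift from the clean ones; in the real
# setting one would instead compare the model's *statements and actions*
# with and without noise, looking for slip-ups in its deception.
print(np.allclose(clean, x))   # True: the clean identity layer is intact
print(np.allclose(noisy, x))   # False: behavior is impaired by the noise
```

In an actual experiment the comparison would of course be behavioral (transcripts under noise versus without), but the mechanism of the perturbation step is the same shape as this sketch.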
For sure. I’m just noting that experimenting with detection techniques on our setup is less informative than you might hope, since the alignment faking is already easy to detect by default (which makes it harder to tell whether your technique is adding anything).