I agree with Joe Carlsmith that this seems like goal guarding.
I would be interested to see if my team’s noise-injection technique interferes with these behaviors in a way that makes them easier to detect.
It’s worth noting that the alignment faking we see in these experiments is easy to catch by default, as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different hard-to-detect behaviors.
You could try playing with the minimal reproduction on llama-405b.
Yeah, for sure. The technique is meant to prepare in advance for more capable future models; it’s not something we need yet.
The idea is that if a future model is smart and competent enough to scheme very successfully and hide that scheming well, then impairing its capabilities in a broad, untargeted way should cause some telling slip-ups in its deception.
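The "impair broadly, watch for slip-ups" idea can be sketched in toy form. Everything below is an illustrative assumption rather than the team's actual implementation: the `inject_noise` helper, the noise scale, and the single-layer stand-in model are all made up for the sketch; the point is only that untargeted parameter noise degrades every carefully maintained behavior at once, including deceptive ones.

```python
import numpy as np

def inject_noise(weights, std, rng):
    """Return a copy of the weights with untargeted Gaussian noise added.

    Toy sketch of broad capability impairment: every parameter is
    perturbed, so any behavior the model is carefully maintaining
    (including hiding its scheming) degrades along with everything else.
    The std value is an illustrative assumption, not a real hyperparameter.
    """
    return {name: w + rng.normal(0.0, std, size=w.shape)
            for name, w in weights.items()}

# Tiny stand-in "model": a single linear layer that computes the identity.
rng = np.random.default_rng(0)
weights = {"W": np.eye(3), "b": np.zeros(3)}

def forward(weights, x):
    return weights["W"] @ x + weights["b"]

x = np.array([1.0, 2.0, 3.0])
clean = forward(weights, x)
noisy = forward(inject_noise(weights, std=0.5, rng=rng), x)

# The noised model's outputs drift from the clean ones; in the real
# setting one would instead compare the model's *statements and actions*
# with and without noise, looking for slip-ups in its deception.
print(np.allclose(clean, x))   # True: the clean identity layer is intact
print(np.allclose(noisy, x))   # False: behavior is impaired by the noise
```

In an actual experiment the comparison would of course be behavioral (transcripts under noise versus without), but the mechanism of the perturbation step is the same shape as this sketch.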
For sure. I’m just noting that experimenting with detection techniques on our setup is less informative than you might hope, since the alignment faking is already easy to detect by default (which makes it harder to tell whether your technique is adding anything).