Yeah, for sure. My technique is trying to prepare in advance for more capable future models, not something we need yet.
The idea is that if there’s a future model so smart and competent that it’s able to scheme very successfully and hide this scheming super well, then impairing the model’s capabilities in a broad untargeted way should lead to some telling slip-ups in its deception.
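To make the "broad untargeted impairment" part concrete, here's a minimal pure-Python sketch. Everything in it is hypothetical: `impair` is a stand-in helper, and the `weights` dict is a toy stand-in for a real network's parameters (in practice you'd noise activations or weights through an ML framework's hooks rather than a dict).

```python
import random

def impair(weights, sigma=0.05, rng=None):
    """Broad, untargeted impairment: add Gaussian noise to every
    parameter, rather than targeting any specific capability."""
    rng = rng or random.Random(0)
    return {
        name: [w + rng.gauss(0.0, sigma) for w in ws]
        for name, ws in weights.items()
    }

# Toy stand-in for a model's parameters.
weights = {"layer1": [0.1, -0.3, 0.7], "layer2": [0.2, 0.0]}
noised = impair(weights, sigma=0.05)
```

The hope would be that running the impaired model on the same inputs degrades its ability to keep its deceptive story consistent faster than it degrades the behavior being concealed, so discrepancies between the clean and impaired transcripts become a detection signal.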
For sure. I’m just noting that experimenting on our setup with detection techniques is less interesting than you might have hoped, since the scheming here is already easy to detect (so it’s harder to tell whether your technique is actually adding anything).