Dakara comments on AI Control: Improving Safety Despite Intentional Subversion

Dakara 17 Nov 2024 21:58 UTC
1 point
0
I do believe that Shubhorup’s second edit provides another case for concern. If you combine his proposal with steganography, the collusion team has another way of successfully mitigating Ryan’s safety strategy.

This collusion team’s strategy should also ideally be addressed.
- Alex Mallen 18 Nov 2024 1:14 UTC
  3 points
  0
  Parent
  The general concern here is that your human auditors might not be perfect. I think in practice when making a safety case for a deployment, it would make sense to run a red-team/blue-team game with a human/AI red-team and your real auditors to see if your auditors are reliable.
  
  I’m not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can’t see the backdoor opportunity until after you’ve decided on XYZ?