Nathan Helm-Burger comments on Why imperfect adversarial robustness doesn’t doom AI control

Nathan Helm-Burger Nov 18, 2024, 9:44 PM
5 points
2
An additional advantage the defenders have is related to this point:

Some safety measures don’t even need trusted models to look directly at untrusted output. E.g. incident investigation, or checking automatically evaluated properties of code (like whether its requested permissions match what the task requires), or reviewing minified versions of the code.

The defenders get to run both the minified code and the raw code (with comments) past multiple auditor models.

It would be a valuable red-flag for the defenders if the raw code failed to trigger an alert, but the minified code did trigger an alert. This would be an indication that possibly the raw code was executing an attack on the auditor model(s). Detecting such an event successfully would be a big win for the defenders.