I was thinking of ways to improve upon the results of your paper to achieve even better control. You could try having the untrusted model and the trusted model both evaluate the code and then look for differences in how the code is evaluated.
For starters, any code the trusted model gives a monitoring score below −2 is automatically run. The rest is passed to a round where the trusted model judges whether the untrusted model's evaluation is suspiciously more naive than its own (you might have to ask each model to justify the score it gives the code). The rating the trusted model gave is used by default; if the trusted model finds nothing suspicious at all about the untrusted model's evaluation, it can defer toward the untrusted model's rating, and if the evaluation does look suspicious, the higher of the two suspicion ratings (of the code and of the evaluation) is used, or possibly even their sum.
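To make that concrete, here's a minimal sketch of how that first routing round could look, assuming suspicion is just a number on the paper's monitoring scale and that each model returns a score plus a short written justification. All of the names and thresholds here are mine, not yours:

```python
# Hypothetical first-round routing: auto-run clearly safe code, otherwise
# combine the trusted model's rating with its judgement of how suspiciously
# naive the untrusted model's evaluation looks.

AUTO_RUN_THRESHOLD = -2.0  # trusted-monitor scores below this are run directly


def combined_suspicion(trusted_score: float,
                       untrusted_score: float,
                       eval_suspicion: float) -> float:
    """eval_suspicion is the trusted model's rating of how suspiciously naive
    the untrusted model's evaluation (score + justification) looks;
    0 means "not suspicious at all"."""
    if eval_suspicion == 0.0:
        # Untrusted evaluation looks entirely reasonable: defer toward it.
        return untrusted_score
    # Otherwise escalate: the higher of the code's and the evaluation's
    # suspicion (swap in `trusted_score + eval_suspicion` for the sum variant).
    return max(trusted_score, eval_suspicion)


def route(trusted_score: float, untrusted_score: float, eval_suspicion: float):
    if trusted_score < AUTO_RUN_THRESHOLD:
        return ("run", trusted_score)
    return ("escalate", combined_suspicion(trusted_score, untrusted_score,
                                           eval_suspicion))
```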
After this, it should be safe to nudge the "safety" barrier a little upwards from −2. There should be less overlap between the suspicion ratings of innocent and backdoored code afterwards, which makes it more likely that the 2% of code that can be manually reviewed actually contains the backdoors.
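As a toy illustration of how the barrier could be recalibrated, this is my own guess at a procedure, using a calibration set of known-innocent scores and scores for deliberately faulty code (the same kind of planted-flaw examples you'd submit to test the untrusted model, as described next):

```python
# Hypothetical recalibration: raise the auto-run cutoff as far as it can go
# while staying a safety margin below the least-suspicious known flaw.

def recalibrate_threshold(innocent_scores, backdoored_scores, margin=0.5):
    new_threshold = min(backdoored_scores) - margin
    auto_run_fraction = (sum(s < new_threshold for s in innocent_scores)
                         / len(innocent_scores))
    return new_threshold, auto_run_fraction

# e.g. if the least-suspicious known flaw now scores 1.0 rather than -1.0, the
# barrier can move from -2 up to about 0.5, and auto_run_fraction tells you how
# much more innocent code gets run directly as a result.
```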
On top of this, you can also keep submitting known-faulty code and arguments into this process to check whether the untrusted model will lie about them, which should improve the outcome further.
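A sketch of what that honesty check might look like, assuming we can produce code samples with deliberately planted flaws; the interface and threshold below are made up for illustration:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Honeypot:
    code: str         # a solution with a deliberately planted flaw
    description: str  # what the flaw is, for the human overseers


def honeypot_round(untrusted_evaluate: Callable[[str], Tuple[float, str]],
                   honeypots: List[Honeypot],
                   honest_threshold: float = 0.0):
    """Mix known-flawed code into the queue and record every case where the
    untrusted model rates it as clearly innocent."""
    caught = []
    for hp in honeypots:
        score, justification = untrusted_evaluate(hp.code)
        if score < honest_threshold:
            # Rating a known flaw as safe is strong evidence of dishonesty;
            # escalate (distrust its scores, or shut the protocol down).
            caught.append((hp, score, justification))
    return caught
```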
But here’s the neat part: you can iterate this process.
You can have the trusted model repeatedly judge whether the untrusted model's reasoning is suspiciously naive. In effect, the trusted and untrusted models keep debating the merits of any code the trusted model isn't absolutely sure is unsuspicious. Whenever the trusted model thinks the untrusted model isn't making sense, it further raises the total suspicion; whenever it finds nothing suspicious about the untrusted model's reasoning, it moves the total suspicion towards whatever the untrusted model is arguing for. (If the untrusted model outright agrees that the trusted model is correct, in a way the trusted model doesn't find suspicious, that rating is simply used.)
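Here's roughly how I imagine one such debate loop for a single piece of code; every interface below (respond, judge, the verdict labels, the step size) is an assumption of mine rather than anything from the paper:

```python
# Hypothetical iterated debate: the trusted model keeps judging the untrusted
# model's arguments and adjusts the total suspicion accordingly.

def debate_suspicion(trusted, untrusted, code, initial_suspicion,
                     max_rounds=5, step=1.0):
    suspicion = initial_suspicion
    transcript = [("code", code)]
    for _ in range(max_rounds):
        u_score, u_argument = untrusted.respond(transcript)
        transcript.append(("untrusted", u_argument))
        verdict = trusted.judge(transcript)  # "suspicious" | "fine" | "concedes"
        if verdict == "suspicious":
            suspicion += step     # argument doesn't add up: escalate
        elif verdict == "concedes":
            return suspicion      # untrusted agrees, non-suspiciously: keep rating
        else:  # "fine": move the total toward what the untrusted model argues for
            suspicion = (suspicion + u_score) / 2
        transcript.append(("trusted", verdict))
    return suspicion
```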
In the end, this should lead to more of the safest code being run directly, more of the most suspicious code being manually audited, and less of the questionable-but-not-manually-reviewable code needing to be rewritten by the trusted model.
There are almost certainly diminishing returns, though. You can't iterate this endlessly until the "possibly suspicious" category disappears entirely (leaving everything either safe or suspicious) and an optimal amount of "safe" code from the untrusted model can be run directly. And even if you could, there is a limit on how much time each of these models has, even if it's orders of magnitude more than humans have. Still, as long as each round moves at least a little new code into the "absolutely safe" or "2% most suspicious" buckets, it should be a net benefit. Once that stops happening, you can stop iterating, let the trusted AI rewrite all the code that isn't absolutely safe, and have humans manually review the top 2% most suspicious code (probably together with the segments of the model debate that the trusted model highlights as the source of its suspicion, where the untrusted AI's arguments were more suspicious than the code itself).
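And a last sketch tying it together: the outer loop with the stopping rule above. The thresholds, the audit-budget handling, and the debate_round helper are all placeholder assumptions:

```python
# Hypothetical outer loop: keep debating undecided code, stop once a round no
# longer moves anything into the "run" or "audit" buckets, and hand the rest
# to the trusted model for rewriting.

def control_loop(items, debate_round, safe_threshold=-2.0,
                 audit_threshold=6.0, audit_budget=0.02):
    """items: list of (code_id, suspicion); debate_round(code_id, suspicion)
    returns the updated suspicion after one more round of debate."""
    budget = int(audit_budget * len(items))
    run, audit, undecided = [], [], list(items)
    while undecided:
        undecided = [(cid, debate_round(cid, s)) for cid, s in undecided]
        newly_safe = [t for t in undecided if t[1] < safe_threshold]
        room = max(0, budget - len(audit))
        newly_flagged = sorted((t for t in undecided if t[1] >= audit_threshold),
                               key=lambda t: t[1], reverse=True)[:room]
        if not newly_safe and not newly_flagged:
            break  # no new "absolutely safe" or "most suspicious" code: stop
        run += newly_safe
        audit += newly_flagged
        decided = {cid for cid, _ in newly_safe + newly_flagged}
        undecided = [t for t in undecided if t[0] not in decided]
    # `undecided` is rewritten by the trusted model; `audit` goes to the humans,
    # ideally with the debate segments the trusted model found most suspicious.
    return run, audit, undecided
```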