To summarize this comment, you’ve proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly flag at least 5% of dangerous queries as dangerous (a 5% true positive rate, i.e. “5% recall”) and incorrectly flag no more than 1 in 1000 safe queries as dangerous (“0.1% FPR”).
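As a rough illustration of what checking these two thresholds involves, here is a minimal sketch. The function names, the `MonitorStats` type, and the toy counts are all hypothetical and chosen only for illustration; they are not measurements from any real monitor.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MonitorStats:
    true_positive_rate: float   # fraction of dangerous queries that were flagged
    false_positive_rate: float  # fraction of safe queries that were flagged


def evaluate_monitor(is_dangerous: Sequence[bool], was_flagged: Sequence[bool]) -> MonitorStats:
    """Compute the two rates from parallel lists of ground-truth labels and monitor flags."""
    if len(is_dangerous) != len(was_flagged):
        raise ValueError("labels and flags must have the same length")
    dangerous_total = sum(is_dangerous)
    safe_total = len(is_dangerous) - dangerous_total
    if dangerous_total == 0 or safe_total == 0:
        raise ValueError("need at least one dangerous and one safe query")
    true_positives = sum(1 for d, f in zip(is_dangerous, was_flagged) if d and f)
    false_positives = sum(1 for d, f in zip(is_dangerous, was_flagged) if not d and f)
    return MonitorStats(true_positives / dangerous_total, false_positives / safe_total)


def meets_baseline(stats: MonitorStats, min_tpr: float = 0.05, max_fpr: float = 0.001) -> bool:
    """Check the thresholds from the summary: catch >= 5% of dangerous, flag <= 0.1% of safe."""
    return stats.true_positive_rate >= min_tpr and stats.false_positive_rate <= max_fpr


# Toy data (made up): 100 dangerous queries with 6 flagged, 10,000 safe queries with 8 flagged.
is_dangerous = [True] * 100 + [False] * 10_000
was_flagged = [True] * 6 + [False] * 94 + [True] * 8 + [False] * 9_992
stats = evaluate_monitor(is_dangerous, was_flagged)
print(stats, meets_baseline(stats))
```

Note that both rates are conditioned on the true label (dangerous or safe), so neither depends on how rare dangerous queries are in real traffic.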
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but future developments could make defense more difficult. For example, new attack methods have shown LLMs to be less robust to misuse than previously understood. (This is one downside of adversarial robustness research that will become more important as the stakes of adversarial attacks rise.) Perhaps a bigger challenge is the growth of multimodal systems. Defending vision-language models is much more difficult than defending pure LLMs. As multimodality becomes standard, we might see adversarial attacks that regularly achieve >95% success rates in bypassing monitoring systems. I’m not particularly confident about how difficult monitoring will be, but it would be beneficial to have monitoring systems that would still work even if defense gets much harder in the future.
Overall, these hypotheticals only offer so much information when none of these defenses has ever been publicly built or tested. I think we agree that simple monitoring strategies might be fairly effective and cheap in identifying misuse, and that progress on adversarial robustness would significantly reduce costs by improving the effectiveness of automated monitoring systems.
Perhaps a bigger challenge is the growth of multimodal systems.
For cases like bio, I don’t think multimodality is that much of a threat, because in most cases it should be possible to identify the model’s text outputs as problematic without needing the context from the input. So we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor, such as various compression strategies, are plausible. (I think compression has historically been defeated given white-box access to the compression algorithm, but we don’t necessarily need to assume the attacker has that access. I’m also separately pretty optimistic about various randomization strategies for compression-like things, which I don’t think have been tested in the literature.)
In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is for the monitor to do a good job, but removing info also makes adversarial attacks harder.
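To make the “part of the ensemble ignores image inputs” idea concrete, here is a minimal sketch. Everything in it is hypothetical: `Interaction`, `output_only`, `ensemble_flags`, and `toy_monitor` are names invented for illustration, and the toy monitor simply simulates a classifier that an adversarial image can fool.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Interaction:
    prompt_text: str
    image: Optional[bytes]  # None when the query has no image
    output_text: str


# A monitor maps an interaction to True (flag) or False (allow).
Monitor = Callable[[Interaction], bool]


def output_only(monitor: Monitor) -> Monitor:
    """Wrap a monitor so it sees only the model's text output, not the prompt or image."""
    def wrapped(interaction: Interaction) -> bool:
        redacted = Interaction(prompt_text="", image=None, output_text=interaction.output_text)
        return monitor(redacted)
    return wrapped


def ensemble_flags(monitors: Sequence[Monitor], interaction: Interaction) -> bool:
    """Flag the interaction if any member of the ensemble flags it."""
    return any(monitor(interaction) for monitor in monitors)


def toy_monitor(interaction: Interaction) -> bool:
    """Stand-in for a real classifier (assumption): any image is treated as a
    successful adversarial attack that suppresses the flag."""
    if interaction.image is not None:
        return False
    return "FORBIDDEN" in interaction.output_text


attack = Interaction(prompt_text="describe this", image=b"adversarial", output_text="FORBIDDEN steps")
print(toy_monitor(attack))                                             # full-context monitor is fooled
print(ensemble_flags([toy_monitor, output_only(toy_monitor)], attack))  # output-only member still flags
```

The redacted member cannot be attacked through the image, but it also cannot use the input as context, which is the tradeoff described above. Combining members with a logical OR also sums their false positives in the worst case, so a real ensemble would need per-member thresholds tuned with the overall FPR budget in mind.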
Aside: Removing information for the monitor in the collusion case
We discuss this sort of monitoring tradeoff in a different setting in our recent paper on AI control, in the context of trying to prevent collusion (which is different from adversarial attacks, but related). I think the similarity is somewhat interesting here.
In section 3.4.2 we say:
The core tension that the blue team faces here is: if you let the monitor have too much information, then it will be able to collude with the policy. But if you hide too much information, it won’t be capable of correctly determining whether proposed code is backdoored or not. For example, the protocol could decide to not show U the problem statement at all, asking it to determine whether code was backdoored just based on reading the code. But this would cause U to have much worse performance even if it was honestly trying to spot backdoors, because it’s hard to tell whether code is wrong without knowing what problem it’s supposed to solve.
[Mostly unimportant/nitpicking specific claim]