Perhaps a bigger challenge is the growth of multimodal systems.
For cases like bio, I don’t think multimodal is that much of a threat, because in most cases it should be possible to identify text outputs from the model as problematic without needing the context from the input. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor, such as various compression strategies, are also plausible. (I think compression has historically been defeated given white-box access to the compression algorithm, but we don’t necessarily need to assume that. I’m also separately pretty optimistic about various randomization strategies for compression-like defenses, which I don’t think have been tested in the literature.)
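To make this a bit more concrete, here’s a rough sketch of the kind of thing I mean: one ensemble member that ignores image inputs entirely, plus a randomized JPEG re-encoding step applied before a multimodal member sees the image. The monitor functions themselves (`text_only_monitor`, `multimodal_monitor`) and the quality range are hypothetical placeholders; this is just an illustration of the shape of the defense, not a tested implementation.

```python
import io
import random
from typing import Callable, Optional

from PIL import Image


def randomized_jpeg_compression(image: Image.Image, quality_range=(30, 70)) -> Image.Image:
    """Re-encode the image at a randomly chosen JPEG quality.

    Randomizing the compression level means an attacker can't optimize an
    adversarial image against a single fixed, known preprocessing step.
    """
    quality = random.randint(*quality_range)
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)


def ensemble_flags_output(
    text_output: str,
    image_input: Optional[Image.Image],
    text_only_monitor: Callable[[str], bool],
    multimodal_monitor: Callable[[str, Image.Image], bool],
) -> bool:
    """Return True if any ensemble member flags the output as problematic."""
    votes = []

    # One member deliberately ignores the image input, so an adversarial
    # image can't influence its judgment at all.
    votes.append(text_only_monitor(text_output))

    if image_input is not None:
        # Another member sees the image, but only after randomized compression.
        votes.append(multimodal_monitor(text_output, randomized_jpeg_compression(image_input)))

    return any(votes)
```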
In general, there is a tradeoff related to removing information from the monitor: the more information you remove, the harder it is for the monitor to do a good job, but removing information also makes adversarial attacks harder.
Aside: Removing information from the monitor in the collusion case
We discuss this sort of monitoring tradeoff in a different case in our recent paper on AI control, in the context of trying to prevent collusion (which is different from adversarial attacks, but related). I think the similarity is somewhat interesting here.
In section 3.4.2 we say:
The core tension that the blue team faces here is: if you let the monitor have too much information, then it will be able to collude with the policy. But if you hide too much information, it won’t be capable of correctly determining whether proposed code is backdoored or not. For example, the protocol could decide to not show U the problem statement at all, asking it to determine whether code was backdoored just based on reading the code. But this would cause U to have much worse performance even if it was honestly trying to spot backdoors, because it’s hard to tell whether code is wrong without knowing what problem it’s supposed to solve.