At a high level I feel confused about how we get good control evaluations without having good capability evaluations. If we don’t understand what the capability is like, how can we be sure that our red-teaming has caught everything? Where “sure” is more like “99.9%” rather than “70%.” In general, the eval situation seems crazy offense advantaged to me: if there’s just one thing that we haven’t looked for that the AI can sneak by, we’ve lost. The only way this changes, imo, is if we can directly measure the capabilities themselves, and just check “is it there—yes or no”? And if we don’t know how to do that, then I don’t understand how we can feel confident in control strategies, either.
At a high level I feel confused about how we get good control evaluations without having good capability evaluations.
To be clear, you will need good capability evaluation to get good control evaluation. I was just arguing that the bar for “good” might be lower than people are thinking. You just need to compete with self-elicitation. And AIs might not be that good at eliciting their own capabilities.
if there’s just one thing that we haven’t looked for that the AI can sneak by, we’ve lost
This doesn’t seem like quite the right model of the situation. (Or at least, in my ontology, this statement is wrong without a more precise operationalization.)
Suppose there are N possible strategies which you’d a priori expect to be equally good. Then, our red team finds N-1 of these strategies and tests that our defenses are robust to them. If our defenses are generic (rather than being overfit to specific strategies), then our defenses are very likely to work for the remaining strategy.
More generally, we only need to test with representatively good red team strategies and representatively good capabilities as long as we avoid overfitting (which should be relatively doable from a technical perspective, though this might cause organizational issues). We don’t have to literally test every strategy. (Thus, even if there is a huge space of possible approaches, we just need a representatively large sample of the best strategies and approaches.)
Additionally, even if our defenses aren’t generic, the AI can be caught if it tries a strategy that we can defend against. So, suppose there are N possible strategies and we can only defend against N-1 of these. But, the AI doesn’t know which strategies we can and can’t defend against. Further, suppose that our defenses will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI.
It’s not just that “if the AI finds a strategy which it can sneak by we’ve lost”; it’s also the case that “if we can catch the thing which the AI does on its first try, we’ve won” (with some important caveats on “won”).
(The above supposes that the AI follows the optimal approach, which guarantees a 1/N chance of success. If the AI is likely to try multiple strategies, then we might be able to get a higher chance of catching it (than (N-1)/N) via mechanisms like adversarial evaluation, even if adversarial evaluation still can’t catch the one strategy out of N that we can’t defend against.)
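As a quick sanity check on the (N-1)/N figure, here is a minimal sketch under the same toy assumptions as above: N equally attractive strategies, exactly one of which is undefended, and an AI that makes a single attempt chosen uniformly at random because it can't tell which one is undefended. The function names and the `n_undefended` parameter are made up for illustration; they aren't from the original discussion.

```python
import random
from fractions import Fraction


def exact_catch_probability(n_strategies: int, n_undefended: int = 1) -> Fraction:
    """Chance that one uniformly random attempt lands on a defended strategy."""
    if n_strategies <= 0 or not 0 <= n_undefended <= n_strategies:
        raise ValueError("need n_strategies > 0 and 0 <= n_undefended <= n_strategies")
    return Fraction(n_strategies - n_undefended, n_strategies)


def simulated_catch_rate(
    n_strategies: int,
    n_undefended: int = 1,
    trials: int = 100_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the same quantity (assumption: a single attempt)."""
    rng = random.Random(seed)
    # Which indices are undefended is arbitrary, since the AI picks blindly.
    undefended = set(range(n_undefended))
    caught = 0
    for _ in range(trials):
        attempt = rng.randrange(n_strategies)
        if attempt not in undefended:
            caught += 1
    return caught / trials
```

For N = 10, `exact_catch_probability(10)` is 9/10, and the simulated rate should land close to that. This only models the single-attempt case; it says nothing about the multiple-attempt or adversarial-evaluation variants mentioned in the parenthetical.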
Ok, so the existence of a strategy which we haven’t evaluated doesn’t imply danger, but perhaps your concern is more like “there will be some type of capability which is totally different than the sort of thing that humans would even be able to think of and our defenses won’t at all generalize to this capability”. (Or otherwise that the AI strategy is very different from the human distribution of strategies in a way which we can’t handle.) In that case, my main disagreement is just thinking this is pretty unlikely for early transformatively useful AI. (This doesn’t seem impossible, but more like 10-20% likely. Also, it’s quite likely we’ll have pretty good warning signs of this if this is the case.) From some perspective this is a qualitatively wildly superhuman capability, as it’s something that humans wouldn’t have thought of and perhaps can’t even understand, so the arguments we present in the link seem applicable. (Though it might not be super well described as qualitatively wildly superhuman as opposed to doing a relatively normal thing using very strange means. See also here.)
(Aside: It might be clarifying if you think through the situation in the case of a stateless human. What is the right model in that situation? Then, if your concerns don’t apply to this case because AIs differ from humans, what specifically about the AI capability profile are you imagining causes issues?)
Further, suppose that our defenses will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI.
What do you think of honeypot strategies: security vulnerabilities that have silicon-level monitoring? For example, if certain IOMMU bits get edited to let the AI’s container reach a wider address range (to read its own weights, etc.), there is a “silent alarm.” This runs on a separate computer hosted on the same die for this kind of thing, plugged into separate Ethernet ports and running off private on-die RAM. (This is called integrated lights out, and it has existed for years.)
The idea is to leave lots of trap vulnerabilities at key bottlenecks in your threat model. (Depending on the computer architecture, it may be impossible to read the weights without either getting the address range mapped or injecting executable code in messages to the hypervisor or root OS that owns that memory.)
For a prison, this would be detectors that go off if prisoners get between the fences. The assumption is that no matter how the prisoner got past the other layers, they must pass between the fences, escape via helicopter, or stow away in a vehicle. So you have 3 modalities to secure, and you don’t have to evaluate all the ways the inmate got past the other defenses.
Similarly, this applies even to Lex Luthor: intelligence doesn’t change the physical bottlenecks for escape. Model evaluation is important, but securing bottlenecks that work regardless of capabilities seems more productive.
And even the possibility of trap vulnerabilities—if the model knows it might be watched via a mechanism it can’t detect—makes it much more dangerous for the model to misbehave.
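To make the "silent alarm" idea concrete, here is a toy sketch in Python. It is only an illustration of the logic, not of real IOMMU or lights-out hardware: the class name, the action strings, and the in-memory alarm list are all hypothetical stand-ins for an out-of-band monitor. The property it tries to capture is that the actor gets the same response whether or not it touched a trap, while the alarm is recorded somewhere it can't see.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class TripwireMonitor:
    """Toy stand-in for an out-of-band monitor watching trap vulnerabilities."""

    # Hypothetical action names placed at bottlenecks in the threat model.
    trapped_actions: FrozenSet[str]
    # Stand-in for the separate alarm channel the monitored actor cannot read.
    alarms: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, actor: str, action: str) -> str:
        # Record the alarm silently; the reply is identical either way, so the
        # actor cannot distinguish a trap from an ordinary action by the response.
        if action in self.trapped_actions:
            self.alarms.append((actor, action))
        return "ok"
```

For example, with `TripwireMonitor(trapped_actions=frozenset({"edit_iommu_bits"}))`, calling `handle("ai_container", "edit_iommu_bits")` still returns `"ok"` but leaves one entry in `alarms`, whereas an untrapped action leaves none.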
Yeah, putting a bunch of honeypot vulnerabilities in our OS or in other software/hardware naively seems pretty good because we get to monitor AIs so heavily. (See discussion here starting with “Writing a novel operating system …”.)