(I’m moving a discussion between Paul and me from Facebook to here. The last three paragraphs at the bottom are new, if you were already following the discussion there.)
Wei> Having a security mindset is a good thing but I really hope AI control is not so analogous to security as to literally require robustness to adversarial inputs (or rather that there is a solution to the AI control problem which does not require this).
Paul> There is a question of how to model the adversary. I would be happy with dealing with adversaries who are not smarter than the best AI that can be made using available computational resources. A sufficiently powerful adversary can always compromise your system by just breaking its physical security, and it’s easy to believe that a computational adversary will always be able to do something analogous if they are smarter than you are, even if your system is “secure” in a relatively strong sense. The problem with normal security vulnerabilities is that they don’t require the attacker to be much smarter than the defender.
Wei> Modeling the adversary more realistically (instead of as all powerful) sounds like it ought to make the problem easier, but I don’t see how it actually does. The two ways of making a computational system secure that I know are 1) security proofs and 2) trying to find vulnerabilities before the adversary does and fixing them. With 1) it seems very hard to formally define what “not smarter than the best AI that can be made using available computational resources” means and take advantage of that to make the security proofs easier. With 2) we can’t hope to secure systems against adversaries that are even slightly smarter than us. (Robustness against adversarial inputs requires finding and fixing all the vulnerabilities the adversary could find, but they’re smarter than us so we can’t.)
How do you foresee robustness against adversarial inputs being achieved?
Paul> I think the only live contender is adversarial training, perhaps using something like the idea of red teams + filters described in this post: https://medium.com/ai-control/red-teams-b5b6de33dc76
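(For concreteness, here is a minimal sketch of what a generic red-team-plus-filter training loop could look like; every name in it is a hypothetical placeholder for illustration, not the actual scheme from the linked post.)

```python
# Illustrative sketch only: a generic "red team + filter" adversarial-training loop.
# All names here (agent, filter_model, red_team, judge, and the two training
# functions) are hypothetical placeholders, not interfaces from the linked post.

def adversarial_training_round(agent, filter_model, red_team, judge,
                               train_filter, train_agent):
    # The red team searches for inputs it expects to cause a catastrophic failure.
    candidates = red_team.propose_inputs(agent, filter_model)

    # A human judge labels which candidates actually count as catastrophes,
    # since "security breach" is an intuitive notion rather than a formal one.
    labeled = [(x, judge.is_catastrophic(x, agent.act(x))) for x in candidates]

    # The filter learns to reject inputs resembling the catastrophic ones,
    # and the agent is retrained so that it no longer fails on them.
    filter_model = train_filter(filter_model, labeled)
    agent = train_agent(agent, [x for x, is_bad in labeled if is_bad])

    return agent, filter_model
```

The sketch only shows the shape of the loop that Wei critiques below: the red team can only surface the failure classes it is capable of finding, and the judge can only label the failure classes he is capable of recognizing.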
Wei> I don’t see how the red team approach (which seems to fall under my option 2 above) can make an AI robust against adversaries “not smarter than the best AI that can be made using available computational resources”. A red team or set of red teams has certain capabilities and can find certain classes of vulnerabilities, and thereby help remove those vulnerabilities from your system. But how can you ensure that there aren’t AIs that can be made using available computational resources, with more or different capabilities, that can find additional classes of vulnerabilities?
Additionally, your red team proposal requires a human judge to, essentially, train the red team to distinguish between catastrophic failures (security breaches) and normal operation. Presumably you have to do this because “security breach” is an intuitive notion that’s hard to fully formalize. But there are bound to be catastrophic failures that a human (and hence the red teams that he trains) can’t recognize, and eventually the “best AI” will be able to trigger these. Or, since the training here is bound to be imperfect, there will be catastrophic failures that the red teams will not have learned to recognize, with the same consequences.
But given your latest Aligned Search post, where you wrote “If we can get abstraction+adversarial training working well enough, then we could present abstract versions of these inputs to R, which could then take the time to evaluate them slowly — with no individual step of the abstract evaluation being too far from R’s training distribution,” I think maybe I don’t really understand where you’re going with this latest series of posts and should just wait for more details…
Paul> This week I will put up either one or two directly relevant posts, which will hopefully make this discussion clearer (though we will definitely still have disagreements). I don’t think I have a solution or anything close, but I do think that there is a plausible approach, which we can hopefully either get to work or find to be unworkable.
My hope is to do some kind of robustness amplification (analogous to reliability amplification or capability amplification), which increases the difficulty of finding problematic inputs. We discussed this briefly in the original ALBA post (esp. this comment).
That is, I want to maintain some property like “is OK with high probability for any input distribution that is sufficiently easy to sample from.”
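(One rough way to formalize that property, as a gloss rather than a statement from the linked posts: pick a resource bound s for the adversary, and ask that every input distribution samplable within that bound gives a small failure probability.)

```latex
\[
\Pr_{x \sim S}\bigl[\,\text{the system fails catastrophically on } x\,\bigr]
\;\le\; \varepsilon(s)
\qquad \text{for every sampler } S \text{ computable within resource bound } s .
\]
```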
Reliability amplification increases the success probability for any fixed input, assuming the success probability starts off high enough, and by iterating it we can hopefully get an exponentially good success probability.
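(For intuition, the textbook instance of this kind of amplification is majority voting over independent copies; the actual reliability-amplification scheme may differ, but the arithmetic shows where the exponential improvement comes from.)

```latex
\[
\Pr\bigl[\text{the majority of } k \text{ independent copies errs on a fixed input}\bigr]
\;\le\; \exp\!\Bigl(-2k\bigl(\tfrac{1}{2}-\delta\bigr)^{2}\Bigr)
\qquad \text{when each copy errs with probability } \delta < \tfrac{1}{2},
\]
```

so the failure probability on that input falls exponentially in the number of copies, provided it started below one half.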
Analogously, there are some inputs on which we may always fail. Intuitively we want to shrink the size of the set of bad inputs, assuming only that “most inputs” are initially OK, so that by iterating we can make the bad set exponentially small. I think “difficulty of finding a bad input” is the right way to formalize something like “density of bad inputs.”
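(A toy calculation, again offered only as illustration, of how difficulty relates to density: if the bad set B had density ρ and the adversary could do no better than blind uniform sampling, then)

```latex
\[
\Pr\bigl[\text{a blind sampler hits } B \text{ within } q \text{ uniform queries}\bigr]
\;\le\; q\,\rho,
\qquad \rho = \frac{|B|}{|X|}.
\]
```

Low density already implies high difficulty for blind search; the reason to work with difficulty directly is that a smart adversary may hit a sparse but structured bad set far more easily than uniform sampling would, and that is the quantity robustness amplification needs to drive up.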
This process will never fix the vulnerabilities introduced by the learning process (the kind of thing I’m talking about here). And there may be limits on the maximal robustness of any function that can be learned by a particular model. We need to account for both of those things separately. In the best case robustness amplification would let us remove the vulnerabilities that came from the overseer.
I wrote a post on security amplification, and for background a post on meta-execution.
I have more to say on this topic (e.g. this doesn’t make any use of the assumption that the attacker isn’t too powerful), but I think this post is the most likely point of disagreement.