Control and security
I used to think of AI security as largely unrelated to AI control, and my impression is that some people on this forum probably still do. I’ve recently shifted towards seeing control and security as basically the same, and thinking that security may often be a more appealing way to think and talk about control.
This post fleshes out this view a little bit. I’m interested in any disagreement or pushback.
(This view was in large part absorbed from Ian at OpenAI, but now it feels very natural.)
My basic claims:
The sets {security problems} and {control problems} are basically the same.
Security problems sound less exotic, so we should talk about them that way. And it’s not a sleight of hand or anything: the technical issues really will probably occur first in a security context, and the best near-term analogies for control problems will probably be security problems.
If you want to approximate the correct mindset for control using something that people are familiar with, security is probably your best bet.
This is closely related to MIRI and Eliezer’s enthusiasm about the security mindset. I’m suggesting a somewhat more literal analogy though.
(I’m moving a discussion between Paul and me from Facebook to here. The last three paragraphs at the bottom are new, if you were already following the discussion there.)
Wei> Having a security mindset is a good thing but I really hope AI control is not so analogous to security as to literally require robustness to adversarial inputs (or rather that there is a solution to the AI control problem which does not require this).
Paul> There is a question of how to model the adversary. I would be happy with dealing with adversaries who are not smarter than the best AI that can be made using available computational resources. A sufficiently powerful adversary can always compromise your system by just breaking its physical security, and it’s easy to believe that a computational adversary will always be able to do something analogous if they are smarter than you are, even if your system is “secure” in a relatively strong sense. The problem with normal security vulnerabilities is that they don’t require the attacker to be much smarter than the defender.
Wei> Modeling the adversary more realistically (instead of as all powerful) sounds like it ought to make the problem easier, but I don’t see how it actually does. The two ways of making a computational system secure that I know are 1) security proofs and 2) trying to find vulnerabilities before the adversary does and fixing them. With 1) it seems very hard to formally define what “not smarter than the best AI that can be made using available computational resources” means and take advantage of that to make the security proofs easier. With 2) we can’t hope to secure systems against adversaries that are even slightly smarter than us. (Robustness against adversarial inputs requires finding and fixing all the vulnerabilities the adversary could find, but they’re smarter than us so we can’t.)
How do you foresee robustness against adversarial inputs being achieved?
Paul> I think the only live contender is adversarial training, perhaps using something like the idea of red teams + filters described in this post: https://medium.com/ai-control/red-teams-b5b6de33dc76
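To make that concrete, a minimal sketch of the kind of red-team + filter loop this points at might look like the following; every name and interface below is an illustrative placeholder rather than anything taken from the linked post:

```python
# Illustrative sketch of adversarial training with a red team and a filter.
# `agent`, `red_team`, and `judge` are assumed, hypothetical components;
# the judge stands in for the human-trained filter that labels catastrophes.

def adversarial_training(agent, red_team, judge, n_rounds=1000):
    """Alternate between the red team searching for inputs that make the agent
    behave catastrophically, and the agent training to behave acceptably on
    whatever the red team finds."""
    for _ in range(n_rounds):
        # Red team proposes inputs it believes will make the agent fail.
        candidates = red_team.propose(agent)

        # The filter/judge labels which proposals are actual catastrophic failures.
        failures = [x for x in candidates if judge.is_catastrophic(agent.act(x))]

        # Reward the red team for finding failures, so it keeps searching.
        red_team.update(reward=len(failures))

        # Train the agent to avoid the failures that were found.
        agent.update(bad_inputs=failures)

        if not failures:
            # If the red team can no longer find failures, stop; in practice one
            # would require many consecutive failure-free rounds.
            break
    return agent
```

The load-bearing assumptions are that the red team is at least as good at finding bad inputs as any attacker we care about, and that the judge (ultimately trained by a human) actually recognizes catastrophes; both assumptions are exactly what the rest of this exchange is about.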
Wei> I don’t see how the red team approach (which seems to fall under my 2) can make an AI robust against adversaries “not smarter than the best AI that can be made using available computational resources”. A red team or set of red teams has certain capabilities and can find certain classes of vulnerabilities, and thereby help remove those vulnerabilities from your system. But how can you ensure that there aren’t AIs that can be made using available computational resources, with more or different capabilities, that can find additional classes of vulnerabilities?
Additionally, your red team proposal requires a human judge to, essentially, train the red team to distinguish between catastrophic failures (security breaches) and normal operation. Presumably you have to do this because “security breach” is an intuitive notion that’s hard to fully formalize. But there are bound to be catastrophic failures that a human (and hence the red teams that he trains) can’t recognize, and eventually the “best AI” will be able to trigger these. Or, since the training here is bound to be imperfect, there will be catastrophic failures that the red teams will not have learned to recognize, with the same consequences.
But, given your latest Aligned Search post, where you talked about “If we can get abstraction+adversarial training working well enough, then we could present abstract versions of these inputs to R, which could then take the time to evaluate them slowly — with no individual step of the abstract evaluation being too far from R’s training distribution”, I think maybe I don’t really understand where you’re going with this latest series of posts and should just wait for more details …
This week I will put up either one or two directly relevant posts, which will hopefully make this discussion more clear (though we will definitely still have disagreements). I don’t think I have a solution or anything close, but I do think that there is a plausible approach, which we can hopefully either get to work or find to be unworkable.
My hope is to do some kind of robustness amplification (analogous to reliability amplification or capability amplification), which increases the difficulty of finding problematic inputs. We discussed this briefly in the original ALBA post (esp. this comment).
That is, I want to maintain some property like “is OK with high probability for any input distribution that is sufficiently easy to sample from.”
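As a rough gloss (the exact quantifiers are a guess, not a definition anyone is committed to), that property might be written as: the agent A is (B, ε)-robust if

```latex
% Rough gloss of the target property; "complexity at most B" stands in for
% "sufficiently easy to sample from", and the quantifiers are illustrative.
\Pr_{x \sim D}\!\left[\, A \text{ fails catastrophically on } x \,\right] \;\le\; \epsilon
\quad \text{for every distribution } D \text{ samplable with complexity at most } B.
```

The interesting regime is where B is large enough to cover anything an attacker could plausibly run.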
Reliability amplification increases the success probability for any fixed input, assuming the success probability starts off high enough, and by iterating it we can hopefully get an exponentially good success probability.
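In the simplest instantiation (run the agent k times independently on the fixed input and take a majority vote, assuming a vote makes sense for the task), a standard Hoeffding bound already gives that exponential shape: if each run fails with probability ε < 1/2, then

```latex
% Majority vote over k independent runs, each failing with probability \epsilon < 1/2:
\Pr\!\left[\text{the majority of the } k \text{ runs fail}\right]
  \;\le\; \exp\!\left(-2k\left(\tfrac{1}{2}-\epsilon\right)^{2}\right).
```

This is only a reference point; the actual reliability amplification scheme needn’t be a literal majority vote.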
Analogously, there are some inputs on which we may always fail. Intuitively we want to shrink the size of the set of bad inputs, assuming only that “most inputs” are initially OK, so that by iterating we can make the bad set exponentially small. I think “difficulty of finding a bad input” is the right way to formalize something like “density of bad inputs.”
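In the notation above, robustness amplification would then be a transformation with roughly this shape (illustrative only; I don’t claim to have a construction with this guarantee):

```latex
% Desired shape of robustness amplification: below some threshold \epsilon_0,
% iteration drives the density of bad inputs down exponentially.
A \text{ is } (B,\epsilon)\text{-robust with } \epsilon \le \epsilon_0
  \quad\Longrightarrow\quad
\mathrm{Amplify}^{(k)}(A) \text{ is } (B,\, \epsilon^{\,\Omega(k)})\text{-robust.}
```

Here the “difficulty of finding a bad input” reading corresponds to measuring the bad set’s density only under distributions that are cheap enough to sample from.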
This process will never fix the vulnerabilities introduced by the learning process (the kind of thing I’m talking about here). And there may be limits on the maximal robustness of any function that can be learned by a particular model. We need to account for both of those things separately. In the best case robustness amplification would let us remove the vulnerabilities that came from the overseer.
I wrote a post on security amplification, and for background a post on meta-execution.
I have more to say on this topic (e.g. this doesn’t make any use of the assumption that the attacker isn’t too powerful), but I think this post is the most likely point of disagreement.
This works as a subtle argument for security mindset in AI control (while not being framed as such). One issue is that it might deemphasize some AI control problems that are not analogous to practical security problems, like detailed value elicitation (where in security you formulate a few general principles and then give up). That is, the concept of {AI control problems that are analogous to security problems} might be close enough to the concept of {all AI control problems} to replace it in some people’s minds.
It seems to me like failures of value learning can also be a security problem: if some gap between the AI’s values and the human values is going to cause trouble, the trouble is most likely to show up in some adversarially-crafted setting.
I do agree that this is not closely analogous to security problems that cause trouble today.
I also agree that sorting out how to do value elicitation in the long-run is not really a short-term security problem, but I am also somewhat skeptical that it is a critical control problem. I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior, and a failure of this property (e.g. because the AI has a bad conception of “effective control”) is likely to be a security problem.
This does seem sufficient to solve the immediate problem of AI risk, without compromising the potential for optimizing the world with our detailed values, provided:
The line between “us” that maintain control and the AI design is sufficiently blurred (via learning, uploading, prediction etc., to remove the overhead of dealing with physical humans);
“Behave effectively” includes the capability to disable potential misaligned AIs in the wild;
“Effective control” allows replacing whatever the AI is doing with something else at any level of detail.
The advantage of building detailed values into the AI’s initial design is that it protects the setup from manipulation by the AI. If we don’t do that, the control problem becomes much more complicated. In the approach you are talking about, initially there are no explicitly formulated detailed values, only instrumental skills and humans.
So it’s a tradeoff: solving the value elicitation/use problem makes AIs easier to control, but if it’s possible to control an AI anyway, the problem could initially remain unsolved. I’m skeptical that it’s possible to control an AI other than by giving it completely defined values (so that it learns further details by further examining the fixed definition), if that AI is capable enough to prevent AI risk from other AIs.
After reading your post, I broadly agree that a lot of control problems will show up first as security problems, and that we should frame control problems as security problems more often. I see the examples you’re giving of control problems showing up as security problems, and individually they seem somewhat compelling, but I’m not convinced of the generalization “control problems will generally show up as security problems first”. The main argument you give for this is that adversaries will engineer hard situations that will cause AI systems to fail.
This argument relies on the AI itself not having a privileged position from which to cause bad things to happen (a position not available to adversaries).
Consider an AI that fails only when its input contains a certain password. Adversaries will have trouble synthesizing such an input; there’s an information asymmetry between the AI and outside adversaries. The analogous security exploit (give the AI data that will cause it to fail when you later show it a password) seems hard to engineer, but maybe I’m not thinking of a way to do it?
I also don’t see how this addresses cases where adversarial inputs are computationally (not just informationally) hard to synthesize, such as when an AI fails when its input contains a hash code collision. There could be an analogous security problem but I don’t see a realistic one. (Perhaps you could see this as the AI having a privileged position in selecting the time at which it fails).
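To make the two cases concrete, here is a toy illustration (the code is entirely hypothetical and is only meant to show how hard each trigger would be for an outside red team to synthesize):

```python
import hashlib

# Toy illustration of the two failure modes described above. Both predicates are
# hypothetical stand-ins for "conditions under which the AI behaves badly"; the
# point is only how hard each condition is for an outsider to trigger.

SECRET_PASSWORD = "correct horse battery staple"  # effectively known to the AI, not to outside adversaries

def fails_on_password(user_input: str) -> bool:
    """Information asymmetry: the failure fires only when the input contains a
    secret that outside adversaries (and the red team) do not know, so even a
    very capable search is unlikely to ever trigger it."""
    return SECRET_PASSWORD in user_input

def fails_on_hash_collision(user_input: bytes) -> bool:
    """Computational hardness: the failure fires only when the input encodes two
    distinct strings with the same SHA-256 hash; synthesizing such an input
    means finding a collision, which is believed to be infeasible."""
    try:
        a, b = user_input.split(b"||", 1)
    except ValueError:
        return False
    return a != b and hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
```

In both cases a bad input exists, but the first is protected by a secret the attacker doesn’t have and the second by a computation the attacker can’t feasibly perform.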
This seems to rely on a “no fast local takeoff” assumption. E.g. it deals with the AI persuading the overseer to do bad things the same way it deals with outside attackers persuading the overseer to do bad things (which doesn’t work if there’s fast local takeoff).