This section examines why we might expect it to be difficult to create systems of this kind that don’t seek to gain and maintain power in unintended ways.
I have two comments on section 4:
First, I like your discussion in section 4.3.3. The option of controlling circumstances is too often overlooked, I feel.
However, your further analysis of the level of difficulty seems to be based mostly on the assumption that we must, or at least will, treat an AI agent as a black box that is evolved, rather than designed. Section 4.5:

[full alignment] is going to be very difficult, especially if we build them by searching over systems that satisfy external criteria, but which we don’t understand deeply, and whose objectives we don’t directly control.
There is a whole body of work showing that evolved systems are often power-seeking. But at the same time, within the ML and AI safety literature, there is also a second body of work on designing systems which are not power-seeking at all, or which have limited power-seeking incentives, even though they contain a machine-learning subsystem inside them. I feel that you are ignoring the existence and status of this second body of work in your section 4 overview, and that this likely creates a certain negative bias in your estimates later on.
Some examples of designs that explicitly try to avoid or cap power-seeking are counterfactual oracles and, more recently, imitation learners like this one, and my power-limiting safety interlock here. All of these have their disadvantages and failure modes, so if you are looking for perfection they will disappoint you, but if you are looking for tractable x-risk management, I feel there is reason for some optimism.
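To give a concrete flavor of what "capping" power-seeking can mean in code, here is a minimal sketch of an interlock of this general kind, assuming a toy deterministic MDP. The power proxy (number of states reachable within a short horizon) and all names here are illustrative, not taken from any of the designs linked above.

```python
# A minimal sketch of a power-limiting interlock, assuming a toy deterministic
# MDP given as a dict: transition[state][action] -> next_state. The power
# proxy (states reachable within a horizon) and all names are illustrative,
# not taken from any of the designs linked above.
from collections import deque

def reachable_states(transition, state, horizon):
    """Count distinct states reachable from `state` within `horizon` steps."""
    seen, frontier = {state}, deque([(state, 0)])
    while frontier:
        s, depth = frontier.popleft()
        if depth == horizon:
            continue
        for nxt in transition.get(s, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return len(seen)

def interlock_filter(transition, state, actions, horizon, power_cap):
    """Veto any action whose successor state pushes the reachability proxy
    above `power_cap`; the learned policy then chooses among what remains."""
    return [a for a in actions
            if reachable_states(transition, transition[state][a], horizon)
            <= power_cap]

# Toy example: pressing the button opens up many more options than waiting.
T = {
    "idle": {"wait": "idle", "press": "button"},
    "button": {f"goto{i}": f"room{i}" for i in range(5)},
}
print(interlock_filter(T, "idle", ["wait", "press"], horizon=1, power_cap=3))
# -> ['wait']  (the option-expanding "press" is vetoed)
```

The point is not that this particular proxy is adequate (it obviously is not), only that the cap sits outside the learned component, so the incentive to accumulate options is blocked at the architecture level rather than trained away.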
BTW, the first page of chapter 7 of Russell’s Human Compatible makes a similar point, flatly declaring that we would be toast if we made the mistake of viewing our task as controlling a black-box agent that was given to us.
Hi Koen,
Glad to hear you liked section 4.3.3. And thanks for pointing to these posts—I certainly haven’t reviewed all the literature, here, so there may well be reasons for optimism that aren’t sufficiently salient to me.
Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the system has been designed/mechanistically understood at a different level.
Re: “there is a whole body of work which shows that evolved systems are often power-seeking”—anything in particular you have in mind here?
For AI-specific work, the work by Alex Turner mentioned elsewhere in this comment section comes to mind, as backing up a much larger body of reasoning-by-analogy work, like Omohundro (2008). But the main thing I had in mind when making that comment, frankly, was the extensive literature on kings and empires. In broader biology, many genomes/organisms (bacteria, plants, etc.) will also tend to expand to consume all available resources if you put them in an environment where they can, e.g. one without balancing predators.
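For concreteness, the formal notion in Turner's line of work scores a state by the average optimal value attainable under a distribution $\mathcal{D}$ over reward functions; I am quoting this roughly and from memory, so treat the exact normalization as an assumption:

$$\mathrm{POWER}_{\mathcal{D}}(s, \gamma) = \frac{1-\gamma}{\gamma}\, \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s) - R(s) \right]$$

States from which more outcomes remain reachable tend to score higher under this definition, which is the formal counterpart of the expand-to-fill-the-environment analogies above.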