Inner alignment says, well, it's not exactly like that. There's going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and either of these might fail to match our own.
As I understand the language, the “loss function used to train our AIs” corresponds to “our objective function” from the classical outer alignment problem. The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument” (we can fail due to an inner alignment problem alone, and we can fail due to an outer alignment problem alone).
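To make the outer/inner split concrete, here is a minimal toy sketch (my own illustration, not an example from the discussion; the corridor environment and the “always move right” proxy are made up). The outer objective is the loss we train against; the inner objective is whatever proxy the trained system actually pursues. The two agree on the training distribution but come apart at deployment:

```python
# Toy illustration of outer vs. inner objectives (hypothetical example).
# Setup: a 1-D corridor of length N. During training the goal is always at the
# right end, so a policy that internalized the proxy "always move right" gets
# perfect training loss. At deployment the goal can appear at the left end, and
# the same proxy scores badly even though the outer loss never changed.
import random

N = 10  # corridor length, positions 0..N-1

def outer_loss(final_pos, goal):
    """The loss we actually train against: distance to the goal."""
    return abs(final_pos - goal)

def proxy_policy(pos):
    """The 'inner' objective the agent ended up with: move right whenever possible."""
    return min(pos + 1, N - 1)

def rollout(policy, start, steps=N):
    """Run the policy for a fixed number of steps and return the final position."""
    pos = start
    for _ in range(steps):
        pos = policy(pos)
    return pos

train_goals = [N - 1] * 5          # training distribution: goal always at the right end
deploy_goals = [0, N - 1, 0, 0, N - 1]  # deployment: goal sometimes at the left end

for name, goals in [("train", train_goals), ("deploy", deploy_goals)]:
    losses = [outer_loss(rollout(proxy_policy, random.randrange(N)), g) for g in goals]
    print(name, "mean outer loss:", sum(losses) / len(losses))
```

Both failure modes are possible independently: the outer loss here could itself be a bad proxy for what we actually want (an outer alignment failure), even if the learned policy optimized it faithfully.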
My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes.
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
This is important because if you hold the view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons why the problem will be so hard that we won't solve it during the development of those systems.
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument”
By refinement, I meant that the traditional problem of value alignment was decomposed into two levels, and at both levels, values need to be aligned. I am not quite sure why you have framed this as separate rather than as a refinement.
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
The arguments for why those things pose a risk were the relevant part of the book. Specifically, it argued that because of those factors, and the fact that a single project could gain control of the world, it was important to figure everything out ahead of time rather than waiting until the project was close to completion, because we don't get a second chance.
The analogy of children playing with a bomb is a particular example. If Bostrom had opted to present a gradual narrative, perhaps he would have said that the children will be given increasingly powerful firecrackers and will see the explosive power grow and grow. Or perhaps the sparrows would have trained a population of mini-owls before getting a big owl.
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
I don’t think there’s a single moment that should cause people to panic. Rather, it will be a gradual transition into more powerful technology.
I get the sense that the crux here is more about fast vs. slow takeoffs than about unipolar vs. multipolar scenarios.
In the case of a gradual transition into more powerful technology, what happens when the children of your analogy discover recursive self-improvement?
Even recursive self-improvement can be framed gradually. Recursive technological improvement is thousands of years old: the phenomenon of technology allowing us to build better technology is what has sustained economic growth. Recursive self-improvement is simply a very local form of recursive technological improvement.
You could imagine that systems will gradually get better at recursive self-improvement. Some will improve themselves sort-of well, and these systems will pose risks. Other systems will improve themselves really well, and pose greater risks. But we would see the latter phenomenon coming ahead of time.
And since there's no hard separation between recursive technological improvement and recursive self-improvement, you could imagine technological improvement getting gradually more local, until all the relevant action is from a single system improving itself. In that case, there would also be warning signs before it was too late.
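As a rough illustration of this framing (a toy model of my own, not anything from the book or the comments above): let capability C feed back into its own growth with strength p, so that dC/dt = k·C^p. For p ≤ 1 growth stays gradual or exponential; for p > 1 it eventually goes hyperbolic, but only after a long, visible ramp-up, which is the sense in which there would be warning signs. The values of k, p, and the step size below are arbitrary choices for illustration:

```python
# Toy capability-growth model (hypothetical parameters): dC/dt = k * C**p.
# p < 1: sub-exponential growth; p = 1: exponential; p > 1: hyperbolic blow-up,
# preceded by a long visible ramp-up.

def simulate(p, k=0.05, c0=1.0, dt=0.1, t_max=100.0):
    """Forward-Euler integration of dC/dt = k * C**p; returns (time, capability) samples."""
    c, t, samples = c0, 0.0, []
    while t < t_max and c < 1e6:  # stop once capability 'explodes'
        samples.append((t, c))
        c += k * (c ** p) * dt
        t += dt
    return samples

for p in (0.7, 1.0, 1.3):
    t_end, c_end = simulate(p)[-1]
    print(f"p = {p}: after t = {t_end:5.1f}, capability = {c_end:,.1f}")
```

The point is only that how local and how strong the feedback is determines whether the transition looks like sustained growth or a sudden jump, and in either case the curve is observable well before the end.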
This framing really helped me think about gradual self-improvement, thanks for writing it down!
I agree with most of what you wrote. I still feel that in the case of an AGI rewriting its own code there's some sense of intent that hasn't explicitly been present for the past thousands of years.
Agreed, you could still model humanity as some kind of self-improving Human + Computer Colossus (cf. Tim Urban’s framing) that somehow has some agency. But it’s much less effective at improving itself, and it’s not thinking “yep, I need to invent this new science to optimize this utility function”. I agree that the threshold is “when all the relevant action is from a single system improving itself”.
there would also be warning signs before it was too late
And what happens then? Will we reach some kind of global consensus to stop any research in this area? How long will it take to build a safe “single system improving itself”? How will all the relevant actors behave in the meantime?
My intuition is that in the best-case scenario we reach some kind of AGI Cold War situation that persists for a long time.