If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I’d assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.
I’m not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem). That doesn’t seem to me like meaningful evidence for the proposition “the arguments in Superintelligence are not sound”.
I can’t speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I’ve talked to.
It’s been a while since I listened to the audiobook version of Superintelligence, but I don’t recall the book arguing that the “second‐place AI lab” will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence. And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?
In general, the notion that there will be discontinuities in development is viewed with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).
I don’t recall the book relying on (or [EDIT: with a lot less confidence] even mentioning the possibility of) a discontinuity in capabilities. I believe it does argue that once there are AI systems that can do anything humans can, we can expect extremely fast progress.
I’m not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem)
I would call the inner alignment problem a refinement of the traditional argument from AI risk. The traditional argument was that there was going to be a powerful system that had a utility function it was maximizing and it might not match ours. Inner alignment says, well, it’s not exactly like that. There’s going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.
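To make that distinction concrete, here is a deliberately toy sketch (every name and number below is invented for illustration, not anyone’s actual proposal) of the three levels involved: what we actually care about, the loss function we train on, and the objective the trained system ends up pursuing internally.

```python
# Toy illustration of the two alignment gaps; all quantities are made up.

# Level 0: what we actually care about (not directly measurable).
def human_values(outcome):
    return outcome["wellbeing"]

# Level 1 (outer alignment): the loss we actually train on is a measurable
# proxy, e.g. engagement, which may come apart from wellbeing.
def training_loss(outcome):
    return -outcome["engagement"]

# Level 2 (inner alignment): the trained model ends up pursuing its own
# internal objective, e.g. clicks, which was merely correlated with low
# training loss on the training distribution.
def mesa_objective(outcome):
    return outcome["clicks"]

# Off-distribution, the three levels can rank outcomes differently.
outcomes = {
    "a": {"wellbeing": 1.0, "engagement": 0.2, "clicks": 0.1},
    "b": {"wellbeing": 0.1, "engagement": 0.9, "clicks": 0.3},
    "c": {"wellbeing": 0.0, "engagement": 0.4, "clicks": 1.0},
}
for name, f in [("human values", human_values),
                ("outer objective", lambda o: -training_loss(o)),
                ("inner (mesa) objective", mesa_objective)]:
    best = max(outcomes, key=lambda k: f(outcomes[k]))
    print(f"{name} prefers outcome {best}")
```

The toy example’s only point is that a mismatch can open up between level 0 and level 1, between level 1 and level 2, or both.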
If all the new arguments were mere refinements of the old ones, then my argument would not work. I don’t think that all the new ones are refinements of the old ones, however. As an example, try to map “What failure looks like” onto Nick Bostrom’s model for AI risk. Influence-seeking sorta looks like what Nick Bostrom was talking about, but I don’t think “Going out with a whimper” is what he had in mind (I haven’t read the book in a while though).
It’s been a while since I read Superintelligence, but I don’t recall the book arguing that the “second‐place AI lab” will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence.
My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes, where a single team gains a decisive strategic advantage over the rest of the world (which seems impossible unless a single team surges forward in development). Robin Hanson had the same critique in his review of the book.
And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?
If AI takeoff is more gradual, there will be warning signs for each risk before it unfolds into a catastrophe. Consider any single source of existential risk from AI, and I can plausibly point to a source of sub-existential risk that would occur in less powerful AI systems. If we ignored such a risk, a disaster would occur, but it would be a minor one, and it would set a precedent for taking safety seriously in the future.
This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won’t solve it during the development of those systems.
It’s possible that we don’t have good arguments yet, but good arguments could present themselves eventually and it would be too late at that point to go back in time and ask people in the past to start work on AI safety. I agree with this heuristic (though it’s weak, and should only be used if there are not other more pressing existential risks to work on).
I also agree that there are conceptual arguments for why we should start AI safety work now, and I’m not totally convinced that the future will be either kind or safe for humanity. It’s worth understanding the arguments both for and against AI safety, lest we treat it as a team to be argued for rather than a claim to be evaluated.
Inner alignment says, well, it’s not exactly like that. There’s going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.
As I understand the language, the “loss function used to train our AIs” matches “our objective function” from the classical outer alignment problem. The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument” (we can fail due to an inner alignment problem alone, and we can fail due to an outer alignment problem alone).
My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won’t solve it during the development of those systems.
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
The inner alignment problem seems to me as a separate problem rather than a “refinement of the traditional argument”
By refinement, I meant that the traditional problem of value alignment was decomposed into two levels, and at both levels, values need to be aligned. I am not quite sure why you have framed this as separate rather than as a refinement.
I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart’s law assume or depend on a unipolar outcome?
The arguments for why those things pose a risk were the relevant part of the book. Specifically, it argued that because of those factors, and the fact that a single project could gain control of the world, it was important to figure everything out ahead of time, rather than waiting until the project was close to completion. Because we don’t get a second chance.
The analogy of children playing with a bomb is a particular example. If Bostrom had opted for presenting a gradual narrative, perhaps he would have said that the children will be given increasingly powerful firecrackers and will see the explosive power grow and grow. Or perhaps the sparrows would have trained a population of mini-owls before getting a big owl.
Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?
I don’t think there’s a single moment that should cause people to panic. Rather, it will be a gradual transition into more powerful technology.
I get the sense that the crux here is more between fast / slow takeoffs than unipolar / multipolar scenarios.
In the case of a gradual transition into more powerful technology, what happens when the children of your analogy discover recursive self improvement?
Even recursive self improvement can be framed gradually. Recursive technological improvement is thousands of years old. The phenomenon of technology allowing us to build better technology has sustained economic growth. Recursive self improvement is simply a very local form of recursive technological improvement.
You could imagine systems will gradually get better at recursive self improvement. Some will improve themselves sort-of well, and these systems will pose risks. Some other systems will improve themselves really well, and pose greater risks. But we would have seen the latter phenomenon coming ahead of time.
And since there’s no hard separation between recursive technological improvement and recursive self improvement, you could imagine technological improvement getting gradually more local, until all the relevant action is from a single system improving itself. In that case, there would also be warning signs before it was too late.
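As a purely illustrative toy model (every parameter here is invented), you can picture a single “locality” dial rising gradually from 0, where all improvement comes from the broader technological economy, to 1, where all the relevant improvement comes from the system acting on itself, and watch capability growth speed up smoothly through intermediate, observable stages rather than jumping discontinuously.

```python
# Toy model of increasingly local recursive improvement; all numbers invented.
def simulate(steps=100, econ_rate=0.01, self_rate=0.05):
    capability = 1.0
    for t in range(steps):
        # locality rises from 0 (improvement driven by the wider economy)
        # to 1 (improvement driven by the system improving itself).
        locality = t / steps
        growth = (1 - locality) * econ_rate + locality * self_rate * capability
        capability += growth
        if t % 20 == 0:
            print(f"t={t:3d}  locality={locality:.2f}  capability={capability:.2f}")
    return capability

simulate()
```

Nothing about smoothness in a toy model guarantees safety, of course; the point is only that the intermediate stages are where the smaller, earlier failures would be visible.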
This framing really helped me think about gradual self-improvement, thanks for writing it down!
I agree with most of what you wrote. I still feel that in the case of an AGI re-writing its own code there’s some sense of intent that hasn’t been explicitly present over the past thousands of years.
Agreed, you could still model Humanity as some kind of self-improving Human + Computer Colossus (cf. Tim Urban’s framing) that somehow has some agency. But it’s much less effective at improving itself, and it’s not thinking “yep, I need to invent this new science to optimize this utility function”. I agree that the threshold is “when all the relevant action is from a single system improving itself”.
there would also be warning signs before it was too late
And what happens then? Will we reach some kind of global consensus to stop any research in this area? How long will it take to build a safe “single system improving itself”? How will all the relevant actors behave in the meantime?
My intuition is that in the best scenario we reach some kind of AGI Cold War situation for long periods of time.