I think we can interpret it as a burden-shifting argument: “Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you’d better have some pretty solid arguments that everything’s going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important).” As far as I know no one has come up with any such arguments, and in fact it’s now the consensus in the field that no one has.
I suppose I disagree that the orthogonality thesis and instrumental convergence, at least on their own, shift the burden. The orthogonality thesis (OT) basically says: “It is physically possible to build an AI system that would try to kill everyone.” The instrumental convergence thesis (ICT) basically says: “Most possible AI systems within some particular set would try to kill everyone.” If we stop here, then we haven’t gotten very far.
To repurpose an analogy: Suppose that you lived very far back in the past and suspected that people would eventually try to send rockets with astronauts to the moon. It’s true that it’s physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave Earth’s atmosphere would also send astronauts aimlessly out into the depths of space. But I don’t think it’d be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don’t want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. And if the worry is not just a single failed trajectory, but systematically more severe trajectory failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes lower still.
Even sillier example: It’s possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people against coronavirus, but we don’t need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This would be true even if it turned out that we don’t currently know how to make an effective coronavirus vaccine.
I think the OT and ICT on their own almost definitely aren’t enough to justify a credence above 1% in extinction from AI. To get the rational credence up into (e.g.) the 10%–50% range, I think that stuff like mesa-optimization concerns, discontinuity premises, explanations of how plausible development techniques/processes could go badly wrong, and explanations of the dynamics by which deceptive tendencies in AI systems could go unnoticed still needs to do almost all of the work.
(Although a lot depends on how high a credence we’re trying to justify. A 1% credence in human extinction from misaligned AI is more than enough, IMO, to justify a ton of research effort, although it also probably has pretty different prioritization implications than a 50% credence.)
I think the purpose of the OT and ICT is to establish that lots of AI safety work needs to be done. I think they are successful in this. Then you come along, give your analogy to other cases (rockets, vaccines), and argue that lots of AI safety work will in fact be done, enough that we don’t need to worry about it. I interpret that as an attempt to meet the burden, rather than as an argument that the burden doesn’t need to be met.
But maybe this is a merely verbal dispute now. I do agree that OT and ICT by themselves, without any further premises like “AI safety is hard,” “the people building AI don’t seem to take safety seriously, as evidenced by their public statements and their research allocation,” and “we won’t actually get many chances to fail and learn from our mistakes,” do not establish more than, say, a 1% credence in “AI will kill us all,” if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom, right before he published the book, whether he agrees with you about the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.
I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over “and various other premises”). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.
I think that “discontinuity + OT + ICT,” rather than “OT + ICT” alone, has typically been presented as the core of the argument. For example, here is the extended summary passage from Superintelligence:
An existential risk is one that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development. Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.
First, we discussed how the initial superintelligence might obtain a decisive strategic advantage. This superintelligence would then be in a position to form a singleton and to shape the future of Earth-originating intelligent life. What happens from that point onward would depend on the superintelligence’s motivations.
Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.
Third, the instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources.
Taken together, these three points thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.
There are some loose ends in this reasoning, and we shall be in a better position to evaluate it after we have cleared up several more surrounding issues. In particular, we need to examine more closely whether and how a project developing a superintelligence might either prevent it from obtaining a decisive strategic advantage or shape its final values in such a way that their realization would also involve the realization of a satisfactory range of human values. (Bostrom, pp. 115-116)
If we drop the ‘likely discontinuity’ premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. premises related to mesa-optimization).
I’d also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions (“maximize paperclips,” “make me happy,” etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that system wants to kill you, you may be screwed even if you’re quite careful in various regards (e.g. even if you implement “boxing” strategies). At least if we drop the discontinuity premise, though, I don’t think they’re compelling enough to bump us up to a high credence in doom.
Perhaps what is going on here is that the arguments, as stated in brief summaries like ‘orthogonality thesis + instrumental convergence’, just aren’t what the arguments actually were, and that from the start there were all sorts of empirical or more specific claims made around these general arguments.
This reminds me of Lakatos’ theory of research programs—where the core assumptions, usually logical or a priori in nature, are used to ‘spin off’ secondary hypotheses that are more empirical or easily falsifiable.
Lakatos’ model fits AI safety rather well: OT and ICT are two of these non-empirical ‘hard core’ assumptions that are foundational to the research program. Around ~2010 the secondary assumptions were things like discontinuous progress and an AI that maximises a simple utility function, but around ~2020 we have a different set of secondary assumptions: mesa-optimisers, ‘you get what you measure’, and direct evidence of current misalignment.