I do agree that OT and ICT by themselves, without further premises like “AI safety is hard,” “the people building AI don’t seem to take safety seriously, as evidenced by their public statements and their research allocation,” and “we won’t actually get many chances to fail and learn from our mistakes,” do not establish more than, say, 1% credence in “AI will kill us all,” if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; if you went back in time and asked Bostrom, right before he published the book, whether he agreed with you about the implications of OT and ICT on their own, he would probably have completely agreed. And the text itself seems to agree.
I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over “and various other premises”). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.
I think that “discontinuity + OT + ICT,” rather than “OT + ICT” alone, has typically been presented as the core of the argument. For example, consider the extended summary passage from Superintelligence:
An existential risk is one that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development. Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.
First, we discussed how the initial superintelligence might obtain a decisive strategic advantage. This superintelligence would then be in a position to form a singleton and to shape the future of Earth-originating intelligent life. What happens from that point onward would depend on the superintelligence’s motivations.
Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.
Third, the instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources.
Taken together, these three points thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.
There are some loose ends in this reasoning, and we shall be in a better position to evaluate it after we have cleared up several more surrounding issues. In particular, we need to examine more closely whether and how a project developing a superintelligence might either prevent it from obtaining a decisive strategic advantage or shape its final values in such a way that their realization would also involve the realization of a satisfactory range of human values. (Bostrom, p. 115-116)
If we drop the ‘likely discontinuity’ premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. premises related to mesa-optimization).
I’d also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions (“maximize paperclips,” “make me happy,” etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that system wants to kill you, you may be screwed even if you’re quite careful in various regards (e.g. even if you implement “boxing” strategies). At least if we drop the discontinuity premise, though, I don’t think these premises are compelling enough to bump us up to a high credence in doom.
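To make it concrete why the bottom line is so sensitive to the discontinuity premise, here is a minimal, purely illustrative sketch that treats the overall credence in doom as a chain of conditional probabilities over the premises above. The event labels and every number below are hypothetical placeholders of my own, not estimates from Superintelligence or from anyone in this discussion:

\[
P(\text{doom}) \;\approx\; P(D)\cdot P(M \mid D)\cdot P(S \mid D, M)\cdot P(\text{doom} \mid D, M, S)
\]

where \(D\) is a discontinuous, localized takeoff that gives one system a decisive strategic advantage, \(M\) is that system having a misaligned final goal (the OT-flavored premise), and \(S\) is the system pursuing open-ended resource acquisition at our expense (the ICT-flavored premise). With made-up values such as \(0.4 \times 0.5 \times 0.8 \times 0.7 \approx 0.11\), the conclusion looks alarming; swap the illustrative \(P(D) = 0.4\) for \(0.05\) and the same chain gives roughly \(0.014\), closer to the “1%, if even that” ballpark above.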
Perhaps what is going on here is that the arguments as stated in brief summaries like ‘orthogonality thesis + instrumental convergence’ just aren’t what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments.
This reminds me of Lakatos’ theory of research programs—where the core assumptions, usually logical or a priori in nature, are used to ‘spin off’ secondary hypotheses that are more empirical or easily falsifiable.
Lakatos’ model fits AI safety rather well: OT and ICT are the non-empirical ‘hard core’ assumptions foundational to the research program. In ~2010 the secondary assumptions included discontinuous progress, the claim that an AI maximises a simple utility function, and so on; by ~2020 we have a different set of secondary assumptions: mesa-optimisers, “you get what you measure,” and direct evidence of current misalignment.