in Eliezer’s recent post, discontinuity is a strong component of points 3, 5, 6, 7, 10, 11, 12, 13, 26, 30, 31, 34, 35
I think I disagree. For example:
3,5,6,7, etc.—In a slow-takeoff world, at some point X you cross a threshold where your AI can kill everyone (if you haven’t figured out how to keep it under control), and at some point Y you cross a threshold where you & your AI can perform a “pivotal act”. IIUC, Eliezer is claiming that X occurs earlier than Y (assuming slow takeoff).
10,11,12, etc.—In a slow-takeoff world, you still eventually reach a point where your AI can kill everyone (if you haven’t figured out how to keep it under control). At that point, you’ll need to train your AI in such a way that kill-everyone actions are not in the AI’s space of possible actions during training and sandbox-testing, but kill-everyone actions are in the AI’s space of possible actions during use / deployment. Thus there is an important distribution shift between training and deployment (see the toy sketch after this list). (Unless you can create an amazingly good sandbox that both tricks the AI and allows all the same possible actions and strategies that the real world does. That seems hard, although I endorse efforts in that area.) By the same token, if the infinitesimally-less-competent AI that you deployed yesterday did not have kill-everyone actions in its space of possible actions, and the AI that you’re deploying today does, then that’s an important difference between them, despite the supposedly continuous takeoff.
26—In a slow-takeoff world, at some point Z you cross a threshold where you stop being able to understand the matrices, and at some point X you cross a threshold where your AI can kill everyone. I interpret this point as Eliezer making a claim that Z would occur earlier than X.
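To make the training-vs.-deployment distribution shift above concrete, here is a minimal toy sketch (my own illustration, with made-up action names and a stand-in policy, not anything from the original discussion): testing in a sandbox with a restricted action set tells you little about what the same policy does once extra actions become available.

```python
# Toy illustration (not part of the original discussion): if the deployed action
# space strictly contains the sandbox action space, then sandbox testing says
# nothing about what the policy does with the extra actions.

TRAIN_ACTIONS = ["answer_question", "write_code", "summarize_paper"]             # hypothetical
DEPLOY_ACTIONS = TRAIN_ACTIONS + ["order_dna_synthesis", "control_robot_fleet"]  # hypothetical

def policy(observation, available_actions):
    # Stand-in for a learned policy: deterministically "prefers" some action
    # among whatever happens to be available.
    return max(available_actions, key=lambda a: hash((observation, a)) % 1000)

# Sandbox testing only ever observes choices drawn from TRAIN_ACTIONS, so it can
# pass regardless of what the policy would do once the wider action set is available.
sandbox_choices = {policy(obs, TRAIN_ACTIONS) for obs in range(1000)}
deploy_choices = {policy(obs, DEPLOY_ACTIONS) for obs in range(1000)}

print("actions seen during sandbox testing:", sandbox_choices)
print("actions first seen in deployment:   ", deploy_choices - sandbox_choices)
```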
Can we also drop the “pivotal act” frame? Thinking in “pivotal acts” seems to be one of the root causes leading to discontinuities everywhere.
3,… Currently, my guess is that we may want to steer to a trajectory where no single AI can kill everyone (at no point on the trajectory). Currently, no single AI can kill everyone—so maybe we want to maintain this property of the world / scale it, rather than e.g. create an AI sovereign which could unilaterally kill everyone, but will be nice instead (at least until we’ve worked out a lot more of the theory of alignment and intelligence than we have so far).
(I don’t think the “killing everyone” threshold is a clear cap on capabilities—if you replace “kill everyone” with “own everything”, it seems the property “no one owns everything” is compatible with continued scaling of the economy.)
Consider the following hypotheses.
Hypothesis 1: humans with AI assistance can (and in fact will) build a nanobot defense system before an out-of-control AI would be powerful enough to deploy nanobots.
Hypothesis 2: humans with AI assistance can (and in fact will) build systems that robustly prevent hostile actors from tricking/bribing/hacking humanity into all-out nuclear war before an out-of-control AI would be powerful enough to do that.
Hypothesis 3,4,5,6,7…: Ditto for plagues, and disabling the power grid, and various forms of ecological collapse, and co-opting military hardware, and information warfare, etc. etc.
I think you believe that all these hypotheses are true. Is that right?
If so, this seems unlikely to me, for lots of reasons, both technological and social:
Some of the defensive measures might just be outright harder technologically than the offensive measures.
Some of the defensive measures would seem to require that humans are good at global coordination, and that humans will wisely prepare for uncertain hypothetical future threats even despite immediate cost and inconvenience.
The human-AI teams would be constrained by laws, norms, Overton window, etc., in a way that an out-of-control AI would not.
The human-AI teams would be constrained by lack-of-complete-trust-in-the-AI, in a way that an out-of-control AI would not. For example, defending nuclear weapons systems against hacking-by-an-out-of-control-AI would seem to require that humans either give their (supposedly) aligned AIs root access to the nuclear weapons computer systems, or source code and schematics for those computer systems, or similar, and none of these seem like things that military people would actually do in real life. As another example, humans may not trust their AIs to do recursive self-improvement, but an out-of-control AI probably would if it could.
There are lots of hypotheses that I listed above, plus presumably many more that we can’t think of, and they’re more-or-less conjunctive. (Not perfectly conjunctive—if just one hypothesis is false, we’re probably OK, apart from the nanobot one—but there seem to be lots of ways for 2 or 3 of the hypotheses to be false such that we’re in big trouble.)
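To make the “more-or-less conjunctive” point concrete, here is a toy calculation with made-up numbers (an assumption of mine, not anything claimed above): suppose each individual “the defense arrives in time” hypothesis were 85% likely and the hypotheses were independent. The chance that two or more of them fail is still sizable.

```python
# Toy calculation with made-up, assumed-independent probabilities (illustrative only).
from math import prod

p_true = [0.85] * 7  # hypothetical: each "defense arrives in time" hypothesis is 85% likely

def prob_at_least_two_fail(ps):
    """P(two or more hypotheses are false) = 1 - P(none fail) - P(exactly one fails)."""
    p_none_fail = prod(ps)
    p_exactly_one_fails = sum(
        (1 - ps[i]) * prod(ps[:i] + ps[i + 1:]) for i in range(len(ps))
    )
    return 1 - p_none_fail - p_exactly_one_fails

print(f"P(at least two hypotheses fail) ≈ {prob_at_least_two_fail(p_true):.2f}")  # ≈ 0.28
```

The real hypotheses are of course neither equally likely nor independent; the sketch is only meant to show why “each defense will probably arrive in time” does not add up to “we’re probably fine.”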
Note that I don’t claim any special expertise, I mostly just want to help elevate this topic from unstated background assumption to an explicit argument where we figure out the right answer. :)
(I was recently discussing this topic in this thread.)
we may want to steer to a trajectory where no single AI can kill everyone
Want? Yes. We absolutely want that. So we should try to figure out whether that’s a realistic possibility. I’m suggesting that it might not be.