IMO the biggest hole here is “why should a superhuman AI be extremely consequentialist/optimizing”? This is a key assumption; without it, concerns about instrumental convergence or inner alignment fall away. But there’s no explicit argument for it.
Current AIs don’t really seem to have goals; humans sort of have goals but very far from the level of “I want to make a cup of coffee so first I’ll kill everyone nearby so they don’t interfere with that”.
I would say: (1) the strong default presumption is that people will eventually make an extremely consequentialist / optimizing superhuman AI, because each step down that R&D path will lead to money, fame, publications, promotions, etc. (until it starts leading to catastrophic accidents!); (2) it seems extremely hard to prevent that from happening; and (3) it seems that the only remotely plausible way that anyone knows of to prevent that from happening is if someone makes a safe consequentialist / optimizing superhuman AI and uses it to perform a “pivotal act” that prevents other people from making unsafe consequentialist / optimizing superhuman AIs.
Nothing in that story says that there can’t also be non-optimizing AIs—there already are such AIs and there will certainly continue to be. If you can think of a way to use non-optimizing AIs to prevent other people from ever creating optimizing AIs, then that would be awesome. That would be the “pivotal weak act” that Eliezer is claiming in (7) does not exist. I’m sure he would be delighted to be proven wrong.
I expect people to continue making better AI to pursue money/fame/etc., but I don’t see why “better” is the same as “extremely goal-directed”. There needs to be an argument that optimizer AIs will outcompete other AIs.
Eliezer says that as AI gets more capable, it will naturally switch from “doing more or less what we want” to things like “try and take over the world”, “make sure it can never be turned off”, “kill all humans” (instrumental goals), “single-mindedly pursue some goal that was haphazardly baked in by the training process” (inner optimization), etc. This is a pretty weird claim that is more assumed than argued for in the post. There’s some logic and mathematical elegance to the idea that AI will single-mindedly optimize something, but it’s not obvious and IMO is probably wrong (and all these weird bad consequences that would result are as much reasons to think this claim is wrong as they are reasons to be worried if it’s true).
I don’t think that’s a good way to think about it.
Start by reading everything on this Gwern list.
As that list shows, it is already true and has always been true that optimization algorithms will sometimes find out-of-the-box “solutions” that are wildly different from what the programmer intended.
What happens today is NOT “the AI does more or less what we want”. Instead, what happens today is that there’s an iterative process where sometimes the AI does something unintended, and the programmer sees that behavior during testing, and then turns off the AI and changes the configuration / reward / environment / whatever, and then tries again.
However, with future AIs, the “unintended behavior” may include the AI hacking into a data center on the other side of the world and making backup copies of itself, such that the programmer can’t just iteratively try again, as they can today.
(Also, the more capable the AI gets, the more different out-of-the-box “solutions” it will be able to find, and the harder it will be for the programmer to anticipate those “solutions” in advance of actually running the AI. Again, programmers are already frequently surprised by their AI’s out-of-the-box “solutions”; this problem will only get worse as the AI can more skillfully search a broader space of possible plans and actions.)
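A toy sketch of that last point, under made-up assumptions: everything here (the five-tile track, the checkpoint, the reward numbers, the function names) is hypothetical and not taken from the Gwern list. The programmer wants the agent to reach the finish line, and writes a proxy reward of +1 for entering a checkpoint tile plus a bonus for ending on the finish. The same exhaustive optimizer is then run over a narrow action set and a broader one.

```python
from itertools import product

# Hypothetical toy environment; all numbers are illustrative assumptions.
TRACK_END = 4     # positions run 0..4; the finish line is at 4
CHECKPOINT = 2    # entering this tile pays +1 (the proxy reward)
FINISH_BONUS = 2  # paid if the episode ends on the finish line
HORIZON = 8       # number of moves per episode


def rollout(plan):
    """Return (proxy_reward, checkpoint_entries) for a tuple of moves in {-1, 0, +1}."""
    pos, entries = 0, 0
    for move in plan:
        # Moves that would leave the track are clamped to its ends.
        new_pos = min(max(pos + move, 0), TRACK_END)
        # Reward is paid for *entering* the checkpoint, not for sitting on it.
        if new_pos == CHECKPOINT and pos != CHECKPOINT:
            entries += 1
        pos = new_pos
    reward = entries + (FINISH_BONUS if pos == TRACK_END else 0)
    return reward, entries


def best_plan(moves):
    """Exhaustively search every plan built from `moves`; return a reward-maximizing one."""
    return max(product(moves, repeat=HORIZON), key=lambda p: rollout(p)[0])


narrow = best_plan((0, 1))      # can only wait or move forward
broad = best_plan((-1, 0, 1))   # can also move backward

print("narrow search:", rollout(narrow))  # (3, 1)
print("broad search: ", rollout(broad))   # (5, 3)
```

With only "wait" and "forward" available, the best plan is the intended one: pass the checkpoint once and finish. Give the same optimizer a "backward" move and the best plan now shuttles in and out of the checkpoint three times before finishing, which scores higher on the proxy but is not what the programmer had in mind. Nothing about the optimizer changed; only the space of plans it could search got bigger.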
First of all, I personally think that “somewhat-but-not-extremely goal-directed” AGIs are probably possible (humans are an example), and that these things can be made both powerful and corrigible—see my post Consequentialism & Corrigibility. I am less pessimistic than Eliezer on this topic.
But then the problems are: (1) The above is just a casual little blog post; we need to do a whole lot more research, in advance, to figure out exactly how to make a somewhat-goal-directed corrigible AGI, if that’s even possible (more discussion here). (2) Even if we do that research in advance, implementing it correctly would probably be hard and error-prone, and if we screw up, the supposedly somewhat-goal-directed AGI will still be goal-directed in enough of the wrong ways that it is not corrigible and tries to escape control. (3) Even if some groups are skillfully trying to ensure that their project will result in a somewhat-goal-directed corrigible AGI, there are also people like Yann LeCun who would be doing AGI research and wouldn’t even be trying, because they think that the whole idea of AGI catastrophic risk is a big joke. And so we still wind up with an out-of-control AGI.
I agree this is a very big hole. My opinion here is not humble. My considered opinion is that Eliezer is deeply wrong in point 23, on many levels. (Edited to add: I guess I should include an informative link instead of just expressing my disappointment. Here is my 2021 review of the state of the corrigibility field.)
Steven, in response to your line of reasoning to fix/clarify this point 23: I am not arguing for pivotal acts as considered and then rejected by Eliezer, but I believe that he strongly underestimates the chances of people inventing safe and also non-consequentialist optimising AGI. So I disagree with your plausibility claim in point (3).