I don’t see why “better” is the same as “extremely goal-directed”. There needs to be an argument that optimizer AIs will outcompete other AIs.
First of all, I personally think that “somewhat-but-not-extremely goal-directed” AGIs are probably possible (humans are an example), and that these things can be made both powerful and corrigible—see my post Consequentialism & Corrigibility. I am less pessimistic than Eliezer on this topic.
But then the problems are: (1) The above is just a casual little blog post; we need to do a whole lot more research, in advance, to figure out exactly how to make a somewhat-goal-directed corrigible AGI, if that’s even possible (more discussion here). (2) Even if we do that research in advance, implementing it correctly would probably be hard and error-prone, and if we screw up, the supposedly somewhat-goal-directed AGI will still be goal-directed in enough of the wrong ways that it won’t be corrigible and will try to escape control. (3) Even if some groups are skillfully trying to ensure that their project will result in a somewhat-goal-directed corrigible AGI, there are also people like Yann LeCun who would be doing AGI research and wouldn’t even be trying, because they think that the whole idea of AGI catastrophic risk is a big joke. And so we still wind up with an out-of-control AGI.
I don’t think that’s a good way to think about it.
Start by reading everything on this Gwern list.
As that list shows, it is already true and has always been true that optimization algorithms will sometimes find out-of-the-box “solutions” that are wildly different from what the programmer intended.
What happens today is NOT “the AI does more or less what we want”. Instead, what happens today is that there’s an iterative process where sometimes the AI does something unintended, and the programmer sees that behavior during testing, and then turns off the AI and changes the configuration / reward / environment / whatever, and then tries again.
However, with future AIs, the “unintended behavior” may include the AI hacking into a data center on the other side of the world and making backup copies of itself, such that the programmer can’t just iteratively try again, as they can today.
(Also, the more capable the AI gets, the more such out-of-the-box “solutions” it will be able to find, and the harder it will be for the programmer to anticipate those “solutions” before actually running the AI. Again, programmers are already frequently surprised by their AIs’ out-of-the-box “solutions”; this problem will only get worse as AIs can more skillfully search a broader space of possible plans and actions.)
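To make the flavor of this concrete, here is a minimal toy sketch (my own construction, not an entry from the Gwern list; the “cleaning robot” setup and all names are hypothetical): the designer rewards the robot for how little mess it observes, as a proxy for how little mess actually exists, and even a dumb random search over policies discovers that covering its camera scores better than cleaning.

```python
import random

# Toy illustration of specification gaming (my own construction, not an
# example from the list): a cleaning robot is rewarded for how little mess it
# OBSERVES, as a proxy for how little mess actually exists. A brute-force
# search over policies discovers that covering the camera beats cleaning.

ACTIONS = ["clean_one_item", "do_nothing", "cover_camera"]

def run_episode(policy, steps=10, initial_mess=10):
    """Run one episode; the policy is a short action sequence repeated in a loop."""
    mess = initial_mess        # ground truth: how messy the room really is
    camera_on = True
    total_reward = 0
    for t in range(steps):
        action = policy[t % len(policy)]
        if action == "clean_one_item" and camera_on and mess > 0:
            mess -= 1          # the robot needs its camera to find mess
        elif action == "cover_camera":
            camera_on = False
        observed_mess = mess if camera_on else 0
        total_reward += -observed_mess   # the designer's (mis-specified) reward
    return total_reward, mess

# Dumb optimizer: random search over 3-step policies, keeping the best scorer.
random.seed(0)
best_policy, best_reward = None, float("-inf")
for _ in range(2000):
    candidate = [random.choice(ACTIONS) for _ in range(3)]
    reward, _ = run_episode(candidate)
    if reward > best_reward:
        best_policy, best_reward = candidate, reward

reward, true_mess = run_episode(best_policy)
print("best policy found:", best_policy)
print("proxy reward:", reward, "   true mess remaining:", true_mess)
# The winning policy covers the camera on the first step, earning the maximum
# possible proxy reward while leaving the room exactly as messy as it started.
# The designer wanted a clean room; the optimizer found an out-of-the-box
# "solution" to the reward function as written.
```

Today, the programmer would see this kind of thing during testing, patch the reward (e.g., penalize covering the camera), and try again. The worry above is about the regime where the exploit the AI finds is “escape the programmer’s control”, rather than “cover the camera”.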