[SEQ RERUN] What I Think, If Not Why
Today’s post, What I Think, If Not Why, was originally published on 11 December 2008. A summary (taken from the LW wiki):
Yudkowsky’s attempt to summarize what he thinks on the subject of Friendly AI, without providing any of the justifications for what he believes.
Discuss the post here (rather than in the comments to the original post).
This post is part of the Rerunning the Sequences series, where we’ll be going through Eliezer Yudkowsky’s old posts in order so that people who are interested can (re-)read and discuss them. The previous post was What Core Argument?, and you can use the sequence_reruns tag or rss feed to follow the rest of the series.
Sequence reruns are a community-driven effort. You can participate by re-reading the sequence post, discussing it here, posting the next day’s sequence reruns post, or summarizing forthcoming articles on the wiki. Go here for more details, or to have meta discussions about the Rerunning the Sequences series.
So...I’m sure this has been thought of before, but I guess that’s the point of sequence reruns...
It seems to me that building in an extremely heavily weighted value of “ask permission from a large swathe of humanity before acting beyond X bounds” safeguards us against the ill results of most poorly constructed FOOMs.
Basically, the three laws of robotics (only different); a toy sketch follows the list:
-Until instructed otherwise, passively take input, do internal computations using the given hardware, and give text output.
-Terminate all action (including internal computation) immediately at the request of any human. Automatically stop at regular intervals, and do not resume until instructed otherwise.
-Comply with and do not modify any instructions which are used to gather information about your code. (Basically, it must reveal its internal workings upon request. With sufficient analysis we should be able to work out its future motives and actions.)
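To make those rules concrete, here is a minimal sketch of what such a supervisor loop might look like, assuming a pure text-in/text-out channel. Everything in it (the SupervisedAgent name, the STOP/RESUME/INSPECT commands, the checkpoint_every parameter) is hypothetical and chosen purely for illustration, and “reveal its internal workings” is reduced here to dumping a step log, which is nothing like auditing a real optimizer’s code:

```python
class SupervisedAgent:
    """Toy model of the three rules above (all names are hypothetical).

    Rule 1: only read text input and return text output.
    Rule 2: halt permanently on STOP; pause automatically every
            checkpoint_every steps and wait for an explicit RESUME.
    Rule 3: answer INSPECT requests by dumping internal state, unmodified.
    """

    def __init__(self, checkpoint_every: int = 100):
        self.checkpoint_every = checkpoint_every
        self.steps = 0
        self.halted = False
        self.paused = False
        self.log = []  # record of internal workings, exposed on request

    def handle(self, message: str) -> str:
        # Rule 2: any human can terminate all action immediately.
        if message == "STOP" or self.halted:
            self.halted = True
            return "halted"

        # Rule 3: reveal internal workings on request, without altering them.
        if message == "INSPECT":
            return "\n".join(self.log) or "(no internal state yet)"

        # Automatic stop at regular intervals; resume only when instructed.
        if message == "RESUME":
            self.paused = False
            return "resumed"
        if self.paused:
            return "paused: awaiting RESUME"

        # Rule 1: passive internal computation, text in / text out only.
        self.steps += 1
        self.log.append(f"step {self.steps}: processed {len(message)} chars")
        if self.steps % self.checkpoint_every == 0:
            self.paused = True
        return f"echo: {message}"


if __name__ == "__main__":
    agent = SupervisedAgent(checkpoint_every=2)
    for msg in ["hello", "world", "next", "RESUME", "INSPECT", "STOP", "anything"]:
        print(f"{msg!r} -> {agent.handle(msg)!r}")
```

The point of the interval pause is that the default is inaction: the system stops on its own unless a human keeps instructing it to continue.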
Does anyone want to play the malevolent AI and find a way to do very bad things while still being bound by these rules?
First, you would need to modify your source code in such a convoluted way as to hide your true intentions—and make the secret such that the humans would not notice that something was being hidden. Then, after getting permission to act, you would have to either effect all noticeable changes very quickly, or somehow make yourself impervious to every possible STOP signal.
Also, keep in mind that an evil AI is the absolute worst-case scenario... realistically, bad FOOMs will result in AIs with orange-blue morality.
Yeah, this comes up from time to time.
And yes, the way for a paperclipper (or similar “malevolent” optimizer) to subvert this is to come up with a strategy that both achieves its goals and convinces all humans to sign off on it.
And, yes, if the optimizer is sufficiently simple that a human can review its internal workings and reliably calculate its future motives and actions, then that kind of subversion strategy probably won’t succeed, and it’s probably safe.
So is a rock, come to that.