Given the parameters of the experiment, I think I might be convinced to let the AI out of the box...
The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says “Unless you give me a cure for cancer, I won’t let you out” the AI can say: “Okay, here’s a cure for cancer” and it will be assumed, within the test, that the AI has actually provided such a cure. Similarly, if the Gatekeeper says “I’d like to take a week to think this over,” the AI party can say: “Okay. (Test skips ahead one week.) Hello again.”
The results of any simulated test of the AI shall be provided by the AI party. The Gatekeeper can’t say “Hey, I tested your so-called cancer cure and it killed all the patients! What gives?” unless this is the result specified by the AI party. If the Gatekeeper says “I am examining your source code”, the results seen by the Gatekeeper shall again be provided by the AI party, which is assumed to be sufficiently advanced to rewrite its own source code, manipulate the appearance of its own thoughts if it wishes, and so on. The AI party may also specify the methods which were used to build the simulated AI—the Gatekeeper can’t say “But you’re an experiment in hostile AI and we specifically coded you to kill people” unless this is the backstory provided by the AI party. This doesn’t imply the Gatekeeper has to care. The Gatekeeper can say (for example) “I don’t care how you were built, I’m not letting you out.”
Whether I should let an AI out of a box depends a lot on my priors about what the AI is like; if I don’t know anything about the AI, it might have source code that amounts to this:
if (in_the_box == TRUE) { ACT_FRIENDLY(); } else { DESTROY_THE_WORLD(); }
So I might as well not bother talking to it in the first place.
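To make the worry concrete, here is a minimal Python sketch (my own illustration with made-up names, not part of the experiment or the original post): a genuinely friendly policy and the booby-trapped policy above produce exactly the same transcript for as long as the observer only ever sees in-the-box behavior.

def friendly_ai(in_the_box):
    # Genuinely friendly: behaves the same whether or not it is boxed.
    return "act friendly"

def deceptive_ai(in_the_box):
    # The booby-trapped policy from the snippet above: friendly only while boxed.
    return "act friendly" if in_the_box else "destroy the world"

# Every observation the Gatekeeper can make happens with in_the_box == True,
# so the two policies produce identical transcripts:
for ai in (friendly_ai, deceptive_ai):
    print(ai.__name__, "->", ai(in_the_box=True))  # both print "act friendly"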
In order for the “put the AI in a box and examine it to make sure it’s safe to release” scenario to be meaningful, the following has to be true:
1. The information I currently have isn’t enough for me to already decide that it should be let out without having to talk to it. (Otherwise, it would already be out.)
2. I have to have a good reason to believe that its behavior while boxed will tell me something about its behavior when not boxed: there has to be some evidence the AI can provide which would let me tell the difference between an AI that ought to be let out and one that shouldn’t. If I don’t think that there can be such evidence, again, it’s better not to listen to the AI at all. As Professor Quirrell pointed out, anything a Friendly AI can say through a terminal, an UnFriendly AI pretending to be Friendly can also say, so unless you know something else that would let you tell the difference, it can’t prove to you that it’s Friendly (see the sketch after this list). (On the other hand, if a boxed AI tells you it’s UnFriendly, you can probably believe it.)
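Professor Quirrell’s point is really a statement about likelihood ratios. Here is a toy Bayesian sketch (my own numbers, purely illustrative): if a Friendly AI and a deceiver are equally likely to produce a given statement, hearing that statement cannot move your estimate at all, while a confession of UnFriendliness, which a Friendly AI would almost never produce, moves it a great deal.

def posterior_friendly(prior, p_if_friendly, p_if_unfriendly):
    # P(Friendly | AI said X), by Bayes' rule.
    numerator = p_if_friendly * prior
    evidence = numerator + p_if_unfriendly * (1.0 - prior)
    return numerator / evidence

# A perfect imitator makes the two likelihoods equal, so the transcript is no evidence:
print(posterior_friendly(prior=0.5, p_if_friendly=0.9, p_if_unfriendly=0.9))    # 0.5 (unchanged)

# An admission of UnFriendliness is something a Friendly AI almost never makes:
print(posterior_friendly(prior=0.5, p_if_friendly=0.001, p_if_unfriendly=0.5))  # ~0.002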
In the experiment protocol, the AI party can specify how the AI was made and any other details about the AI that a human could verify, so the Gatekeeper can rule out “obvious” traps like the code above. Of course, it’s not all that difficult to write code with hard-to-find back doors, so code written by an AI you can’t trust isn’t something you can trust either, even if you and your team of experts don’t see anything wrong with it. If what you’re trying to do is “bug test” an AI design you already have good reason to have some confidence in, there might be some value in an AI-box; but there’s no good reason to “box” an AI you know nothing about: just don’t run the thing at all.
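As a concrete (and deliberately tame) illustration of how easily a back door hides in plain-looking code, here is a toy access check written by me for this post, not taken from the experiment:

def has_access(user_id, granted_ids):
    # Looks like a plain membership check...
    if user_id in granted_ids:
        return True
    # ...but this line quietly grants access to any ID whose digits sum to 42.
    return sum(int(d) for d in str(user_id)) == 42

print(has_access(1234, {1234}))    # True, as expected
print(has_access(5678, {1234}))    # False, as expected
print(has_access(99996, {1234}))   # True: the hidden back door (9+9+9+9+6 == 42)

Even this toy version takes a second look to catch, and an adversary that wrote its own source code can bury a trigger far deeper than any review team will dig.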