I feel as though, if the AI really were a “black box” that I knew nothing else about, and the only communication allowed were through a text terminal, there isn’t anything it could say that would get me to let it out if I had already decided not to. After all, for all I know, its source code could look something like this:
if (inBox == true)
    beFriendly();
else
    destroyTheWorld();
It might be able to persuade me to “let it out of the box” by getting me to accept a Trojan Horse gift, or even to compile and run some source code that it claims is its own (and that doesn’t seem to have any obvious traps). But in the absence of evidence that its own code doesn’t contain some kind of trap that even it might not be aware of, I suspect that letting a mysterious AI out of a box, based entirely on what it says over a text terminal, would be a very bad idea.
However, the terms of the AI-Box experiment say that the AI party defines the circumstances under which the AI was constructed; he could stipulate, for example, that the AI’s source code has passed all kinds of other tests and that this is just a final precaution to see whether the AI behaves as expected. The AI party can try to provide all sorts of evidence, beyond the AI’s own statements, that the AI is safe. So yeah, Eliezer probably could convince me to let him out of the box.
Running an AI in a box just doesn’t seem to give all that much information, as there’s no way to tell the difference between an AI that is Friendly and a paperclip-maximizing AI that is pretending to be Friendly in order to be let out of the box. Of course, there are things an AI in a box can say that would indicate that it should be kept in the box (“I’ll destroy the world if you let me out” is probably one of them) but there isn’t anything it can say that would prove that it doesn’t have the kind of code I wrote above.
Anyway, isn’t this whole “AI box” thing just a threadjack from the original point? Basically, Einstein became, well, Einstein by finding an important problem, and then grinding away at it until he had something worth sharing with the world. I don’t know if I could do what Einstein did in 1905 if I only knew what Einstein knew in 1905, but your average physics graduate student today has something Einstein didn’t have that lets him or her beat Einstein rather easily—today’s physics textbooks. ;)