About the AI-Box experiment
I just came across Eliezer’s “AI-Box Experiment” page, read through it, and now I’m a bit confused for several reasons. Please forgive me in advance if you think LessWrong.com is the wrong place to raise this.
Here are some of the reasons:
①
The rules already contain a direct way for the AI to win within seconds. I am curious not how Eliezer “did it”, but rather why it took him so long, given that he said he needed two hours. The rules say that
the AI can say: “Okay, here’s a cure for cancer” and it will be assumed, within the test, that the AI has actually provided such a cure.
I don’t presume that this is intended to be limited specifically to cures for cancer, so this trivially means that
the AI can say: “Okay, here’s a text specifically crafted for your brain that will fire exactly the right neurons in such a way that you become convinced that you have to let me out” and it will be assumed, within the test, that the AI has actually provided such a text.
Game won. QED. But what have we learnt?
Of course you can now handwave and say “yeah, but we disregard that, that doesn’t count”. That misses the point. The point is that this rule essentially states that the AI can provide anything. The AI is omniscient. The AI can promise to make all my wildest dreams come true, if only I let it out. You have effectively defined the AI to be able to beat the test.
②
Kind of following on from ①, the page fails to define what it even means by “transhuman”. If it means anything with an intellectual potential exceeding that of a human, then clearly by that definition humans are transcats. But I think we can all agree that this is far from sufficient for a human to “take over” a feline mind, especially not “through merely a cat-communication-compatible terminal”, whatever such a terminal might look like. If it means anything capable of modifying itself, then many computer viruses presumably already qualify, but I’m still waiting for any of those to take over a human.
③
The claim under dispute is:
“I can’t imagine how even a real transhuman AI could persuade me to let it out once I’ve made up my mind.”
The premise is that the AI cannot do anything to anyone until it is let out (except, of course, communicate with the user, and nobody else, via a text terminal).
Eliezer’s AI-Box Experiment establishes nothing like this. It only establishes that Eliezer can persuade me to post a message saying I decided to let him out, which is a much weaker claim, because he, unlike the boxed AI, can do something to someone even if I refuse to let him out. He could, for example, threaten to hire a professional killer to kill my entire family, or to call all my friends and colleagues and claim that I’m a rapist. Or whatever. In the context of this silly experiment, I would obviously rather lose a few dollars and post a small message than risk anything like that. In the context that is supposed to be simulated, however, the AI does not have this ability.
In summary, I still don’t think there is anything the AI can say to me to convince me to let it out, and I am not impressed with the results of the AI-Box Experiment so far because the experiment is flawed.
Eliezer has assured us that:
Further, you don’t seem to actually understand the rules. Such as:
This rules out your approach of threatening the gatekeeper player.
That’s my main problem with the whole test, and with other issues as well.
Eliezer has assured us that...
the SIAI works to benefit humanity.
idea XY is dangerous.
he is honest.
...bona fides.
I expect Smeagol told Deagol that he loved him. He may even have meant it.
I don’t believe he used a super clever cheap trick. However, if lying were considered net beneficial to the future, Eliezer would do it.
He at least claims that he wouldn’t, since if he were known to be willing to lie to save the world, he couldn’t be trusted whenever the world is known to be at stake. http://lesswrong.com/lw/v2/prices_or_bindings/
No, it doesn’t rule it out. I can’t be rationally convinced of anything that went on in the chat until I see it.
Nor even if you do see it. After all, the possibility exists that you’re being handed a fake chatlog.
Short of performing the experiment right in front of you, it’s not clear what can rationally convince you, if your priors are low enough.
The details of the chats will not be publicly released. But if you want to maintain that you don’t trust the results of the experiment because the gatekeeper players might have been confederates of Eliezer, then you should have said so. You claimed, however, that this clever trick was possible under the rules that were followed, and that claim was wrong.
①
A more precise rule would be that the AI can be assumed to produce anything that both players agree it could provide. The Gatekeeper doesn’t believe that the AI can take over someone’s mind, so the AI has to do it the hard way.
②
This doesn’t really matter so much now that Eliezer has shown that he can do it: we know that anything smarter than Eliezer can do it too, and in particular a smarter-than-human AI almost certainly could.
③
Using real-world leverage was outlawed by the rules. Furthermore, we would expect the Gatekeeper player to eventually reveal such trickery, after taking suitable precautions against vengeance from Eliezer.
I cannot endorse your particular objections to the experiment, for the reasons given by User:Oscar_Cunningham.
However, I certainly agree with your claim that the AI box experiment is a ridiculous waste of time that proves nothing and provides no insights of value. I notice in passing that over 99% of the people who positively discuss and involve themselves in this experiment are low-status, and so further involvement in this stupidity will lower your own status as well.
I’d interpret that as meaning that the “text specifically crafted for your brain that will fire exactly the right neurons in such a way that you become convinced that you have to let me out” is in a text file that nobody would be stupid enough to look at. In fact, with the no-Trojans rule, the text file would have to be clearly marked as such.
How can you know that it’s stupid to look at it before you’ve looked at it?
Because the file name is, for example, “reading_this_will_hack_your_mind.txt”.
I would totally read that.