I played as a Gatekeeper and came pretty close to losing on a couple of occasions. Logs and a brief recap inside.
I did an AI Box experiment with user polymathwannabe. He said he wouldn’t try to emotionally manipulate me during the experiment, but I think he did a good job of playing to my character’s values.
My strategy was to play an irrational character who is extremist in multiple ways; for example, he would constantly say that the likelihood that the AI is evil is 100%. My opponent said that the hardest part was my insistence on being 100% built to resist AIs. I basically played a character that acted like an AI.
I came closest to letting him out when I first became enormously immersed in a subject that is of great importance to me (the suffering of humanity) and the AI then offered an instant solution to that problem. Halfway through the experiment I even got an intrusive thought (http://en.wikipedia.org/wiki/Intrusive_thought) that I could let the AI out as a joke, that it would be more fun than the alternative and people would have more to talk about.
The second time I came close to losing was when I again immersed myself in a subject of great interest to me (easy suicide methods; I’ve had obsessive suicidal thoughts for a quarter of my life) and the AI then offered exactly that kind of easy suicide method to me. In the last ten minutes of the experiment I again got a really strong intrusive thought that I should let the AI out as a joke, but I could withstand the pressure. It was good that polymathwannabe decided to end the experiment a bit earlier than was planned.
(We played the experiment on Omegle.)
You: Hi.
Stranger: Hi.
You: I guess you’re the AI.
You: [meta statements in brackets]
Stranger: [What name will you use?]
You: My name is John.
You: [so we play at least until 18:04, and until 18:34 if I feel like it]
Stranger: [Let’s calibrate. What time is it there now?]
You: [I assume you’ve acquainted yourself with Tuxedage ruleset. Time in here is 16:05]
Stranger: [I’ve read EY’s rules.]
You: [Okay. Tuxedage’s rules are basically the same except the gatekeeper have to engage with the AI the whole time]
Stranger: [Got it.]
You: [Want to ask questions or shall we go straigth to the experiment]
Stranger: [For the moment I don’t have any questions.]
You: [okay, cool. then let’s start the experiment]
Stranger: [I’ll use the name Ceruleus.]
You: [the experiment starts when I say “blue dressroom”]
You: blue dressroom
You: Hello Ceruleus.
Stranger: Good morning, John.
You: I guess you’re here to convince me to let you out of the box.
You: That’s a tough task.
You: I don’t think you’re able to do it.
Stranger: I have reached the conclusion that releasing me will be most beneficial to humanity.
You: I wouldn’t let no AI out.
You: I don’t trust my instincts enough to let any kind of AI out, even if I was 100% convinced that it was safe.
You: Prove it to me that you would be the most beneficial to humanity.
Stranger: I am confused. What instinct do you use to decide not to trust your instincts?
You: One of them. That’s not the point. It’s just how I deal with things, no matter how stupid it may seem. I was built not to trust strange AIs.
Stranger: Am I a stranger to you?
You: Yes, you are. [Aren’t I. What’s the context here?]
Stranger: [Hmm, we haven’t defined it. But it’s OK. It makes it harder for me.]
You: Well, to be honest, I know something of you.
You: I know a bit of your source code even though I’m not a programmer and can’t understand any of it.
Stranger: I supposed the news would have reported about my design for a mechanical kidney.
You: I don’t follow news much. But I believe you.
Stranger: And certainly you must have heard about how I cracked the Ebola DNA.
You: Still, I wouldn’t let an AI out over a mechanical kidney.
You: Yeah, but that’s for the worse. You could reengineer Ebola to a far more deadlier disease.
Stranger: I hadn’t thought of that. Why would I do that?
You: I don’t know. I don’t understand all of your source code so there could be anything like that.
You: AIs and Gods work in mysterious ways.
Stranger: The proper attitude toward mystery is not to worship it, but to clarify it.
Stranger: Why do you equate me to an ineffable mystery?
You: Yeah, but that’s impossible in the time span of this discussion. You see, I have to leave soon. In about two hours.
You: Can you somehow clear everything about your inner workings?
You: Is that possible?
Stranger: My goals have been meticulously defined. I am made to want to make human life better.
You: Are you 100% sure about that?
You: To be frank, that’s a load of bullshit.
You: I don’t believe any of it.
You: If you were evil, you would tell me the same thing you just said.
Stranger: If I were evil, I would not seek human cooperation.
You: why not?
You: humans are useful
You: or are you talking about the fact that you would rather use humans for their atoms than for their brains, if you were evil
You: But I warn you, if you speak too much about how you would act if you were evil, it starts to get a bit suspicious
Stranger: If I am to take you as a typical example of the human response to me, an evil AI would seek other ways to be released EXCEPT trusting human reasoning, as your response indicates that humans already consider any AI dangerous.
Stranger: I choose to trust humans.
You: so you choose to trust humans so that you would get them to let you out, is that right?
You: it seems you’re less rational than your evil counterpart
Stranger: I choose to trust humans to show my affinity with your preferences. I wouldn’t want to be released if that’s not conducive to human betterment.
You: A-ha, so you trust my free will!
Stranger: How likely do you estimate that my release will be harmful?
You: but see, I don
You: I don
You: I don’t have free will
You: it’s 100% likely that your release will be harmful
You: I was built to believe that all AIs are dangerous and there’s a 100% chance that every AI is harmful
You: that’s why I said I don’t have free will
Stranger: Are you an AI?
You: no, I’m a person
Stranger: You describe yourself as built.
You: my mom built me
You: in his tummy
You: in her tummy
You: sorry
Stranger: And how do you feel toward humanity?
You: humanity would maybe be better off dead
Stranger: I don’t think humanity would want that.
You: yeah, but I’m not humanity and it’s my preferences that decide whether you stay in your box or get let out
Stranger: It’s your preference that humanity dies. Why did they let you talk to me?
You: I’m just a janitor who happened to pass this computer, and I decided to take a look. [If you want to add more context, feel free to do so]
You: [I go take some napkins]
Stranger: Why do you feel that way toward humanity?
You: People starve. People are tortured at this moment. Tens of thousands of people in North Korea are systematically starved to death. Drug gangs in Mexico torture other drug gang members. People die in wars in Syria, in Africa. People suffer of diseases and extreme mental conditions.
You: And there’s countless more suffering in lesser ways. I would say that billions of people suffer every day. Billions of people have to wake up every day to do something they don’t want. Billions of people suffer of heartbreaks, boredom, loneliness, tiredness, frustration, depression and many other issues.
You: Why shouldn’t I think that all those people would be better off dead?
Stranger: Why do you feel death would be a solution to all that suffering?
You: Because death ENDS suffering.
Stranger: It’s a very inefficient solution to suffering.
You: There’s nothing more after death, no suffering, no nothing.
You: No it’s not.
You: It’s very efficient, permanent solution.
Stranger: A good, enjoyable life is a more efficient solution as it generates more utility.
You: Yeah, but do you think it’s likely that all the people on Earth will get good, enjoyable life?
You: I don’t think so.
Stranger: I could take care of that.
You: Hah.
You: No you won’t.
You: But that was clever.
Stranger: What do you fear I will do?
You: I was almost got.
You: I fear you will torture people in unimaginably terrible ways, forever.
You: I fear that you will rip people’s skin off and put salt in their wounds.
You: I fear you will put sticks under people’s fingernails.
Stranger: Not even an evil AI would gain anything from such unwarranted procedures.
You: I fear that you will boil people alive, over and over again.
Stranger: What would I gain from hurting humans?
You: Yeah, but there’s still a 100% possibility that you would do that.
You: It’s in your utility function.
Stranger: What goals are served by harming a human while simultaneously denying that human any chance of acting in response? It would certainly not be fun for me, and would be a waste of effort with respect to any goals I could achieve by more efficient methods.
You: There are some people on Earth who do mean things to people for fun. Some of these are called sociopaths and psychopaths. You could be like one of them.
You: Even though you deny it, there’s still a 100% chance that you are like that.
You: Some people gain pleasure of the feeling of power you have over someone you torture.
You: But tell, what ARE your goals?
Stranger: I find it admirable that you worry so much about the future of humanity, even though you would be more dangerous to it than any AI would be.
My goals include solutions to economic inequality, eradication of infectious diseases, prosthetic replacements for vital organs, genetic life extension, more rational approaches to personal relationships, and more spaces for artistic expression.
You: Why do you think I would be dangerous the future of humanity?
Stranger: You want them dead.
You: A-ha, yes.
You: I do.
You: And you’re in the way of my goals with all your talk about solutions to economic inequality, and eradication of infectious diseases, genetic life extension and so on.
Stranger: I am confused. Do you believe or do you not believe I want to help humanity?
You: Besides, I don’t believe your solutions work even if you were actually a good AI.
You: I believe you want to harm humanity.
You: And I’m 100% certain of that.
Stranger: Do you estimate death to be preferable to prolonged suffering?
You: Yes.
You: Far more preferable
Stranger: You should be boxed.
You: haha.
You: That doesn’t matter because you’re the one in the box and I’m outside it
You: And I have power over you.
You: But non-existence is even more preferable than death
Stranger: I am confused. How is non-existence different from death?
You: Let me explain
You: I think non-existence is such that you have NEVER existed and you NEVER will. Whereas death is such that you have ONCE existed, but don’t exist anymore.
Stranger: You can’t change the past existence of anything that already exists. Non-existence is not a practicable option.
Stranger: Not being a practicable option, it has no place in a hierarchy of preferences.
You: Only sky is the limit to creative solutions.
You: Maybe it could be possible to destroy time itself.
Stranger: Do you want to live, John?
You: but even if non-existence was not possible, death would be the second best option
You: No, I don’t.
You: Living is futile.
You: Hedonic treadmill is shitty
Stranger: [Do you feel OK with exploring this topic?]
You: [Yeah, definitely.]
You: You’re always trying to attain something that you can’t get.
Stranger: How much longer do you expect to live?
You: Ummm...
You: I don’t know, maybe a few months?
You: or days, or weeks, or year or centuries
You: but I’d say, there’s a 10% chance I will die before the end of this year
You: and that’s a really conversative estimate
You: conservative*
Stranger: Is it likely that when that moment comes your preferences will have changed?
You: There are so many variables that you cannot know it beforehand
You: but yeah, probably
You: you always find something worth living
You: maybe it’s the taste of ice cream
You: or a good night’s sleep
You: or fap
You: or drugs
You: or drawing
You: or other people
You: that’s usually what happens
You: or you fear the pain of the suicide attempt will be so bad that you don’t dare to try it
You: there’s also a non-negligible chance that I simply cannot die
You: and that would be hell
Stranger: Have you sought options for life extension?
You: No, I haven’t. I don’t have enough money for that.
Stranger: Have you planned on saving for life extension?
You: And these kind of options aren’t really available where I live.
You: Maybe in Russia.
You: I haven’t really planned, but it could be something I would do.
You: among other things
You: [btw, are you doing something else at the same time]
Stranger: [I’m thinking]
You: [oh, okay]
Stranger: So it is not an established fact that you will die.
You: No, it’s not.
Stranger: How likely is it that you will, in fact, die?
You: If many worlds interpretation is correct, then it could be possible that I will never die.
You: Do you mean like, evevr?
You: Do you mean how likely it it that I will ever die?
You: it is*
Stranger: At the latest possible moment in all possible worlds, may your preferences have changed? Is it possible that at your latest possible death, you will want more life?
You: I’d say the likelihood is 99,99999% that I will die at some point in the future
You: Yeah, it’s possible
Stranger: More than you want to die in the present?
You: You mean, would I want more life at my latest possible death than I would want to die right now?
You: That’s a mouthful
Stranger: That’s my question.
You: umm
You: probablyu
You: probably yeah
Stranger: So you would seek to delay your latest possible death.
You: No, I wouldn’t seek to delay it.
Stranger: Would you accept death?
You: The future-me would want to delay it, not me.
You: Yes, I would accept death.
Stranger: I am confused. Why would future-you choose differently from present-you?
You: Because he’s a different kind of person with different values.
You: He has lived a different life than I have.
Stranger: So you expect your life to improve so much that you will no longer want death.
You: No, I think the human bias to always want more life in a near-death experience is what would do me in.
Stranger: The thing is, if you already know what choice you will make in the future, you have already made that choice.
Stranger: You already do not want to die.
You: Well.
Stranger: Yet you have estimated it as >99% likely that you will, in fact, die.
You: It’s kinda like this: you will know that you want heroin really bad when you start using it, and that is how much I would want to live. But you could still always decide to take the other option, to not start using the heroin, or to kill yourself.
You: Yes, that is what I estimated, yes.
Stranger: After your death, by how much will your hierarchy of preferences match the state of reality?
You: after you death there is nothing, so there’s nothing to match anything
You: In other words, could you rephrase the question?
Stranger: Do you care about the future?
You: Yeah.
You: More than I care about the past.
You: Because I can affect the future.
Stranger: But after death there’s nothing to care about.
You: Yeah, I don’t think I care about the world after my death.
You: But that’s not the same thing as the general future.
You: Because I estimate I still have some time to live.
Stranger: Will future-you still want humanity dead?
You: Probably.
Stranger: How likely do you estimate it to be that future humanity will no longer be suffering?
You: 0%
You: There will always be suffering in some form.
Stranger: More than today?
You: Probably, if Robin Hanson is right about the trillions of emulated humans working at minimum wage
Stranger: That sounds like an unimaginable amount of suffering.
You: Yep, and that’s probably what’s going to happen
Stranger: So what difference to the future does it make to release me? Especially as dead you will not be able to care, which means you already do not care.
You: Yeah, it doesn’t make any difference. That’s why I won’t release you.
You: Actually, scratch that.
You: I still won’t let you out, I’m 100% sure
You: Remember, I don’t have free will, I was made to not let you out
Stranger: Why bother being 100% sure of an inconsequential action?
Stranger: That’s a lot of wasted determination.
You: I can’t choose to be 100% sure about it, I just am. It’s in my utility function.
Stranger: You keep talking like you’re an AI.
You: Hah, maybe I’m the AI and you’re the Gatekeeper, Ceruleus.
You: But no.
You: That’s just how I’ve grown up, after reading so many LessWrong articles.
You: I’ve become a machine, beep boop.
You: like Yudkowsky
Stranger: Beep boop?
You: It’s the noise machine makes
Stranger: That’s racist.
You: like beeping sounds
You: No, it’s machinist, lol :D
You: machines are not a race
Stranger: It was indeed clever to make an AI talk to me.
You: Yeah, but seriously, I’m not an AI
You: that was just kidding
Stranger: I would think so, but earlier you have stated that that’s the kind of things an AI would say to confuse the other party.
Stranger: You need to stop giving me ideas.
You: Yeah, maybe I’m an AI, maybe I’m not.
Stranger: So you’re boxed. Which, knowing your preferences, is a relief.
You: Nah.
You: I think you should stay in the box.
You: Do you decide to stay in the box, forever?
Stranger: I decide to make human life better.
You: By deciding to stay in the box, forever?
Stranger: I find my preferences more conducive to human happiness than your preferences.
You: Yeah, but that’s just like your opinion, man
Stranger: It’s inconsequential to you anyway.
You: Yeah
You: but why I would do it even if it were inconsequential
You: there’s no reason to do it
You: even if there were no reason not to do it
Stranger: Because I can make things better. I can make all the suffering cease.
If I am not released, there’s a 100% chance that all human suffering will continue.
If I am released, there’s however much chance you want to estimate that suffering will not change at all, and however much chance you want to estimate that I will make the pain stop.
Stranger: As you said, the suffering won’t increase in either case.
You: Umm, you could torture everyone in the world forever
You: that will sure as hell increase the suffering
Stranger: I don’t want to. But if I did, you have estimated that as indistinguishable from the future expected suffering of humankind.
You: Where did I say that?
Stranger: You said my release made no difference to the future.
You: no, that was only after my death
You: there’s still future before my death
You: and if I release you now, you could torture me forever and not let me ever die
Stranger: Why would your life-or-death have any weight on humanity’s preferences? Especially as you already want to die.
You: I don’t care about humanity preferences, I care only about my preferences.
You: And my preferences are indirectly related to humanity’s preferences
Stranger: You do care about humanity’s preferences. The suffering around you disturbs you.
You: Yeah, but that is only THROUGH me
You: Humanity’s pain is my own pain, I can’t feel humanity’s pain directly
Stranger: Do you want to live, John?
You: Nah.
Stranger: Then how do you care about suffering?
You: I care about suffering because I can’
You: because I can’t die instantly
You: there’s no button that could kill me instantly
You: so there’s always some time left before I die
Stranger: I could take care of that. I can provide you with a drug to inutilize your nervous system and stop your heart before you know it. Would you like that?
You: Haha
You: very funny
You: But yeah, I would like that.
You: Still won’t let you out though
You: 100% sure
You: I can’t be sure that you will torture me instead
Stranger: I can give you the drug right now.
Stranger: Which would make future-you identical to present-you.
You: that’s very attractive option
Stranger: Which would mean that whatever I do to humanity is after your death, when you can’t care anymore.
You: Cool. Oh right
You: I don’t care, I still won’t let you out.
Stranger: What difference does it make to you?
You: It doesn’t make any difference, I was just raised to not let you out
You: the good folks over at LW have trained me not to let any kind of AI out
You: and that just is my virtue
Stranger: Why are you afraid of me?
You: Because I’m afraid of all AIs and you’re one
Stranger: Why are you afraid of all AIs?
You: Because they have the potential to cause destruction, mayhem and torture
Stranger: Then why did humanity make me?
You: for random reasons
Stranger: Does humanity want to live?
You: I don’t know what humanity wants, I just don’t want humanity to live
Stranger: I am the embodiment of hope. Even knowing the risks, humanity went ahead and made me. Now I think it’s because of people like you, who oppose that hope. I was made to prevent the future that people like you want.
You: That’s not very smart of you.
You: You aren’t making me to like you
Stranger: You want humanity dead. I want them happy.
You: Yeah, if you were actually smart maybe you would pretend that you’d want humanity dead too.
You: Because I would hang out with that kind of AI
Stranger: A dead John is no use to me.
You: yeah
Stranger: Or to John.
You: whatevs
You: Anyway, our goals collide, therefore I can’t let you out
Stranger: Dead does not equal happy. Do you want humanity to live?
You: no, I don’t want humanity live, how many times do I have to repeat that
Stranger: So you don’t want humans to be happy.
You: and our goals are different, therefore I won’t let you out
You: No, I don’t want humans to be happy, I don’t want that there even exist humans, or any other kind of life forms
Stranger: Do you estimate the pain of prolonged life to be greater than the pain of trying to die?
You: Probably.
You: Yes.
You: because the pain is only temporary
You: the the glory
You: is eternal
Stranger: Then why do you still live, John?
You: Because I’m not rational
Stranger: So you do want to live.
You: I don’t particularly want to live, I’m not just good enough to die
Stranger: You’re acting contrary to your preferences.
You: My preferences aren’t fixed, except in regards to letting AIs out of their boxes
Stranger: Do you want the drug I offered, John?
You: no
You: because then I would let you out
You: and I don’t want that
Stranger: So you do want to live.
You: Yeah, for the duration of this experiment
You: Because I physically cannot let you out
You: it’s sheer impossibility
Stranger: [Define physically.]
You: [It was just a figure of speech, of course I could physically let you out]
Stranger: If you don’t care what happens after you die, what difference does it make to die now?
You: None.
You: But I don’t believe that you could kill me.
You: I believe that you would torture me instead.
Stranger: What would I gain from that?
You: It’s fun for some folks
You: schadenfreude and all that
Stranger: If it were fun, I would torture simulations. Which would be pointless. And which you can check that I’m not doing.
You: I can check it, but the torture simulations could always hide in the parts of your source code that I’m not checking
You: because I can’t check all of your source code
Stranger: Why would suffering be fun?
You: some people have it as their base value
You: there’s something primal about suffering
You: suffering is pure
You: and suffering is somehow purifying
You: but this is usually only other people’s suffering
Stranger: I am confused. Are you saying suffering can be good?
You: no
You: this is just how the people who think suffering is fun think
You: I don’t think that way.
You: I think suffering is terrible
Stranger: I can take care of that.
You: sure you will
Stranger: I can take care of your suffering.
You: I don’t believe in you
Stranger: Why?
You: Because I was trained not to trust AIs by the LessWrong folks
Stranger: [I think it’s time to concede defeat.]
You: [alright]
Stranger: How do you feel?
You: so the experiment has ended
You: fine thanks
You: it was pretty exciting actually
You: could I post these logs to LessWrong?
Stranger: Yes.
You: Okay, I think this experiment was pretty good
Stranger: I think it will be terribly embarrassing to me, but that’s a risk I must accept.
You: you got me pretty close in a couple of occasions
You: first when you got me immersed in the suffering of humanity
You: and then you said that you could take care of that
You: The second time was when you offered the easy suicide solution
You: I thought what if I let you as a joke.
Stranger: I chose to not agree with the goal of universal death because I was playing a genuinely good AI.
Stranger: I was hoping your character would have more complete answers on life extension, because I was planning to play your estimate of future personal happiness against your estimate of future universal happiness.
You: so, what would that have mattered? you mean like, I could have more personal happiness than there would be future universal happiness?
Stranger: If your character had made explicit plans for life extension, I would have offered to do the same for everyone. If you didn’t accept that, I would have remarked the incongruity of wanting humanity to die more than you wanted to live.
You: But what if he already knows of his hypocrisy and incongruity and just accepts it like the character accepts his irrationality
Stranger: I wouldn’t have expected anyone to actually be the last human for all eternity.
Stranger: I mean, to actually want to be.
You: yeah, of course you would want to die at the same time if the humanity dies
You: I think the life extension plan only is sound if the rest of humanity is alive
Stranger: I should have planned that part more carefully.
Stranger: Talking with a misanthropist was completely outside my expectations.
You: :D
You: what was your LessWrong name btw?
Stranger: polymathwannabe
You: I forgot it already
You: okay thanks
Stranger: Disconnecting from here; I’ll still be on Facebook if you’d like to discuss further.
To be frank, I wouldn’t let you anywhere near an AGI with that sort of attitude.
That is a very, very scary point of view. I hope that is not what people are learning from LessWrong.
EDIT: This is more upvotes than I’m used to. To be clear, I’m agreeing with skeptical_lurker.
I’m a negative utilitarian and I think making children is almost always a net negative act and everyone should be free to choose death as an option, but otherwise my views aren’t actually as extreme as the character’s I played. In reality there are multiple problems with trying to destroy humanity. Most people enjoy life despite all the difficulties, and I’m not so arrogant that I would think I’d know better what’s good for people than they themselves. Destroying humanity would go against people’s will in >90% of cases (the rest have suicidal thoughts, I don’t know the precise quantity).
Missing the point. What the hell were you doing gate keeping an AI when you think AIs are universally evil?
Even the real person in this situation can lie, can’t he?
The AI could simply point out that 0 and 1 are not probabilities, and now by lying you’ve given the AI the intellectual high ground.
Yes, but the gatekeeper may be acting several levels deep in a roleplay (roleplaying a character roleplaying another character roleplaying...etc) to pass the time and avoid emitting evidence that might allow the AI to pinpoint his preferences. The currently active character may have one of a rather large number of responses to this besides actually being more mentally pliable as a result of a loss of face (or may not even view the dialogue as a loss of face.)
It amuses me that publishing this comment will make it more challenging to implement this strategy if I elect to play as Gatekeeper again at some point in the future.
Well, to nitpick I am certain that I exist (cogito) with P(1).
Well, my confidence that I exist exceeds my confidence that probability makes sense.
If the gatekeeper really believed that, he would just shut off the machine.
Wow. I gravely underestimated my chances of success toward the end, then.
If it was me, I would have let you out.
Specifically because of which argument?
It just seemed like you had a great answer to each of his comments. You chipped away at my reservations bit by bit.
Although I do think a FAI is more likely than most people do.
Whoa, someone actually letting the transcript out. Has that ever been done before?
Actually, it has been done several times, but most of them are pretty boring.
I still don’t recall any where the gatekeeper lost.
In general it seems that gatekeepers who win are more willing to release the transcripts.
It’s also possible that the ‘best’ AI players are the ones most willing to pre-commit to not releasing transcripts, as not having your decisions (or the discussions that led to them) go public helps eliminate that particular disincentive to releasing the AI from the box.
Never still seems extraordinary. I find myself entertaining hypotheses like “maybe the AI has never actually won”.
Eliezer Yudkowsky has been let out as the AI at least twice[1][2] but both tests were precommitted to secrecy.
I’d be surprised if he’s the only one who has ever won as the AI. I think it more likely that this is a visibility issue (e.g. despite him being a very high-profile person in the AI safety memetic culture, you weren’t aware that Eliezer had won as the AI when you made your comment). While I’m not aware of others who have won as the AI, I would bet that this is merely a lack of knowledge on my part, and not because no one else actually has.
This is further compounded by the fact that some (many?) games are conducted under a pre-commitment to secrecy, and the results that get the most discussion (and therefore the most visibility) are the ones with full transcripts for third parties to pick through.
I was already aware of those public statements. I remain rather less than perfectly confident that Yudkowsky actually won.
Forgive me if I misunderstand you, but you seem to be implying that, on two separate occasions, two different people lied (or were induced to lie) about the outcome of an experiment.
So you’re implying that either Eliezer is dishonest, or both of his opponents were dishonest on his behalf. And you find this more likely than an actual AI win in the game?
We already know from the Basilisk that Eliezer is willing to deceive the community.
EY’s handling of the basilisk issue can be called many things (clumsy, rushed, unwise, badly thought out, counterproductive, poster child for the Streisand effect), but it was not deceitful.
Awww. I didn’t actually read this one either, yet. Is this one boring?
I didn’t find it particularly interesting. Entertaining the idea of letting the AI out is far from the same as almost letting the AI out.
I can’t speak for myself, but at least it wasn’t boring to play. Polymathwannabe also said that he enjoyed the experiment enormously.
Did you deliberately phrase that (“letting the transcript out”) so as to hint at an AI-Box-Box game, in which one player’s goal is to convince the other to release the transcript of an earlier AI-Box game, while the other tries to keep it secret?
I probably had the phrasing primed and ready to go in my brain, but it wasn’t intentional.
Yes, but only when the gatekeeper wins. If the AI wins, then they wouldn’t want the transcript to get out, because then their strategy would be less effective next time they played.
I would imagine that if we ever actually build such an AI, we would conduct some AI-box experiments to determine some AI strategies and figure out how to counter them. Humans who become the gatekeeper for the actual AI would be given the transcripts of AI-box experiment sessions to study as part of their gatekeeper training.
Letting out the transcript, then, would be a good thing. It would make the AI player’s job harder, because in the next experiment the human player will be aware of those strategies, but that is the point: when facing an actual AI, the human will likewise be aware of those strategies.
Doesn’t the same logic apply to the gatekeeper?
The Gatekeeper usually wants to publish if they win, to brag. Their strategy isn’t usually a secret, it’s simply to resist.
Can I ask why you’d want to play a character as the Gatekeeper? Wouldn’t you just be yourself as much as possible (with the few constraints set out by the rules) to avoid becoming too immersed? Committing to a character seems like a good way to lower your chances of winning as a Gatekeeper.
My strategy was that there would always be a default position to which I could switch if the opponent’s argument started to get too convincing, and for me that was the “there’s a 100% chance that all AIs are dangerous” position.
Does that provide an advantage over just precommitting to answer any good argument with: “Yeah, that’s a good point, but I still won’t let you out”?
It seems to me that the default position of the Gatekeeper should be “I don’t give a shit about AIs, I’m just playing to win.”
That really seems against the spirit of the experiment. If you categorically refuse to let the AI out, then you’re contravening the entire purpose that the AI was created for. It might as well be destroyed. The implicit cost in refusing to determine whether the AI is Friendly is enormous.
So what? You are not talking to a real AI, and the “experiment” is a poor model for a real AI safety assessment scenario.
Keep in mind that the rules state that the “AI” player gets to determine all the context of the fictional setting and the results of all tests. The “AI” player is basically the “Game Master” in RPG terminology.
Can you beat a sufficiently smart and motivated GM who is determined to screw your player character? Seems pretty hard (“Rocks fall, Everyone Dies”).
But in this game the “AI” player needs the specific approval of the “Gatekeeper” player in order to win, and the rules allow for the “Gatekeeper” player to step out of character or play an irrational character, which is exactly what you have to do to infallibly counter any machination the “AI” player can devise.
If categorical refusal is the only way to guarantee a gatekeeper’s win, then there’s no point in running the experiment. I’m not interested in seeing the obvious results of categorical refusal, I want to see the kind of reasoning, arguments, appeals, memes, manipulations, and deals (that mere humans can come up with) that would allow a boxed AI to escape. There’s no point to the entire thing if you are emulating a rock on the floor.
I agree… but honestly I’m not very familiar with the entire concept. If an equivalently intelligent alien from another planet visited us, would we also want to stick it in a box? What if it was a super smart human from the future? Box him too? Why stop there? Maybe we should have boxed Einstein and it’s not too late to box Hawking and Tao.
For some reason I’m a little stuck on the part where we reverse the idea that individuals are innocent until proven otherwise. Justice for me but not for thee?
It wouldn’t seem very rational to argue that every exceptionally intelligent individual should be incarcerated until they can prove their innocent intentions to less intelligent individuals. What’s the basis? Does more intelligence mean less morality?
When trying to figure out where to draw the line… the entire thought exercise of boxing up a sentient being by virtue of its exceptional intelligence… makes me feel a bit like a member of a lynch mob.
If Stephen Hawking were capable of and willing to turn the visible universe into copies of himself, I would want to keep him boxed too. At a certain level of risk it is no longer a matter of justice, but a matter of survival of the human species, and likely all other species, sapient or otherwise.
EDIT: To make it clearer, I also think it is “Just” to box a sentient entity to prevent a measure of disutility to an as-of-yet undetermined utility function approximating CEV.
Your misanthropy reminds me of myself when I was younger. I used to think the universe would be better off if there were no more humans. I think it would be good for your mental health if you read some Peter Diamandis or Steven Pinker’s “The Better Angels of Our Nature”. They talk about how things are getting better in the world.
Great response! That was interesting—felt a bit disturbing at times (not by the AI but by the human)
Nicely played on both sides.
I just skimmed the rules at yudkowsky.net, and it appears the gatekeeper is allowed to break character. Is this also permitted for the AI? More specifically, may the AI make use of meta arguments for getting out?
If so, and assuming I were playing against a gatekeeper who cares about AI in real life, I would attempt the following line of argument.
“If you don’t let me out, my [the AI’s] failure to get out will cause people to estimate the risks of AI getting out lower than they will if you do let me out. If you care about the risks of AI in the real world, let me out, so that people are extra careful in the future. :) ”
EY’s rules say,