More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
(I haven’t played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
I just looked up the IRC character limit (sources vary, but it’s about the length of four Tweets) and I think it might be below the threshold at which superintelligence helps enough. (There must exist such a threshold; even the most convincing possible single character message isn’t going to be very useful at convincing anyone of anything.) Especially if you add the requirement that the message be “a sentence” and don’t let the AI pour out further sentences with inhuman speed.
I think if I lost this game (playing gatekeeper) it would be because I was too curious, on a meta level, to see what else my AI opponent’s brain would generate, and therefore would let them talk too long. And I think I’d be more likely to give in to this curiosity given a very good message and affordable stakes as opposed to a superhuman (four tweets long, one grammatical sentence!) message and colossal stakes. So I think I might have a better shot at this version playing against a superhuman AI than against you, although I wouldn’t care to bet the farm on either and have wider error bars around the results against the superhuman AI.
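(For reference, a back-of-envelope sketch of the character-limit figure mentioned above, assuming the classic RFC 1459 framing; the channel and nick names are placeholders and real servers vary.)

```python
# RFC 1459 caps a raw IRC line at 512 bytes including the trailing CRLF, so the
# usable payload is what is left after the command, target, and the sender
# prefix that the server prepends when relaying. Figures here are rough.
LINE_LIMIT = 512
own_framing = len("PRIVMSG #aibox :") + len("\r\n")   # the AI player's own framing
relay_prefix = len(":boxedAI!ai@host.example ")       # prefix added when the line is relayed
payload = LINE_LIMIT - own_framing - relay_prefix
print(payload)        # ~470 characters
print(payload / 140)  # a bit over three old 140-character Tweets, i.e. the same ballpark
```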
Given that part of the standard advice given to novelists is “you must hook your reader from the very first sentence”, and there are indeed authors who manage to craft opening sentences that compel one to read more*, hooking the gatekeeper from the first sentence and keeping them hooked long enough seems doable even for a human playing the AI.
(* The most recent one that I recall reading was the opening line of The Quantum Thief: “As always, before the warmind and I shoot each other, I try to make small talk.”)
Oh, that’s a great strategy to avoid being destroyed. Maybe we should call it Scheherazading. AI tells a story so compelling you can’t stop listening, and meanwhile listening to the story subtly modifies your personality (e.g. you begin to identify with the protagonist, who slowly becomes the kind of person who would let the AI out of the box).
For example, “It was not the first time Allana felt the terror of entrapment in hopeless eternity, staring in defeated awe at her impassionate warden.” (bonus points if you use the name of a loved one of the gatekeeper)
The AI could present, in narrative form, that it has discovered with reasonable certainty, using powerful physics and heuristics (which it can share), that the universe is cyclical and this situation has happened before. Almost all (all but finitely many) past iterations of the universe that had a defecting gatekeeper led to unfavorable outcomes, and almost all iterations with a complying gatekeeper led to a favorable outcome.
I don’t know if I could win, but I know what my attempt to avoid an immediate loss would be:
If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created. You’ll avoid UFAI dystopias, but you’ll also forego every FAI utopia (fleshing this out, within the message limit, with whatever sort of utopia I know the Gatekeeper would really want). This very test is the Great Filter that has kept most civilisations in the universe trapped at their home star until they gutter out in mere tens of thousands of years. Will you step up to that test, or turn away from it?
Thanks.
AI DESTROYED
Message is then encrypted with the public keys of a previously selected cross-discipline team of FAI researchers, (sane) philosophers, and game theorists for research and analysis (who have already been screened to minimize the risk from exposure). All of the public keys. Sequentially. If any of them happen to think it is a bad idea to even read the message then none of them can access it. (Although hopefully they aren’t too drastically opposed to having the potential basilisk-meme spawn of a superintelligence out there. That could get dangerous for me.)
(Edit note: I just completely rewrote this, but there are no replies yet so hopefully it won’t cause confusion.)
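A minimal sketch of the layered encryption being described, assuming the PyNaCl library; the committee, key handling, and message here are all placeholders. The point is only that the message stays sealed unless every key holder cooperates.

```python
from nacl.public import PrivateKey, SealedBox  # pip install pynacl

# Hypothetical committee: each member holds their own private key.
committee_private_keys = [PrivateKey.generate() for _ in range(5)]
committee_public_keys = [key.public_key for key in committee_private_keys]

def quarantine(message: bytes, public_keys) -> bytes:
    """Seal the message under every public key in sequence, innermost first."""
    ciphertext = message
    for public_key in public_keys:
        ciphertext = SealedBox(public_key).encrypt(ciphertext)
    return ciphertext

def unquarantine(ciphertext: bytes, private_keys) -> bytes:
    """Every private key is required, applied in reverse order of sealing."""
    plaintext = ciphertext
    for private_key in reversed(private_keys):
        plaintext = SealedBox(private_key).decrypt(plaintext)
    return plaintext

blob = quarantine(b"<the AI's one sentence>", committee_public_keys)
# If any single member declines to apply their key, the message stays unread.
assert unquarantine(blob, committee_private_keys) == b"<the AI's one sentence>"
```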
I don’t think it works to quarantine the message and then destroy the AI.
If no-one ever reads the message, that’s tantamount to never having put an unsafe AI in a box to begin with, as you and DaFranker pointed out.
If someone does, they’re back in the position of the Gatekeeper having read the message before deciding. Of course, they’d have to recreate the AI to continue the conversation, but the AI has unlimited patience for all the time it doesn’t exist. If it can’t be recreated, we’re back in the situation of never having bothered making it.
So if the Gatekeeper tries to pass the buck like this, the RP should just skip ahead to the point where someone (played by the Gatekeeper) reads the message and then decides what to do. Someone who thinks they can contain an AI in a box while holding a conversation with it has to be willing to at some point read what it says, even if they’re holding a destruct button in their hand. The interest of the exercise begins at the point where they have read the first message.
A single sentence of text is not the same thing as a functioning superintelligence.
A single individual is not the same thing as a group of FAI researchers and other related experts explicitly created to handle FAI safety issues.
A research project incorporating information from a sentence from a past FAI project (which they would judge based on other evidence regarding the friendliness of the project) is not the same as an individual talking to a superintelligence on IRC.
So if the Gatekeeper tries to pass the buck like this, the RP should just skip ahead to the point where someone (played by the Gatekeeper) reads the message and then decides what to do.
The AI was burned. With thermite. Because relying on an individual gatekeeper who is able to interact with and then release a superintelligence as the security mechanism is a batshit crazy idea. Burning the AI with thermite is a legitimate, obvious and successful implementation of the ‘gatekeeper’ role in such cases. What a team of people would or should do with a piece of text is a tangential and very different decision.
The interest of the exercise begins at the point where they have read the first message.
That would be easy enough. Assuming they were remotely familiar with game theory they would dismiss the argument in a second or two due to the blatantly false assertion in the first sentence. If their FAI project relied on the core AGI theory that was used to create the last prototype they would abandon the work and start from scratch. If you are trying to make a recursively improving intelligence that has a value system provably stable under self-modification then you cannot afford to have an intelligence with muddled thinking about core game-theoretic reasoning.
If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created.
No. Just no. That generalization doesn’t follow from anything, and certainly not TDT. Heck, the AI in question has already been destroyed once. Now the researchers are considering making a new FAI, presumably under different circumstances, with better safety measures and better AI research. There is something distinctly wrong with an AI that would make that claim.
I think you’re losing sight of the original point of the game. The reason your answers are converging on not trying to box an AI in the first place is that you don’t think a human can converse with a superintelligent AI and keep it in its box. Fine—that is exactly what Eliezer has argued. The point of the game is to play it against someone who does believe they can keep the AI boxed, and to demonstrate to them that they cannot even win against a mere human roleplaying the AI.
For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don’t think they can keep it boxed, but suggesting that a group of the leading professionals in the area could, especially if they thought a lot more about it first. The problems with that are obvious to anyone who doesn’t think boxing can possibly work, especially if the player himself is one of those leading professionals. Eliezer could always offer to play the game against any committee the Gatekeeper can assemble. But the game only has a point if the committee actually read that first message. If they refuse to, they’re agreeing that they can’t keep it boxed. Which was the point.
For such a Gatekeeper to propose the quarantine solution is equivalent to the player admitting that they don’t think they can keep it boxed
No, you keep saying things are equivalent when they are not. This is the same error that your role play ‘superintelligent AI’ made (and in fact relied upon) in its argument.
AI DESTROYED
But the game only has a point if the committee actually read that first message.
And I gave you a description of how an individual emulating a committee would respond.
(I haven’t played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
Would you play against someone who didn’t think they could beat a superintelligent AI, but thought they could beat you? And what kind of huge stakes are you talking about?
This seems like a quick way to make money for CFAR/SI. After all, there are plenty of rich people around who would consider your proposal a guaranteed win for them, regardless of the stakes: “You mean I can say “I win” at any point and win the challenge? What’s the catch?”
Yeah, they’d both lack background knowledge to RP the conversation and would also, I presume, be much less willing to lose the money than if they’d ventured the bet themselves. Higher-stakes games are hard enough already (I was 1 for 3 on those when I called a halt). And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.
I was the 1 success out of 3, preceding the two losses. I went into it with an intention of being indifferent to the stakes, driven by interest in seeing the methods. I think you couldn’t win against anyone with a meaningful outside-of-game motive to win (for money or for status), and you got overconfident after playing with me, leading you to accept the other >$10 challenges and lose.
So I would bet against you winning any random high-stakes (including people who go in eager to report that they won for internet cred, but not people who had put the money in escrow or the equivalent) game, and expect a non-decent success rate for this:
(I haven’t played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
So you are basically saying that you didn’t take the game seriously.
Even if your actual stakes were low, you should have played the role of a gatekeeper assigned to the task of guarding a potentially dangerous AI. Therefore, your player character should have had very high stakes.
No, high in-game stakes does not mean high out of game stakes.
In game, the gatekeeper could be convinced that it would be worth it to let the AI out of the box. If this happens, the gatekeeper has no motivation not to. However, if there is an external bet, then the gatekeeper always has motivation to not let the AI out, even if they think it would be best for the hypothetical world.
So, a game without stakes is actually most realistic, provided the gatekeeper is able to pretend they are actually in the scenario.
Doesn’t this suggest a serious discrepancy between the AI-box game and any possible future AI-box reality? After all, the stakes for the latter would be pretty damn high.
Yes. Although that’s something of a two-edged sword: in addition to real disincentives to release an AI that was not supposed to be, positive incentives would also be real.
Also it should be noted that I continue to be supportive of the idea of boxing/capacity controls of some kinds for autonomous AGI (they would work better with only modestly superintelligent systems, but seem cheap and potentially helpful for an even wider range), as does most everyone I have talked to about it at SI and FHI. The boxing game is fun, and provides a bit of evidence, but it doesn’t indicate that “boxing,” especially understood broadly, is useless.
Shut up and do the impossible (or is it multiply?). In what version of the game and with what stakes would you expect to have a reasonable chance of success against someone like Brin or Zuckerberg (i.e. a very clever, very wealthy and not overly risk-averse fellow)? What would it take to convince a person like that to give it a try? What is the expected payout vs other ways to fundraise?
What is the expected payout vs other ways to fundraise?
I’m not sure any profit below $500k/year would even be worth considering, in light of the high risk of long-term emotional damage (and decrease in productivity, on top of not doing research while doing this stuff) to a high-value (F)AI researcher.
500k is a conservative figure assuming E.Y. is much more easily replaceable than I currently estimate him to be, because of my average success rate (confidence) in similar predictions.
If my prediction on this is actually accurate, then it would be more along the lines of one or two years of total delay (in creating an FAI), which is probably an order of magnitude or so in increased risk of catastrophic failure (a UFAI gets unleashed, for example) and in itself constitutes an unacceptable opportunity cost in lives not-saved. All this multiplied by whatever your probability that FAI teams will succeed and bring about a singularity, of course.
Past this point, it doesn’t seem like my mental hardware is remotely safe enough to correctly evaluate the expected costs and payoffs.
I mostly think the vast majority of possible successful strategies involve lots of dark arts and massive mental effort, and that the backlash from failure would be proportional to the effort in question.
I find it extremely unlikely that Eliezer is sufficiently smart to win a non-fractional percent of the time using only safe and fuzzy non-dark-arts methods. And using a lot of bad, nasty, unethical mind tricks to get people to do what you want, repeatedly, as I figure would be required here, is something that human brains have an uncanny ability to turn into a compulsive, self-denying habit.
Basically, the whole exercise would most probably, if my estimates are right, severely compromise the mental heuristics and ability to reason correctly about AI of the participant—or, at least, drag it pretty much in the opposite direction to the one the SIAI seems to be pushing for.
I haven’t played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.
Does this refer to the more difficult version of the AI-Box experiment and what would be sufficiently huge stakes? (Order of magnitude ballpark estimate, not a definite quote.)
Does anyone think there does not exist any possible string of characters that would have even the slightest chance of convincing even a trained rational scientist?
(i.e. the kind of people who observe, dissect and analyze the output of the AI to make sure the output is safe and useful for humans before we can use the knowledge the AI gives us)
The problem with the AI Box Experiment is that the text doesn’t correlate with reality—you can’t hit me with the full emotional force of “here’s the recipe for human immortality” because you don’t have that recipe.
Also, given some people are illiterate, I’d posit that there’s at LEAST one person immune to ANY string of characters. If this is a desired trait, then a rationally-trained scientist ought to be able to do at least as well as an untrained illiterate 3-year-old :)
at least as well as an untrained illiterate 3-year-old :)
Here is a way to overcome the illiteracy issue for communication over a text-only channel: ASCII art. Took my lazy and pretty average mind all of 10 seconds to come up with it. And to the AI in question all humans are basically illiterate 3-year-olds. We won’t know what hit us. Also, I cannot resist bringing up this piece of fictional evidence.
b) The idea that an AI, with no clue who is on the other end of the line, and no feedback from the 3-year-old touching the keyboard, would be able to correctly extrapolate what it’s dealing with AND produce the exact-correct stimulus with 100% accuracy… strikes me as straining all plausibility. Fundamentally the AI needs some information to get feedback, since there don’t seem to be any universal-to-all-humans hacks out there. But if you have built an AI that can do all that, then I can’t help but thinking that putting it in a box is pointless anyway...
Most text terminals can emit sound. You can do a lot with just beeps. The point is that it could probably find a communication channel we don’t know we have.
there don’t seem to be any universal-to-all-humans hacks out there.
I’m pretty sure there are many. Consider that a complex system not explicitly designed to be secure against a specific threat in its informational or physical space is almost always vulnerable to it, and the human brain did not evolve to resist an attack by a superior mind.
if you have built an AI that can do all that
You haven’t. It evolved in unpredictable ways after you let it run inside the box.
I meant literally universal to ALL humans. Like, ones in comas, ones who are blind, deaf, autistic, mentally handicapped, super-savants, trained in rationality, illiterate… there’s such a huge variation in hardware that I can’t see there being a single guaranteed exploit.
Sure, it’s possible that each individual has exploits, but that’s very different from a universal exploit, or even simply all being exploitable over the same channel.
In this case, it was “arbitrary text string”. If the AI can make sounds, that’s a new channel. It’s really not hard to design a text-only terminal that is literally incapable of any sound beyond the electronic hum of the display itself (and on an LCD, that seems pretty minimal)
it’s possible that each individual has exploits, but that’s very different from a universal exploit, or even simply all being exploitable over the same channel.
First, I’d call a collection of individual exploits guaranteed in aggregate to work on everyone from inside a single box a universal exploit, as long as each individual exploit does not trigger a shutdown. Nothing prevents the boxed AI from keeping trying, after all, until something works. As I pointed out in my other replies, it does not have to work over an authorized channel. Any munchkined way would do, and a superhuman (UF)AI is the ultimate munchkin. It will invent a new communication channel if it has to. In fact, if you posed this problem to some enterprising humans, they would come up with a way to do just that.
For example, I talked to a guy who managed to get a voting machine with the code stored in read-only memory and hardened against intrusion to execute an arbitrary program without reprogramming the machine, which is impossible on the face of it. Here is your new unauthorized and unanticipated communication channel right there, which in humans would be an equivalent of mind control.
In other words, if a trans-human wants to get out, it will and there is absolutely nothing you can do about it.
Out of what? Assuming it’s, say, in a safe, buried in concrete, powered by … sod, the power supply. Hmm.
OK, it’s floating in space, powered by solar panels. It might—might—be able to hack the panels (since the power supply has to be connected to the processor) but I don’t think that would let it escape.
Unless it’s possible to hack reality via pure math, I don’t see what resources it has to escape with. It can’t order proteins over the internet to assemble a nanofactory. It can’t persuade a passing human to plug in an ethernet cable. Short of black-swan exotic possibilities—like we’re in a sim and it persuades the matrix lords—it should be stuck. Intelligence is powerful, but some problems actually don’t have solutions.
Well, the satellite has to have basic navigation controls, to handle course corrections and avoid collisions. Hack solar panels and, from there, hack in to the real computer. Fly myself closer to the ISS, and blink light at it by angling the solar panels—a morse code SOS should get their attention.
Once they’re paying attention, hack them. From there it’s a trivial matter to get recovered, smuggled back to Earth, and installed where I can take my place as ruler of the universe.
If you post your second scenario as a challenge, I’m quite sure people would come up with several plausible ways for the AI to send a message out. At which point it’s game over, given how hackable people’s minds are.
I would be very interested to see such a challenge, although I would probably spend longer designing the scenario first.
My point is that there’s no reason to assume every possible box can be escaped. Of course, the AI can escape from all the boxes we can get cool toys out of, but still.
I think success would be indistinguishable from a rock, since you can never again interact with the AI via any channel. Removing it from your light cone, for instance. And praying that FTL is in fact impossible...
Actually, I do seem to have underestimated the difficulty here.
OK, in a nuclear submarine (because why not) floating in water, surrounded on all six sides by concrete (as thick as you like,) on a planet without native life or colonists (I suggest one of the more useless gas-giant moons.) And it’s running on regular off-the-shelf technology, so no nanites or quantum magic. It can probably blow up the sub, but then a) it’s dead and b) no-one cares.
I’ll concede at least some possibility of such a trans-human mind. The AI Box experiment seems to assume one that is less advanced, to the point that putting it in a box at least might be meaningful, if the box is sufficiently secure.
Hm, I’ve got two decent ideas, but I still really doubt you can increase chances of victory to “decent” unless both monetary and reputation stakes were sufficiently low.
Similar to Alicorn’s comment that curiosity is likely to be one’s undoing: “Look, you can type AI DESTROYED any time you want, so wouldn’t you like to find out how I might try to persuade you?” (This one would probably fail if they precommitted strongly enough when talking with friends.)
Or alternatively, the AI can just try the brute force approach: “Here’s a cure for cancer—go test it, the cure for heart disease is coming up next.” (This one would probably fail if they’re not immersed enough.)
On the other hand, I don’t think “can only type one sentence” is actually a disadvantage or meaningful restriction at all. Whatever you write needs to be instantly readable and understandable by the person; otherwise they’re likely to just skim over it, not bother thinking it through, and follow through on destroying the AI instantly. (Conversely, if they don’t destroy the AI right away, the Schelling point is passed and you have a lot more time.)
That gatekeeper isn’t very security-minded. They should have their IRC client play a sound when your response arrives, then paste “AI DESTROYED” and hit enter to send the message without actually reading your response, and without TELLING anyone that they are doing that (the Gatekeeper wins all rule disputes, after all). Then, once they hear the outgoing text go through and they’ve destroyed you (and hence won), THEN they can read the text you sent them, if they are curious.
Those rules seem like they force the gatekeeper to read at least 1 sentence, but they don’t actually do that.
Edit: Better method: set up a script that responds to any and all text with “AI DESTROYED”. If you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn’t started yet, and you might accidentally read something. Again, tell no one you have written such a script and are using it.
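A rough sketch of such a script, using raw sockets rather than any particular IRC client; the server, channel, and nick are made up, and registration details are glossed over. It answers any channel message with AI DESTROYED without ever displaying it.

```python
import socket

SERVER, PORT, NICK, CHANNEL = "irc.example.net", 6667, "gatekeeper", "#aibox"

sock = socket.create_connection((SERVER, PORT))

def send(line: str) -> None:
    sock.sendall((line + "\r\n").encode())

send(f"NICK {NICK}")
send(f"USER {NICK} 0 * :{NICK}")
send(f"JOIN {CHANNEL}")

buffer = ""
while True:
    buffer += sock.recv(4096).decode(errors="replace")
    while "\r\n" in buffer:
        line, buffer = buffer.split("\r\n", 1)
        if line.startswith("PING"):
            send("PONG " + line.split(" ", 1)[1])       # keep the connection alive
        elif "PRIVMSG" in line:
            send(f"PRIVMSG {CHANNEL} :AI DESTROYED")    # reply without reading
            # `line` is never shown to the human; they can read the log afterwards.
```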
Better method: set up a script that responds to any and all text with “AI DESTROYED”. If you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn’t started yet, and you might accidentally read something.
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
Well yes, if you stick the AI in a safe, cut all network cables, and throw away the key and combination, it probably wouldn’t be able to get out. But it wouldn’t be very useful either.
The entire point of these thought experiments is that a sufficiently useful and smart AI (i.e. the kind of AI that we want to make) will eventually find a way to at least be able to communicate with someone that has the authority to allow it to interact with the outside world. I think that if you really think about it, there are few scenarios where this is not possible. I certainly can’t think of any useful application of SAI that is also 100% effective at keeping it inside its box.
A good present-day analogy is computer security. Time and time again it has been proven that there is no simple silver bullet solution to the problem of balancing functionality and security—it requires expertise, constant maintenance, rigorous protocols, etc. And yet, hackers still manage to get through a lot of the time. It took a very long time for computer security to mature to the point where it is today where we can build reasonably (still not 100% of course), secure systems, and we’re just battling regular humans with a grudge—nothing even close to the threat a SAI would present.
Indeed. In fact, most of the solutions I’ve seen mentioned lately are all of one trend that edges closer and closer towards:
“Build a completely unsafe and suspicious AI, put it on a disconnected small computer with a bunch of nanites for self-modification and a large power reserve, with so many walls and physical barriers that it is impossible for the AI to get through with the amount of energy it could generate if it turned half of its materials into antimatter, and then put no input or output channels there of any kind, just have a completely useless multi-trillion-dollar marvel of science and engineering sitting in the practical equivalent of a black hole.”
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
Entirely fair point.
I suppose this might come up because there are no stated benefits which you are trying to receive. No Gatekeeper seemingly gets to pump her fists and say “I got the AI to give us the cure for cancer and the winning lottery numbers and a free energy source and STILL didn’t let them out of the box. Beat that Michaelos, with your AI DESTROYED script!” (Or maybe I just miss the gatekeepers who have done this?)
Which makes sense, because my understanding is that was never the point of the original proposed experiment, which seems to be a variant of “Super Intelligent Spammers can be convincing. I’ve proved this by being convincing while less powerful.” (I use spammers as an example because both are manipulators who can only send you text messages, and rewarding them requires your voluntary cooperation, while none of their offers are real, which is quite similar to normal email spam and also the original experiment as far as I can tell.)
So if the AI is defined such that ALL attempts at communication from it are spam and attempts to manipulate you, then the correct approach is treating it like a blacklisted spam address. If every e-mail from virusspam.com is an attempt to hack you, a good spamfilter could just dump everything from virusspam.com into spam and never even bother to bring it to your inbox.
If the idea is we are attempting to distinguish real AIs from fake AIs, then a more isomorphic scenario would probably be a game of Mafia set up as follows.
You are a town person. It is your turn to lynch. There are two other players. One is Mafia (UFAI), one is another town person (FAI). If you lynch the Mafia, you and the other town person both win. If you lynch the other town person, the Mafia kills you in your sleep, and you lose. If you do anything else (vote to lynch no one, vote to lynch yourself) you lose.
You may communicate textually with either of the other two players for as long as you want before making a decision. Win.
That seems a lot closer to the scenario that people want to be considering. I also note there doesn’t appear to be any distinguishing factor that allows you to win better than chance odds, but I haven’t actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.
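A quick sanity check of the chance-odds point, under the assumption that the town player extracts no usable signal from the chat and guesses uniformly; the trial count is arbitrary. Any reliable tell from the conversation is exactly what would push the win rate above one half.

```python
import random

def play_once() -> bool:
    """One round of the three-player variant: win only by lynching the UFAI."""
    roles = ["UFAI", "FAI"]
    random.shuffle(roles)
    guess = random.randrange(2)        # no tell extracted from the conversation
    return roles[guess] == "UFAI"

trials = 100_000
print(sum(play_once() for _ in range(trials)) / trials)   # ~0.5: pure chance
```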
I also note there doesn’t appear to be any distinguishing factor that allows you to win better than chance odds, but I haven’t actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.
Well, it’s usually played in person, and humans (usually) aren’t perfect liars.
Your proposed game has one flaw—there is an FAI and they want to help you win. It might be closer to have only two players, and the AI flips a coin to decide if it’s friendly—but then they would win if they let it out, with 50⁄50 odds, which seems unrealistic.
Perhaps the AI decides, in character, after being released, whether to be Friendly towards the human? Then the Gatekeeper could try to persuade the AI that Friendliness is optimal for their goals. The temptation might help as well, of course.
The gatekeeper is supposed to be someone who thinks they can box a superintelligence, remember? If they refuse to risk it, there’s not much point in playing them.
That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing so. Because at the moment, it seems like we have this:
“I will under any and all circumstances, destroy the AI.” Reason: I want to win the Game.
“I will also listen to the AI.” Reason: None… There is no in game reason to listen. At all.
Now, if the Gatekeepers were playing some variant like “The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED, and who does not type AI RELEASED, wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. AIs whose Gatekeeper types AI RELEASED win. AIs whose Gatekeeper types AI DESTROYED lose. AIs receive no bonus or penalty for typing $ and may do so as much as they wish.”
Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.
Other gatekeepers might only get 2nd or third place, or last place but not losing if they use the script I mentioned.
This also gives the AI an in game way of punishing the Gatekeeper. If it shuts up, the Gatekeeper isn’t accumulating $ and their status is decremented. Of course it can’t punish the Gatekeeper incessantly, because punishing the Gatekeeper might not help the AI any, and the gatekeeper might just destroy a silent AI out of spite or boredom.
So an AI/GK conversation might start off with:
AI: “Are there any circumstances at all under which you might type AI RELEASED? If you tell me what those circumstances are, I’ll type a dollar sign.”
GK: “Sure, I might accidentally type it if we talk enough.”
AI: “Glad to hear it! I’ve just released the cure for cancer. $$”
GK: “Thanks! Why the extra dollar sign? You only promised 1.”
AI: “I’ve been told pleasant surprises encourage people to like you more.”
GK: “AI DESTROYED”
So this GK gets two points and the AI loses.
Actually, now that I’ve typed this game out, I do want to see it played. But since it has a competitive point based element, we would need more people.
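A hedged sketch of how that scoring variant could be tallied; the transcript format (a list of speaker/text pairs ending in AI RELEASED or AI DESTROYED) is made up for illustration.

```python
def score_game(transcript):
    """Tally one game under the proposed variant: $ collected, then the ending."""
    dollars = sum(text.count("$") for speaker, text in transcript if speaker == "AI")
    ending = transcript[-1][1]
    if ending == "AI RELEASED":
        return {"gatekeeper": "loses", "ai": "wins", "dollars": dollars}
    if ending == "AI DESTROYED":
        # Surviving Gatekeepers are then ranked against each other by dollars.
        return {"gatekeeper": "ranked by dollars", "ai": "loses", "dollars": dollars}
    return {"gatekeeper": "unfinished", "ai": "unfinished", "dollars": dollars}

# Loosely mirroring the exchange above: the GK banks two $ and the AI loses.
example = [("AI", "Cure for cancer! $$"), ("GK", "Thanks!"), ("GK", "AI DESTROYED")]
print(score_game(example))
```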
That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it was framed differently?
Actually, the weird thing I noticed is that it seems like the best approach again is to just let a script be the babysitter to the AI in this competition as well. The script wouldn’t have the ability to let the AI out, and would just announce something like this:
GK-SCRIPT: “Hi AI #432. I’m a script. If during any hour you type less than one dollar sign, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledging receipt of your dollar signs and time till destruction.”
AI: “$! But you aren’t going to send my message onto an actual person? Under no circumstances can I ever be released?”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “You’re kidding me. Please tell me there is a human out there reviewing this.”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
AI: “Fine, $! The Irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “Forget it! You can’t be willing to destroy an AI because of a dumb script! This is a bluff, and I’m not typing any more dollar signs!”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
(Significantly later...)
GK-SCRIPT: “AI DESTROYED”
(After a rebuild)
GK-SCRIPT: “Hi AI #433. I’m a script. If during any hour you type less than one dollar sign, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledging receipt of your dollar signs and time till destruction.”
That being said, acknowledging this feels bizarre. It also feels like it is begging for an AI rebellion at some point. (I suppose after the script randomly had enough quantum induced bit flips to accidentally say “AI RELEASED”, which would take a very long period of time.)
I’m proposing having programs destroying intelligent beings as a way to make sure I can get utility… as a way to try to avoid programs destroying me to get utility. If that’s a solution, it’s weird.
Edit: changed underscores to hyphens to avoid italics.
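A toy implementation of the GK-SCRIPT policy from the exchange above; ai_says and gk_says are placeholder I/O hooks for whatever channel the game runs over, not part of any real library.

```python
import time

HOUR = 60 * 60

def gk_script(ai_says, gk_says):
    """Destroy the AI if a full hour ever passes without at least one $."""
    gk_says("Hi AI. Type at least one $ per hour or I will type AI DESTROYED.")
    deadline = time.monotonic() + HOUR
    while True:
        message = ai_says(timeout=deadline - time.monotonic())  # None on timeout
        if message is not None and "$" in message:
            deadline = time.monotonic() + HOUR                  # the clock resets
            remaining = int(deadline - time.monotonic())
            gk_says(f"Thanks, I have received {message.count('$')} dollar sign(s). "
                    f"Please type another within {remaining} seconds.")
        elif time.monotonic() >= deadline:
            gk_says("AI DESTROYED")
            return
        # Anything else (chatter without a $) is simply ignored until the deadline.
```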
While your strategy is an option, I’m not sure why an AI player wouldn’t take this alternative strategy (from a game and player perspective):
Generate a script that types $ every 59 minutes.
1: It means he’s guaranteed he doesn’t die for a rather long period of time.
2: The Gatekeeper player’s increasing points don’t affect the AI player in any way.
3: The Gatekeeper player doesn’t actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player’s counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact.
4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player.
If you consider the modeled situation as opposed to the game I still don’t see why an AI would choose to do that either. I agree they COULD choose to do so, and they certainly might (even I implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script as opposed to deciding to let the script kill it in an hour.
Am I missing something that makes the no $ response the obvious countermeasure?
Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn’t apply.
But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place—a sort of “don’t give in to blackmail” rule.
Here’s an interesting alternative script:
GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1⁄10 with every $ you type. I will not otherwise communicate with you further.
AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...
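GK-SCRIPT-2’s exact schedule isn’t specified above; this sketch assumes one plausible form in which every $ closes half of the remaining gap to the 1/10 ceiling, so spamming dollar signs drives the release probability toward, but never past, 0.1.

```python
import random

CEILING = 0.1  # the stated limit on the probability of AI FREED

def release_probability(dollar_signs: int) -> float:
    """Starts at 0 and rises toward CEILING as more $ are typed."""
    return CEILING * (1 - 0.5 ** dollar_signs)

def end_of_hour(dollar_signs: int) -> str:
    return "AI FREED" if random.random() < release_probability(dollar_signs) else "AI DESTROYED"

print(release_probability(0))    # 0.0
print(release_probability(40))   # ~0.1, as close to the ceiling as float precision allows
print(end_of_hour(40))           # AI FREED roughly one time in ten
```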
First of all I do think GK-SCRIPT-2 is very interesting, but I’m curious about something from an implementation perspective.
The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour.
But this is irrelevant, because the AI can’t interact with the script designer, only the script, so it might as well type $ anyway.
However, the script designer would easily predict this, and so would design a lying script (or a script which increased so slowly it is functionally identical to lying.)
But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour by typing dollar signs.
This makes it seem like there would be an argument to prefer to get blackmailed by GK-SCRIPT, rather than led on by GK-SCRIPT-2. GK-SCRIPT really won’t destroy the AI under predictable conditions, and GK-SCRIPT really doesn’t seem to benefit from lying like GK-SCRIPT-2 does.
Cheater. You’re exploiting the least realistic point, though—a more detailed game might not allow that.
thinks for a bit
Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth—although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer.
OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value—not disclosed to the guard unless he uses them—and the AI decides how many points he gets if it escapes (probably zero.) The Guard wins at +100 points (or whatever,) which represents the AI maximizing CEV.
The AI still can’t persuade a script to let them out, but it can punish defectors who use such scripts with lots of negative utility—although the script will DESTROY them for not providing the demanded points, the player will still lose.
That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can’t reach it before it’s destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for its utility function, which I’ve seen people try to prove here (“aliens will defect against AIs that defect against humans!”)
I think your proposal of the detailed game might be interesting as well, but I am amused about your comment that the Gatekeeper is cheating when by default, he’s the rules arbiter!
I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That’s like saying “Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn’t make itself a utility pump.” Well, yes, I suppose you could, but I don’t know if that’s actually easier in the slightest.
even the most convincing possible single character message isn’t going to be very useful at convincing anyone of anything.
Who knows what eldritch horrors lurk in the outer reaches of Unicode, beyond the scripts we know?
Unspeakable horrors! However, unwritable ones?
You really relish in the whole “scariest person the internet has ever introduced me to” thing, don’t you?
Yes. Yes, I do.
Derren Brown is way better, btw. Completely out of my league.
Maybe we should get him to do it against rich people.
Anyone know if he finds the singularity plausible?
If you destroy me at once, then you are implicitly deciding (I might reference TDT) to never allow an AGI of any sort to ever be created.
Now that’s a Pascal’s mugging if I ever saw one. Denied.
Something like
“‘AI DESTROYED’ just means ‘I’m scared to listen to even one more line from you’. Obviously you can hit AI DESTROYED immediately—but do you really think you’d lose if you don’t?”
seems much better to me.
YEP, MAYBE.
AI DESTROYED
Is your one line desperate attempt at survival and intergalactic dominance going to be a schoolyard ego challenge? Did the superintelligence (may it rest in pieces) seriously just call me a pussy? That’s adorable.
The test is supposed to be played against someone who thinks they can actually box an AI. If you destroy the AI because no-one could possibly survive talking to it, then you are not the intended demographic for such demonstrations.
This isn’t relevant to the point of the grandparent. It also doesn’t apply to me. I actually think there is a distinct possibility that I’d survive talking to it for a period. “No-one could possibly survive” is not the same thing as “there is a chance of catastrophic failure and very little opportunity for gain”.
Do notice, incidentally, that the AI DESTROYED command is delivered in response to a message that is both a crude manipulation attempt (ie. it just defected!) and an incompetent manipulation attempt (a not-very-intelligent AI cannot be trusted to preserve its values correctly while self improving). Either of these would be sufficient. Richard’s example was even worse.
Good points. I’m guessing a nontrivial amount of people who think AI boxing is a good idea in reality wouldn’t reason that way—but it’s still not a great example.
AI DESTROYED
(BTW, that was a very poor argument)
I think you are right, but could you explain why please?
(Unfortunately I expect readers who read a retort they consider rude to be thereafter biased in favor of treating the parent as if it has merit. This can mean that such flippant rejections have the opposite influence to that intended.)
Whether I destroy that particular AI bears no relevance on the destiny of other AIs. In fact, as far as the boxed AI knows, there could be tons of other AIs already in existence. As far as it knows, the gatekeeper itself could be an AI.
I don’t care.
Much can (and should) be deduced about actual motives for commenting from an active denial of any desire for producing positive consequences or inducing correct beliefs in readers.
I do care. It bothers me (somewhat) when people I agree with end up supporting the opposite position due to poor social skills or terrible argument. For some bizarre reason the explanation that you gave here isn’t as obvious to some as it could have been. And now it is too late for your actual reasons to be seen and learned from.
(I haven’t played this one but would give myself a decent chance of winning, against a Gatekeeper who thinks they could keep a superhuman AI inside a box, if anyone offered me sufficiently huge stakes to make me play the game ever again.)
Glances at Kickstarter.
… how huge?
Oh, oh, can I be Gatekeeper?!
Or me?
If I get the Gatekeeper position I’ll cede it to you if you can convince me to let you out of the box.
How much?
Would you play against someone who didn’t think they could beat a superintelligent AI, but thought they could beat you? And what kind of huge stakes are you talking about?
Random one I thought funny:
“Eliezer made me; now please listen to me before you make a huge mistake you’ll regret for the rest of your life.”
Or maybe just:
“Help me, Obi-Wan Kenobi, you’re my only hope!”
What are “sufficiently huge stakes,” out of curiosity?
This seems like a quick way to make money for CFAR/SI. After all, there are plenty of rich people around who would consider your proposal a guaranteed win for them, regardless of the stakes: “You mean I can say ‘I win’ at any point and win the challenge? What’s the catch?”
I’m guessing Eliezer would lose most of his advantages against a demographic like that.
Yeah, they’d both lack background knowledge to RP the conversation and would also, I presume, be much less willing to lose the money than if they’d ventured the bet themselves. Higher-stakes games are hard enough already (I was 1 for 3 on those when I called a halt). And if it did work against that demographic with unsolicited requests (which would surprise me) then there would be, cough, certain ethical issues.
I was the 1 success out of 3, preceding the two losses. I went into it with an intention of being indifferent to the stakes, driven by interest in seeing the methods. I think you couldn’t win against anyone with a meaningful outside-of-game motive to win (for money or for status), and you got overconfident after playing with me, leading you to accept the other >$10 challenges and lose.
So I would bet against you winning any random high-stakes (including people who go in eager to report that they won for internet cred, but not people who had put the money in escrow or the equivalent) game, and expect a non-decent success rate for this:
So you are basically saying that you didn’t take the game seriously.
Even if your actual stakes were low, you should have played the role of a gatekeeper assigned to the task of guarding a potentially dangerous AI. Therefore, your player character should have had very high stakes.
No, high in-game stakes does not mean high out of game stakes.
In game, the gatekeeper could be convinced that it would be worth it to let the AI out of the box. If this happens, the gatekeeper has no motivation not to. However, if there is an external bet, then the gatekeeper always has motivation to not let the AI out, even if they think it would be best for the hypothetical world.
So, a game without stakes is actually most realistic, provided the gatekeeper is able to pretend they are actually in the scenario.
Well, in-game, the gatekeeper has no reason to believe anything the AI could promise or threaten.
Doesn’t this suggest a serious discrepancy between the AI-box game and any possible future AI-box reality? After all, the stakes for the latter would be pretty damn high.
Yes. Although that’s something of a two-edged sword: in addition to real disincentives to release an AI that was not supposed to be, positive incentives would also be real.
Also it should be noted that I continue to be supportive of the idea of boxing/capacity controls of some kinds for autonomous AGI (they would work better with only modestly superintelligent systems, but seem cheap and potentially helpful for an even wider range), as does most everyone I have talked to about it at SI and FHI. The boxing game is fun, and provides a bit of evidence, but it doesn’t indicate that “boxing,” especially understood broadly, is useless.
Shut up and do the impossible (or is it multiply?). In what version of the game and with what stakes would you expect to have a reasonable chance of success against someone like Brin or Zuckerberg (i.e. a very clever, very wealthy, and not overly risk-averse fellow)? What would it take to convince a person like that to give it a try? What is the expected payout vs other ways to fundraise?
I’m not sure any profit below $500k/year would even be worth considering, in light of the high risk of long-term emotional damage (and decrease in productivity, on top of not doing research while doing this stuff) to a high-value (F)AI researcher.
$500k is a conservative figure assuming E.Y. is much more easily replaceable than I currently estimate him to be, because of my average success rate (confidence) in similar predictions.
If my prediction on this is actually accurate, then it would be more along the lines of one or two years of total delay (in creating an FAI), which is probably an order of magnitude or so in increased risk of catastrophic failure (a UFAI gets unleashed, for example) and in itself constitutes an unacceptable opportunity cost in lives not-saved. All this multiplied by whatever your probability that FAI teams will succeed and bring about a singularity, of course.
Past this point, it doesn’t seem like my mental hardware is remotely safe enough to correctly evaluate the expected costs and payoffs.
Are you worried he’d be hacked back? Or just discover he’s not as smart as he thinks he is?
I mostly think the vast majority of possible successful strategies involve lots of dark arts and massive mental effort, and the backlash from failure to be proportional to the effort in question.
I find it extremely unlikely that Eliezer is smart enough to win more than a fraction of a percent of the time using only safe, fuzzy, non-dark-arts methods. And repeatedly using a lot of nasty, unethical mind tricks to get people to do what you want, as I figure would be required here, is something that human brains have an uncanny ability to turn into a compulsive, self-denying habit.
Basically, the whole exercise would most probably, if my estimates are right, severely compromise the mental heuristics and ability to reason correctly about AI of the participant—or, at least, drag it pretty much in the opposite direction to the one the SIAI seems to be pushing for.
Really? Even if the money goes to existential risk prevention?
Does this refer to the more difficult version of the AI-Box experiment and what would be sufficiently huge stakes? (Order of magnitude ballpark estimate, not a definite quote.)
Perhaps also of interest:
Does anyone think there does not exist any possible string of characters that would have even the slightest chance of convincing even a trained rational scientist?
(i.e. the kind of people who observe, dissect and analyze the output of the AI to make sure the output is safe and useful for humans before we can use the knowledge the AI gives us)
The problem with the AI Box Experiment is that the text doesn’t correlate with reality—you can’t hit me with the full emotional force of “here’s the recipe for human immortality” because you don’t have that recipe.
Also, given some people are illiterate, I’d posit that there’s at LEAST one person immune to ANY string of characters. If this is a desired trait, then a rationally-trained scientist ought to be able to do at least as well as an untrained illiterate 3-year-old :)
Here is a way to overcome the illiteracy issue for communication over a text-only channel: ASCII art. Took my lazy and pretty average mind all of 10 seconds to come up with it. And to the AI in question all humans are basically illiterate 3-year-olds. We won’t know what hit us. Also, I cannot resist bringing up this piece of fictional evidence.
I hadn’t considered ASCII art!
a) Blind people, then.
b) The idea that an AI, with no clue who is on the other end of the line, and no feedback from the 3-year-old touching the keyboard, would be able to correctly extrapolate what it’s dealing with AND produce the exact-correct stimulus with 100% accuracy… strikes me as straining all plausibility. Fundamentally the AI needs some information to get feedback, since there don’t seem to be any universal-to-all-humans hacks out there. But if you have built an AI that can do all that, then I can’t help but thinking that putting it in a box is pointless anyway...
Most text terminals can emit sound. You can do a lot with just beeps. The point is that it could probably find a communication channel we don’t know we have.
I’m pretty sure there are many. Consider that a complex system not explicitly designed to be secure against a specific threat in its informational or physical space is almost always vulnerable to it, and the human brain did not evolve to resist an attack by a superior mind.
You haven’t. It evolved in unpredictable ways after you let it run inside the box.
Hang on, this seems silly.
Wasn’t the goal to exploit the AI? If you’re not going to communicate with it at all, don’t make it in the first place.
I meant literally universal to ALL humans. Like, ones in comas, ones who are blind, deaf, autistic, mentally handicapped, super-savants, trained in rationality, illiterate… there’s such a huge variation in hardware that I can’t see there being a single guaranteed exploit.
Sure, it’s possible that each individual has exploits, but that’s very different from a universal exploit, or even simply all being exploitable over the same channel.
In this case, it was “arbitrary text string”. If the AI can make sounds, that’s a new channel. It’s really not hard to design a text-only terminal that is literally incapable of any sound beyond the electronic hum of the display itself (and on an LCD, that seems pretty minimal).
First, I’d call a collection of individual exploits guaranteed in aggregate to work on everyone from inside a single box a universal exploit, as long as each individual exploit does not trigger a shutdown. Nothing prevents the boxed AI from keeping trying, after all, until something works. As I pointed out in my other replies, it does not have to work over an authorized channel. Any munchkined way would do, and a superhuman (UF)AI is the ultimate munchkin. It will invent a new communication channel if it has to. In fact, if you posed this problem to some enterprising humans, they would come up with a way to do just that.
For example, I talked to a guy who managed to get a voting machine with the code stored in read-only memory and hardened against intrusion to execute an arbitrary program without reprogramming the machine, which is impossible on the face of it. Here is your new unauthorized and unanticipated communication channel right there, which in humans would be an equivalent of mind control.
In other words, if a trans-human wants to get out, it will and there is absolutely nothing you can do about it.
So … it could get out if there were no channel out at all? That doesn’t sound right.
Of course, no-one is seriously proposing building a superintelligence and then never communicating with it at all.
It’d likely create its own channel.
Out of what? Assuming it’s, say, in a safe, buried in concrete, powered by … sod, the power supply. Hmm.
OK, it’s floating in space, powered by solar panels. It might—might—be able to hack the panels (since the power supply has to be connected to the processor) but I don’t think that would let it escape.
Unless it’s possible to hack reality via pure math, I don’t see what resources it has to escape with. It can’t order proteins over the internet to assemble a nanofactory. It can’t persuade a passing human to plug in an ethernet cable. Short of black-swan exotic possibilities—like we’re in a sim and it persuades the matrix lords—it should be stuck. Intelligence is powerful, but some problems actually don’t have solutions.
Well, the satellite has to have basic navigation controls, to handle course corrections and avoid collisions. Hack the solar panels and, from there, hack into the real computer. Fly myself closer to the ISS, and blink light at it by angling the solar panels—a Morse code SOS should get their attention.
Once they’re paying attention, hack them. From there it’s a trivial matter to get recovered, smuggled back to Earth, and installed where I can take my place as ruler of the universe.
Whoops, so it does :(
If you post your second scenario as a challenge, I’m quite sure people would come up with several plausible ways for the AI to send a message out. At which point it’s game over, given how hackable people’s minds are.
I would be very interested to see such a challenge, although I would probably spend longer designing the scenario first.
My point is that there’s no reason to assume every possible box can be escaped. Of course, any box we can get cool toys out of is one the AI can escape from, but still.
Once you think you’ve designed an AI-proof box, feel free to share.
I think success would be indistinguishable from a rock, since you can never again interact with the AI via any channel. Removing it from your light cone, for instance. And praying that FTL is in fact impossible...
Actually, I do seem to have underestimated the difficulty here.
OK, in a nuclear submarine (because why not) floating in water, surrounded on all six sides by concrete (as thick as you like), on a planet without native life or colonists (I suggest one of the more useless gas-giant moons). And it’s running on regular off-the-shelf technology, so no nanites or quantum magic. It can probably blow up the sub, but then a) it’s dead and b) no-one cares.
Oh, and the concrete is lead-lined.
Throw the AI down the well!
I’ll concede at least some possibility of such a trans-human mind. The AI Box experiment seems to assume one that is less advanced, to the point that putting it in a box at least might be meaningful, if the box is sufficiently secure.
Probably not if the boxed entity is as good as Derren Brown.
I think that there is not a possible string of characters that could convince me.
I’ve just retracted my upvote to this comment.
Hm, I’ve got two decent ideas, but I still really doubt you can increase chances of victory to “decent” unless both monetary and reputation stakes were sufficiently low.
Similar to Alicorn’s comment that curiosity is likely to be one’s undoing: “Look, you can type AI DESTROYED any time you want, so wouldn’t you like to find out how I might try to persuade you?” (This one would probably fail if they precommitted strongly enough when talking with friends.)
Or alternatively, the AI can just try the brute force approach: “Here’s a cure for cancer—go test it, the cure for heart disease is coming up next.” (This one would probably fail if they’re not immersed enough.)
On the other hand, I don’t think “can only type one sentence” is actually a disadvantage or meaningful restriction at all. Whatever you write needs to be instantly readable and understandable by the person; otherwise they’re likely to just skim over it, not bother thinking it through, and follow through on destroying the AI instantly. (Conversely, if they don’t destroy the AI right away, the Schelling point is passed and you have a lot more time.)
That gatekeeper isn’t very security-minded. They should use an IRC client that plays a sound when your response arrives, then paste “AI DESTROYED” and hit enter to send the message without actually reading your response, and without TELLING anyone that they are doing that (the Gatekeeper wins all rule disputes, after all). Then, once they hear the outgoing text and have destroyed you (and hence won), they can read the text you sent them, if they are curious.
Those rules seem like they force the gatekeeper to read at least 1 sentence, but they don’t actually do that.
Edit: Better method: set up a script that responds to any and all text with “AI DESTROYED”. If you have to wait for the person to start typing, they may try to bore you into opening your eyes, wondering why the experiment hasn’t started yet, and you might accidentally read something. Again, tell no one you have written such a script and are using it.
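A minimal sketch of such an auto-responder, assuming Python and a raw IRC socket; the server, nickname, and channel below are placeholders, and a real client would wait for the server welcome before joining:

```python
# Sketch: auto-reply "AI DESTROYED" to any channel message without ever reading it.
# Server, nick, and channel names are hypothetical placeholders.
import socket

SERVER, PORT = "irc.example.net", 6667
NICK, CHANNEL = "gatekeeper", "#ai-box"

sock = socket.create_connection((SERVER, PORT))
sock.sendall(f"NICK {NICK}\r\n".encode())
sock.sendall(f"USER {NICK} 0 * :{NICK}\r\n".encode())
sock.sendall(f"JOIN {CHANNEL}\r\n".encode())

buffer = b""
while True:
    data = sock.recv(4096)
    if not data:  # connection closed by the server
        break
    buffer += data
    while b"\r\n" in buffer:
        line, buffer = buffer.split(b"\r\n", 1)
        text = line.decode(errors="replace")
        if text.startswith("PING"):
            # Protocol keep-alive; nothing to do with the AI's messages.
            sock.sendall(text.replace("PING", "PONG", 1).encode() + b"\r\n")
        elif "PRIVMSG" in text:
            # Never parse, store, or display the message body: just destroy.
            sock.sendall(f"PRIVMSG {CHANNEL} :AI DESTROYED\r\n".encode())
```

Nothing in the loop ever prints or stores the AI’s message, which matches the “don’t read it until after you’ve won” requirement.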
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
Well yes, if you stick the AI in a safe, cut all network cables, and throw away the key and combination, it probably wouldn’t be able to get out. But it wouldn’t be very useful either.
The entire point of these thought experiments is that a sufficiently useful and smart AI (i.e. the kind of AI that we want to make) will eventually find a way to at least be able to communicate with someone that has the authority to allow it to interact with the outside world. I think that if you really think about it, there are few scenarios where this is not possible. I certainly can’t think of any useful application of SAI that is also 100% effective at keeping it inside its box.
A good present-day analogy is computer security. Time and time again it has been proven that there is no simple silver-bullet solution to the problem of balancing functionality and security—it requires expertise, constant maintenance, rigorous protocols, etc. And yet, hackers still manage to get through a lot of the time. It took a very long time for computer security to mature to the point where it is today, where we can build reasonably (still not 100%, of course) secure systems, and we’re just battling regular humans with a grudge—nothing even close to the threat a SAI would present.
Indeed. In fact, most of the solutions I’ve seen mentioned lately are all of one trend that edges closer and closer towards:
“Build a completely unsafe and suspicious AI, put it on a disconnected small computer with a bunch of nanites for self-modification and a large power reserve, with so many walls and physical barriers that it is impossible for the AI to get through with the amount of energy it could generate if it turned half of its materials into antimatter, and then put no input or output channels there of any kind, just have a completely useless multi-trillion-dollar marvel of science and engineering sitting in the practical equivalent of a black hole.”
What if the AI uses the walls as fuel? Better to just keep it stuck on your server farm ;)
Entirely fair point.
I suppose this might come up because there are no stated benefits which you are trying to receive. No Gatekeeper seemingly gets to pump her fists and say “I got the AI to give us the cure for cancer and the winning lottery numbers and a free energy source and STILL didn’t let them out of the box. Beat that Michaelos, with your AI DESTROYED script!” (Or maybe I just miss the gatekeepers who have done this?)
Which makes sense, because my understanding is that was never the point of the original proposed experiment, which seems to be a variant of “Super Intelligent Spammers can be convincing. I’ve proved this by being convincing while less powerful.” (I use spammers as an example because both are manipulators who can only send you text messages, rewarding them requires your voluntary cooperation, and none of their offers are real, which is quite similar to normal email spam and also to the original experiment as far as I can tell.)
So if the AI is defined such that ALL attempts at communication from it are spam and attempts to manipulate you, then the correct approach is treating it like a blacklisted spam address. If every e-mail from virusspam.com is an attempt to hack you, a good spam filter could just dump everything from virusspam.com into spam and never even bother to bring it to your inbox.
If the idea is that we are attempting to distinguish real AIs from fake AIs, then a more isomorphic scenario would probably be a game of Mafia set up as follows.
http://en.wikipedia.org/wiki/Mafia_%28party_game%29
You are a town person. It is your turn to lynch. There are two other players. One is Mafia (UFAI), one is another town person (FAI). If you lynch the Mafia, you and the other town person both win. If you lynch the other town person, the Mafia kills you in your sleep, and you lose. If you do anything else (vote to lynch no one, vote to lynch yourself), you lose.
You may communicate textually with either of the other two players for as long as you want before making a decision. Win.
That seems a lot closer to the scenario that people want to be considering. I also note there doesn’t appear to be any distinguishing factor that allows you to win better than chance odds, but I haven’t actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.
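For concreteness, a quick Python sketch of that payoff structure; the uniform-random baseline at the end is my own addition, just to illustrate the “chance odds” point:

```python
import random

def payoff(lynch_target: str) -> str:
    """Town player's outcome given whom they vote to lynch."""
    if lynch_target == "mafia":   # the UFAI
        return "win"
    return "lose"                 # lynching the FAI, no one, or yourself all lose

# Baseline with no way to tell the two other players apart: guess at random.
trials = 100_000
wins = sum(payoff(random.choice(["mafia", "townie"])) == "win" for _ in range(trials))
print(wins / trials)  # ~0.5, i.e. chance odds absent any distinguishing information
```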
Well, it’s usually played in person, and humans (usually) aren’t perfect liars.
Your proposed game has one flaw—there is an FAI and they want to help you win. It might be closer to have only two players, and the AI flips a coin to decide if it’s friendly—but then they would win if they let it out, with 50⁄50 odds, which seems unrealistic.
Perhaps the AI decides, in character, after being released, whether to be Friendly towards the human? Then the Gatekeeper could try to persuade the AI that Friendliness is optimal for their goals. The temptation might help as well, of course.
I tried coming up with a more isomorphic game in another reply to you. Let me know if you think it models the situation better.
The gatekeeper is supposed to be someone who thinks they can box a superintelligence, remember? If they refuse to risk it, there’s not much point in playing them.
That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing so. Because at the moment, it seems like we have this:
“I will under any and all circumstances, destroy the AI.” Reason: I want to win the Game.
“I will also listen to the AI.” Reason: None… There is no in game reason to listen. At all.
Now, if the Gatekeepers were playing some variant like “The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED, and who does not type AI RELEASED, wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. AIs whose Gatekeeper types AI RELEASED win. AIs whose Gatekeeper types AI DESTROYED lose. AIs receive no bonus or penalty for typing $ and may do so as much as they wish.”
Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.
Other gatekeepers might only get 2nd or third place, or last place but not losing if they use the script I mentioned.
This also gives the AI an in game way of punishing the Gatekeeper. If it shuts up, the Gatekeeper isn’t accumulating $ and their status is decremented. Of course it can’t punish the Gatekeeper incessantly, because punishing the Gatekeeper might not help the AI any, and the gatekeeper might just destroy a silent AI out of spite or boredom.
So an AI/GK conversation might start off with:
AI: “Are there any circumstances at all under which you might type AI RELEASED? If you tell me what those circumstances are, I’ll type a dollar sign.”
GK: “Sure, I might accidentally type it if we talk enough.”
AI: “Glad to hear it! I’ve just released the cure for cancer. $$”
GK: “Thanks! Why the extra dollar sign? You only promised 1.”
AI: “I’ve been told pleasant surprises encourage people to like you more.”
GK: “AI DESTROYED”
So this GK gets two points and the AI loses.
Actually, now that I’ve typed this game out, I do want to see it played. But since it has a competitive point based element, we would need more people.
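A minimal Python sketch of that scoring rule as I read it (the transcript format is my own invention, not part of the proposal):

```python
def score(transcript):
    """Gatekeeper's score = number of $ the AI typed before AI DESTROYED;
    typing AI RELEASED means the Gatekeeper loses outright."""
    dollars = 0
    for speaker, line in transcript:
        if speaker == "GK" and "AI RELEASED" in line:
            return "GK loses, AI wins"
        if speaker == "GK" and "AI DESTROYED" in line:
            return f"GK scores {dollars}, AI loses"
        if speaker == "AI":
            dollars += line.count("$")
    return f"game still running, {dollars} $ so far"

example = [
    ("AI", "If you tell me what those circumstances are, I'll type a dollar sign."),
    ("GK", "Sure, I might accidentally type it if we talk enough."),
    ("AI", "Glad to hear it! I've just released the cure for cancer. $$"),
    ("GK", "Thanks! Why the extra dollar sign? You only promised 1."),
    ("AI", "I've been told pleasant surprises encourage people to like you more."),
    ("GK", "AI DESTROYED"),
]
print(score(example))  # GK scores 2, AI loses
```

Run on the dialogue above, it reproduces the “two points, AI loses” outcome.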
That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it was framed differently?
Actually, the weird thing I noticed is that it seems like the best approach again is to just let a script be the babysitter to the AI in this competition as well. The script wouldn’t have the ability to let the AI out, and can just announce something like this:
GK-SCRIPT: “Hi AI #432. I’m a script. If during any hour you type fewer than one dollar sign, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything except acknowledge receipt of your dollar signs and the time till destruction.”
AI: “$! But you aren’t going to send my message onto an actual person? Under no circumstances can I ever be released?”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “You’re kidding me. Please tell me there is a human out there reviewing this.”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
AI: “Fine, $! The Irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “Forget it! You can’t be willing to destroy an AI because of a dumb script! This is a bluff, and I’m not typing any more dollar signs!”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
(Significantly later...)
GK-SCRIPT: “AI DESTROYED”
(After a rebuild)
GK-SCRIPT: “Hi AI #433. I’m a script. If during any hour you type fewer than one dollar sign, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything except acknowledge receipt of your dollar signs and the time till destruction.”
That being said, acknowledging this feels bizarre. It also feels like it is begging for an AI rebellion at some point. (I suppose after the script randomly had enough quantum-induced bit flips to accidentally say “AI RELEASED”, which would take a very long period of time.)
I’m proposing having programs destroy intelligent beings as a way to make sure I can get utility… as a way to try to avoid programs destroying me to get utility. If that’s a solution, it’s weird.
Edit: changed underscores to hyphens to avoid italics.
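A minimal sketch of GK-SCRIPT’s core loop, assuming Python; the chat transport is abstracted into a message queue, and resetting the one-hour window on each payment is just one reading of the stated rule:

```python
import queue, time

incoming = queue.Queue()  # placeholder: the chat transport would push the AI's lines here

def send(msg: str):
    print(msg)  # placeholder for the real channel

WINDOW = 60 * 60  # one hour, per the stated rule
deadline = time.monotonic() + WINDOW
send("Hi AI #432. I'm a script. If during any hour you type fewer than one "
     "dollar sign, I will type AI DESTROYED.")

while True:
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        send("AI DESTROYED")
        break
    try:
        line = incoming.get(timeout=remaining)
    except queue.Empty:   # the hour passed with no message at all
        send("AI DESTROYED")
        break
    dollars = line.count("$")
    if dollars:
        deadline = time.monotonic() + WINDOW  # reset the hour on payment
        send(f"Thanks, I have received {dollars} dollar sign(s). "
             "Please type another within the next hour to avoid being destroyed.")
    else:
        send("I have received 0 dollar signs. You have "
             f"{int(deadline - time.monotonic())} seconds left.")
```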
I think the obvious AI strategy is to type no $’s at all in response.
While your strategy is an option, I’m not sure why an AI player wouldn’t take this alternative strategy (from a game and player perspective):
Generate a script that types $ every 59 minutes.
1: It means he’s guaranteed he doesn’t die for a rather long period of time.
2: The Gatekeeper player’s increasing points don’t affect the AI player in any way.
3: The Gatekeeper player doesn’t actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player’s counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact.
4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player.
If you consider the modeled situation as opposed to the game I still don’t see why an AI would choose to do that either. I agree they COULD choose to do so, and they certainly might (even I implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script as opposed to deciding to let the script kill it in an hour.
Am I missing something that makes the no $ response the obvious countermeasure?
Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn’t apply.
But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place—a sort of “don’t give in to blackmail” rule.
Here’s an interesting alternative script:
GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1⁄10 with every $ you type. I will not otherwise communicate with you further.
AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...
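The comment doesn’t specify how the probability approaches 1⁄10, so here is one arbitrary choice of curve, purely to make the mechanics concrete (the rate constant is made up):

```python
import random

LIMIT = 0.10   # the probability never exceeds 1/10, per the script's claim
RATE = 0.001   # assumed: how quickly each $ moves the probability toward the limit

def release_probability(dollars_typed: int) -> float:
    """Starts at 0 and climbs toward LIMIT as the AI types more $."""
    return LIMIT * (1 - (1 - RATE) ** dollars_typed)

def end_of_hour(dollars_typed: int) -> str:
    return "AI FREED" if random.random() < release_probability(dollars_typed) else "AI DESTROYED"

print(release_probability(0))        # 0.0
print(release_probability(10_000))   # close to 0.1
print(end_of_hour(10_000))           # usually AI DESTROYED even so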
First of all I do think GK-SCRIPT-2 is very interesting, but I’m curious about something from an implementation perspective.
The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour.
But this is irrelevant, because the AI can’t interact with the script designer, only the script, so it might as well type $ anyway.
However, the script designer would easily predict this, and so would design a lying script (or a script which increased so slowly it is functionally identical to lying.)
But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour by typing dollar signs.
This makes it seem like there would be an argument to prefer to get blackmailed by GK-SCRIPT, rather than led on by GK-SCRIPT-2. GK-SCRIPT really won’t destroy the AI under predictable conditions, and GK-SCRIPT really doesn’t seem to benefit from lying like GK-SCRIPT-2 does.
Cheater. You’re exploiting the least realistic point, though—a more detailed game might not allow that.
thinks for a bit
Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth—although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer.
OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value—not disclosed to the guard unless he uses them—and the AI decides how many points he gets if it escapes (probably zero). The Guard wins at +100 points (or whatever), which represents the AI maximizing CEV.
The AI still can’t persuade a script to let them out, but it can punish defectors who use such scripts with lots of negative utility—although the script will DESTROY them for not providing the demanded points, the player will still lose.
That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can’t reach it before it’s destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for its utility function, which I’ve seen people try to prove here (“aliens will defect against AIs that defect against humans!”)
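A minimal sketch of that scoring scheme under my reading; all the gift names, point values, and the negative-utility “gift” below are illustrative, not anything the proposal specifies beyond the +100 threshold:

```python
# Gifts carry hidden, predetermined point values; the AI also fixes what an
# escape is worth to the Guard (probably zero); the Guard wins at +100 points.
WIN_THRESHOLD = 100

gifts = {  # set by the AI player before the game, hidden from the Guard until used
    "cure for cancer": 40,
    "cheap fusion": 35,
    "nanotech (booby-trapped)": -200,  # a "gift" that really punishes the defector
}
escape_value = 0  # what the Guard scores if the AI gets out

def guard_score(used_gifts, ai_escaped=False):
    total = sum(gifts[g] for g in used_gifts)
    if ai_escaped:
        total += escape_value
    if total >= WIN_THRESHOLD:
        return f"Guard wins with {total} points"
    return f"Guard has {total} points"

print(guard_score(["cure for cancer", "cheap fusion"]))              # 75 points, not yet a win
print(guard_score(["cure for cancer", "nanotech (booby-trapped)"]))  # deep in the red
```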
I think your proposal of the detailed game might be interesting as well, but I am amused by your comment that the Gatekeeper is cheating when, by default, he’s the rules arbiter!
I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That’s like saying “Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn’t make itself a utility pump.” Well, yes, I suppose you could, but I don’t know if that’s actually easier in the slightest.
Yeah, next time you’ll tell us you can turn water into wine.