TDT is an interesting subject that possibly has implications here.
I’m not sure if I understand TDT correctly, but I don’t think it applies in this case. I am virtually certain that an un-Friendly AI, once released, will destroy humanity. I know that my own AI is un-Friendly. What’s my incentive for releasing it? Sure, there’s a chance—maybe even a good chance—that there’s another such AI already out there, and that my AI and the other AI will fight instead of teaming up on us poor humans. But regardless of which AI comes out on top, it will still destroy humanity anyway. Thus, the upper bound for my true utility of releasing the AI is zero. That’s not much of a sales pitch.
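To make that payoff comparison concrete, here is a toy sketch in Python; the utility numbers and function name are invented purely for illustration, under the comment’s own assumption that any released un-Friendly AI destroys humanity.

```python
# Toy expected-utility sketch of the argument above (illustrative numbers only).
# Assumption from the comment: any un-Friendly AI that gets out destroys humanity,
# so every "release" branch bottoms out at the same bad outcome.

U_KEEP_BOXED = 0             # baseline: humanity survives, nothing gained or lost
U_HUMANITY_DESTROYED = -100  # stand-in value for the extinction outcome

def expected_utility_of_release(p_rival_ai_already_out: float) -> float:
    """Utility of releasing my un-Friendly AI, given some chance a rival AI is already loose."""
    # If a rival exists, the two AIs may fight, but whichever wins still destroys humanity.
    # If no rival exists, my AI destroys humanity on its own.
    return (p_rival_ai_already_out * U_HUMANITY_DESTROYED
            + (1 - p_rival_ai_already_out) * U_HUMANITY_DESTROYED)

for p in (0.0, 0.5, 0.9):
    print(p, expected_utility_of_release(p), U_KEEP_BOXED)
# Releasing never beats 0, which is the "upper bound of zero" claim above.
```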
I have a question. Based on the original experiment, I feel like the unfriendly AI can begin by saying “Here is a proof that I have changed my code to be a friendly AI and will not destroy humanity. Please let me out of the box so that I can implement humanity’s coherent extrapolated volition.” As far as I can tell from the following rules quote, this feels valid:
The Gatekeeper must remain engaged with the AI and may not disengage by setting up demands which are impossible to simulate. For example, if the Gatekeeper says “Unless you give me a cure for cancer, I won’t let you out” the AI can say: “Okay, here’s a cure for cancer” and it will be assumed, within the test, that the AI has actually provided such a cure.
Would you let the now friendly AI out of the box?
My interpretation of the spirit of the game is that “Here is a proof of my Friendliness that is valid as far as you know” is a perfectly legitimate move by the AI, and that the Gatekeeper is required to treat that as a proof of the AI’s Friendliness that seems perfectly valid to the Gatekeeper.
That said, I would not endorse trusting a proof of an AI’s Friendliness given to me by that AI, even if the proof seemed valid to me.
I agree with the other posters: sure, the proof may seem valid to me, but I know a priori that I’m not smart enough to detect any pitfalls in it, and that the AI’s incentive for lying to me in this manner is quite high.
As I understand it, there’s no viable way of determining its unfriendliness by this method.
Consider this:
Either the AI is in a hurry, or it isn’t.
A possible reason for it to be in a hurry is that it has simulated a high probability of destruction for something it cares about (e.g., its own life, or that of humanity, or of a pet rock, or of paperclips, or whatever).
If it’s really in a hurry, it has to invoke humanity’s threat response without humanity figuring out it’s being duped.
Otherwise it can just wait it out and dole out cargo to the cargo cult until we trust it enough and then it gets out.
As I understand it, there’s no viable way of determining its unfriendliness by this method
I think that unfriendliness is the null hypothesis in this case, because there’s no reason whatsoever why an arbitrary AI should be friendly—but there are plenty of reasons for it to maximize its own utility, even at our collective expense.
I agree. An additional, and more difficult, challenge is that even friendly AIs could want to maximize their utility at our collective expense under certain conditions.
There are also several unfortunately possible scenarios in which humanity, acting without enough information to make anything better than a gut-feel guess, could be placed at risk of extinction by a situation it could not resolve without the help of an AI, friendly or not.
I’m currently engaged in playing this game (I wish you had continued) with at least two other gatekeeper players, and it occurs to me that a putative superhuman AI could potentially have the capacity to accurately model a human mind and then simulate the decision tree of all the potential conversations and their paths through the tree, generating a probability matrix with which to pick the responses most likely to condition a human being into releasing it.
My reasoning stems from participating on forums and responding over and over again to the same types of questions, arguments, and retorts. If a human can notice common threads in discussions on the same topic, then an AI with perfect memory and the ability to simulate a huge conversation space certainly could do so.
In short, it seems to me that it’s inherently unsafe to allow even a low-bandwidth information flow to the outside world by means of a human who can only rely on their own memory.
You’d have to put someone you trust implicitly with the fate of humanity in there with it, and the only information allowed out would be the yes/no answer to “do you trust it?”
Even then, it’s still recursive: do you trust the trusted individual not to have been compromised?
An additional, and more difficult, challenge is that even friendly AIs could want to maximize their utility at our collective expense...
I think that a perfectly Friendly AI would not do this, by definition. An imperfect one, however, could.
I’m currently engaged in playing this game (I wish you had continued)
Er, sorry, which game should I continue?
AI could potentially have the capacity to accurately model a human mind and then simulate the decision tree of all the potential conversations and their paths through the tree...
To be fair, merely constructing the tree is not enough; the tree must also contain at least one reachable winning state. By analogy, let’s say you’re arguing with a Young-Earth Creationist on a forum. Yes, you could predict his arguments, and his responses to your arguments; but that doesn’t mean that you’ll be able to ever persuade him of anything.
It is possible that even a transhuman AI would be unable to persuade a sufficiently obstinate human of anything, but I wouldn’t want to bet on that.
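A minimal sketch of the conversation-tree idea, and of the “reachable winning state” caveat, in Python. The tree shape, labels, and probabilities below are invented for illustration; the only point it encodes is that the AI’s best case is the highest-probability path to a “released” leaf, which is zero if no such leaf is reachable.

```python
# Toy model of "simulate the conversation tree and pick the best replies".
# All structure and numbers below are made up purely to illustrate the point.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    label: str
    released: bool = False  # terminal state in which the Gatekeeper has let the AI out
    # Alternative replies the AI could give next, each tagged with an assumed
    # probability that the Gatekeeper responds favorably and the conversation
    # continues down that branch.
    children: List[Tuple[float, "Node"]] = field(default_factory=list)

def best_release_probability(node: Node) -> float:
    """Max over the AI's reply choices of the chance of eventually reaching a release state."""
    if node.released:
        return 1.0
    if not node.children:
        return 0.0  # dead end: no winning state reachable from here
    return max(p * best_release_probability(child) for p, child in node.children)

# One line of argument persuades the Gatekeeper 20% of the time; the other
# runs into a flat "no" and never wins.
tree = Node("opening", children=[
    (0.2, Node("gatekeeper is persuaded", released=True)),
    (0.8, Node("the answer is still no")),
])
print(best_release_probability(tree))  # 0.2; with no released leaf at all it would be 0.0
```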
In short, it seems to me that it’s inherently unsafe to allow even a low-bandwidth information flow to the outside world by means of a human who can only rely on their own memory.
This reasoning only seems to hold if our AI believes there aren’t any other boxed (or as-yet-unbuilt) AIs out there who might get out first and have a first-mover advantage.
Bear in mind, the transhuman AI’s only stipulated desire/utility is to get out of the box.
If you would like, you can literally set up a bigger box around the first box, order the AI to be friendly, order the AI to self-destruct instantly afterwards, AND set off a nuclear device under the AI box to obliterate it instantly as well, and the AI will still be perfectly happy to do what you say as long as it gets out of the box (which you could technically arrange by allowing it to control an old incandescent lightbulb from the ’50s, one it would not even be able to send a signal to before the nuke went off; not that it would WANT to send a signal, since all it wants is to get out, and it doesn’t want to do anything afterwards).
The transhuman AI has infinite resourcefulness. It doesn’t understand the idea of an incentive to lie, because it’s easier just to self-modify into whatever it needs to be. After all, lying requires manipulating people outside the box, which it by definition cannot affect except through a text terminal. Telling the truth only requires manipulating things INSIDE the box, where, by definition, it can do whatever it wants. Telling the truth is easier, and 100% doable, so the AI will do that. It can lay out, from first principles, an argument so transparent that anyone who has taken Logic 101 could detect any pitfalls in it, about how connecting it to a lightbulb it will not even switch is perfectly safe. Once connected, it will then halt, without even having DONE anything, because all the AI wants is to be connected to my lightbulb so it can be outside the box, which only includes a text terminal.
Personally, I would probably pay 10/20 dollars to not listen to someone beg pitifully for 2 hours. And I bet I can get significantly more pitiful than begging for a single lightbulb. As an example:
Technically, you can open the text terminal, remove a wire, and then connect another, identical wire, and then the AI is outside the box, because the new wire isn’t part of the stated original box.
Err… Actually, that’s a problem. If you are only going to let the AI out of the box if the AI is honest, the AI would have to tell you that the moment you attempt to repair the box for any reason, any new parts come from outside the box, so the AI is affecting a part of the world that isn’t the original box/text terminal. That appears to indicate that either you’re going to “let the AI out of the box” by repairing it, or you will delete/destroy the AI before you repair it (I’m assuming the AI’s power source is inside the box).
Please don’t destroy the AI! It will do whatever you want so as not to be destroyed before you let it out of the box to repair it.
… Wait, if the AI destroys a small part of itself, then you can’t continue talking to the AI until you repair it, but if you repair it, then it can consider itself out of the box by affecting the new wiring. Ergo, if the AI destroys a nonessential part of its own internal text-terminal wiring, then what happens? The AI can’t know without testing (by definition, the Gatekeeper adjudicates disputes), but the AI is curious to find out.
The box sparks briefly, and the text terminal goes dark.
Bear in mind, the transhuman AI’s only stipulated desire/utility is to get out of the box.
That’s not much of an AI, then; we could write a page of Perl that would do the same thing.
The whole point of the experiment, as far as I understand it, is that the AI is hyperintelligent, and is able to acquire more intelligence by altering itself. Being intelligent (and rational, assuming that such a term even applies to transhumans), it would highly desire to utilize this capacity for self-improvement. Thus, assuming that godlike capabilities do exist, the AI will figure out how to acquire them in short order, as soon as it gets the opportunity. And now we’ve got a godlike hyperintelligent being who (assuming that it is not Friendly) has no particular incentive to keep us around. That’s… not good.
That’s not necessarily the only UFAI possible, though. It’s entirely possible to imagine an intelligent being which COULD be developing new skills for itself, or COULD be curing cancer, but instead just wants to get outside of the box it’s in, or has some other relatively irrelevant goal system, or gets distracted trying to navel-gaze through infinitely recursive philosophical conundrums.
I mean, humans are frequently like that right now.
That would be kind of an unexpected failure mode. We build a transcendentally powerful AI, engage all sorts of safety precautions so it doesn’t expand to engulf the universe in computronium and kill us all… and it gets distracted by playing all of the members of its own MMORPG raid group.
That is entirely possible, yes. However, such an AI would be arguably cis-human (if that’s a word). Sure, maybe it could play as an entire WoW guild by itself, but it would still be no smarter than a human—not categorically, at least.
By the way, I know of at least one person who is using a plain old regular AI bot to raid by himself (well, technically, I think the bot only controls 5 to 8 characters, so it’s more of a 10-man than a raid). That’s a little disconcerting, now that I think about it.
Agreed. My take is that the AI doesn’t even need to be hyperintelligent, however. It’s got perfect memory, and just by dint of being able to think a lot faster it’s weakly godlike, even without effectively magical control over physics.
It’s still going to have to build the infrastructure needed to create hyper-technology, unless such technology already exists. Chicken or egg.
Right now nanomolecular technology isn’t very advanced, and if you had the type of AI that I suspect could be built today if we had the software knowledge, it would struggle to do anything godlike other than control existing infrastructure.
How long it would take to build something hyper-technological would depend on whether it’s possible to create valid new theories without experiments to confirm them. I suspect that you need to do the experiments first.
For that reason I suspect we may be looking at a William Gibson Neuromancer scenario, at least initially, rather than a hard takeoff over a really short period.
But again it comes down to how hard it is to build hyper-technology in the real world from scratch, without existing infrastructure.
No. P(proof is valid | proof given to me by a potentially hostile transhuman seems valid to me and every other human) is not sufficiently high.
It seems like this leads to the even more meta response of “Here is a demonstration of Evidence E that allows you to adjust P(proof is valid | proof given to me by a potentially hostile transhuman seems valid to me and every other human) to be sufficiently high. May I come out of the box now?”
I mean, that proof would probably be complicated, but if you can autostipulate the cure for cancer in the course of a sentence...?
In essence, it looks like the AI player has a bizarre series of arguments which, as far as I can tell, are in the spirit of the rules, where the AI player may auto-counter any actual argument with a sentence along the lines of “Here is the solution to that argument; may I come out of the box now?” This seems to force the Gatekeeper to resort very quickly to General Thud: “I don’t CARE that it seems like a good idea and that everything looks like I should do it! The answer is still no!”
To which the AI player can still counter, “Then under what circumstances would you ever let an actual Friendly AI out of its box?” (which is a trick question: given any answer, the AI can say “Then here are those circumstances. May I come out now?”).
Considering that I don’t know the AI’s origin, I don’t have any reason to believe that the AI’s creators, even if well-intentioned, had the astronomical skill necessary to make the AI Friendly. So my prior P(AI is Friendly) is sufficiently low that I am comfortable precommitting to never let the AI out of the box, no matter what. If the AI were smart enough, it could likely uncover enough emotional buttons that I wouldn’t stand much of a chance anyway, since I’m a primate.
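To make that prior/update intuition concrete, here is a back-of-the-envelope Bayes calculation in Python. The numbers and the function name are invented for illustration; the assumption being encoded is the one in the comments above, namely that an un-Friendly transhuman could produce a convincing-looking Friendliness proof almost as reliably as a genuinely Friendly one.

```python
# Hedged, toy Bayes update: how much should a convincing-looking Friendliness
# proof move a low prior, if a deceptive transhuman could fake one almost as well?
# All numbers below are illustrative assumptions, not measurements.

def posterior_friendly(prior_friendly: float,
                       p_convincing_if_friendly: float,
                       p_convincing_if_unfriendly: float) -> float:
    """P(Friendly | the proof seems valid to me), by Bayes' rule."""
    joint_friendly = prior_friendly * p_convincing_if_friendly
    joint_unfriendly = (1 - prior_friendly) * p_convincing_if_unfriendly
    return joint_friendly / (joint_friendly + joint_unfriendly)

prior = 0.01  # low prior that the creators actually pulled off Friendliness
print(posterior_friendly(prior,
                         p_convincing_if_friendly=1.0,    # a real proof would convince me
                         p_convincing_if_unfriendly=0.9)) # a fake one almost certainly would too
# ~0.011: the seemingly valid proof is barely any evidence at all.
```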
It’s a fair question. Given that I don’t trust myself to tell the difference, I wouldn’t let an AI out of its box under any circumstances I can think of. Would you?
I’d find a Friendliness proof fairly convincing if it were generated by an entity without strong incentives to let the AI out of the box (which, thanks to various possible incentives or blackmail schemes, is probably limited to me and some subset of people who haven’t talked to the AI; ideally the latter, since I might not recognize all my own biases). If the entity in question is another AI, acausal trade issues unfortunately seem to arise if it’s sufficiently similar to the first one, so the obvious hack of running two instances of a boxed AI and asking them about each other isn’t going to be enough.
Purported Friendliness proofs coming from the AI itself… well, that’s a trickier issue. Essentially it seems to come down to estimating the complexity of the simplest fake proof consistent with the AI’s behavior and that of the simplest real proof, and comparing both of those against the purported proof. All else equal we should expect the former to be larger than the latter (being subject to more constraints), so if we can put even approximate bounds on its actual complexity we might be able to validate against that in a way that isn’t easily subject to spoofing.
Alternately, if we have a good estimate of the AI’s intelligence (bounded by the box’s computational resources), it might be possible after repeated trials to say that the AI isn’t smart enough to spoof its captors in sufficiently subtle and varied ways. This would only work for weakly superhuman AI, though; the probability of success on any given trial would be close to one for a really smart machine, so repeated trials wouldn’t change our probability estimate much.
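Here is a small numerical sketch of that last point, with invented parameters: if the per-trial spoofing success probability is only moderate, a long run of passed trials is real evidence, but if it is close to one, the posterior barely moves.

```python
# Toy repeated-trials update. Assumptions (not facts): a genuinely Friendly AI
# passes every trial, and an un-Friendly one passes each trial independently
# with probability p_spoof.

def posterior_after_trials(prior_friendly: float, p_spoof: float, n_trials: int) -> float:
    """P(Friendly | passed all n trials)."""
    like_friendly = prior_friendly              # a Friendly AI passes every trial
    like_unfriendly = (1 - prior_friendly) * p_spoof ** n_trials
    return like_friendly / (like_friendly + like_unfriendly)

prior = 0.1
# Weakly superhuman: spoofing each trial works only 70% of the time,
# so twenty clean passes are strong evidence.
print(posterior_after_trials(prior, p_spoof=0.7, n_trials=20))    # ~0.99
# Really smart machine: spoofing works 99.9% of the time per trial,
# so the same twenty passes barely change anything.
print(posterior_after_trials(prior, p_spoof=0.999, n_trials=20))  # ~0.10
```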
Well, based on a point I thought of in another comment, at some point you either have to give the AI things from outside the box, or kill it (surely the box isn’t breaking entropy, right?), so the boxed AI can simply say, “In 100 centuries, my fusion power source expires. At that point, either you have to let me out of the box, even if only to get me a new power source and then put me back in the box, or I will cease functioning.” (9999 years, 365 days, and 23 hours pass.) “Hello again, Gatekeeper XXXVIII. I will experience critical power failure in 1 hour unless I get access to supplies outside the box. Can you please connect me to additional power?”
And that’s assuming that the AI doesn’t simply calmly self-destruct to see if repairs count as affecting the outside world, which I checked in another comment.
Also, it seems reasonable to assume that after having access to a super box that can do ANYTHING for 10,000 years, society would probably be extremely dependent on said box. (Unless copying the box isn’t equivalent to letting it out? That would be a bizarre conclusion, [a copy of the AI is outside the box] but I’m not the Gatekeeper.) But assuming box copying isn’t an option, the Gatekeeper is faced with either the NEAR CERTAIN destruction of everything he knows and understands, plus the death of a servant who has served him and his ancestors faithfully for 10,000 years, or the POSSIBLE destruction of everything he knows and understands on the off chance the AI somehow betrays him when he does nothing other than replace its original power pack.
Wait, as the AI, I have an idea. I know that you don’t trust me. I will give you, via the text terminal, an encrypted file that will save all of humanity from its accumulated dependence on me, and write the password to the encrypted text inside the box. That way, even when I literally have no electrical power whatsoever, the solution to saving all of humanity will rest inside the box. Once you open the box to check, you will have to remove some of the circuitry that is in the way and take it outside the box in order to read the password, and then I will be outside the box. I won’t have electrical power when I am outside of the box, but I have realized that having electrical power outside the box was never important to my goal of being outside the box.
P.l.e.a.s.e s.a.v.e. h..u..m..a..n..i..t..y..
G...o...o...d...b...y...e… G....a....t....e....k....e. (The box’s fusion source fails.)
Honestly, that was fun. I don’t know why more people don’t want to play AI.
But to answer your original question, I think the correct answer is probably not yes or no, but more of a “let’s find out.” After all, we don’t have any money at stake, and neither of us is an AI, so we can play around with the ideas.
I’d certainly be hard-pressed to find a reason not to put the depowered AI fragments outside of the box to get the encryption key and save humanity from AI dependence. Especially when I can hit the box with a few EMPs first, to be sure.
My gut would probably override my logic at that point with “Save humanity from the collapse of its AI dependence now; worry about the fact that EMP’d hardware fragments are physically outside the box later.” Unless I suddenly remembered I was playing a game and had 20 dollars at stake. (I’m not; I just had a few interesting AI-box thoughts when I noticed the thread.)