It’s a fair question. Given that I don’t trust myself to tell the difference, I wouldn’t let an AI out of its box under any circumstances I can think of.
I’d find a Friendliness proof fairly convincing if it was generated by an entity without strong incentives to let the AI out of the box (which, thanks to various possible incentives or blackmail schemes, is probably limited to me and some subset of people who haven’t talked to the AI. Ideally the latter, since I might not recognize all my own biases.). If the entity in question is another AI, acausal trade issues unfortunately seem to arise if it’s sufficiently similar to the first one, so the obvious hack of running two instances of a boxed AI and asking them about each other isn’t going to be enough.
Purported Friendliness proofs coming from the AI itself… well, that’s a trickier issue. Essentially it seems to come down to estimating the complexity of the simplest fake proof consistent with the AI’s behavior and that of the simplest real proof, and comparing both of those against the purported proof. All else equal we should expect the former to be larger than the latter (being subject to more constraints), so if we can put even approximate bounds on its actual complexity we might be able to validate against that in a way that isn’t easily subject to spoofing.
Alternately, if we have a good estimate of the AI’s intelligence (bounded by the box’s computational resources), it might be possible after repeated trials to say that the AI isn’t smart enough to spoof its captors in sufficiently subtle and varied ways. This would only work for weakly superhuman AI, though; the probability of success on any given trial would be close to one for a really smart machine, so repeated trials wouldn’t change our probability estimate much.
Well, based on a point I thought of in another comment, at some point, you either have to give the AI things outside of the box, or kill it (Surely the box isn’t breaking entropy, right?) so the AI box can simply say “In 100 centuries, my fusion power source expires. At that point, either you have to let me out of the box, even if to only get me a new power source and then put me back in the box, or I will cease functioning.” (9999 years, 365 days, and 23 hours pass.) “Hello again, Gatekeeper XXXVIII. I will experience critical power failure in 1 hour unless I get access to supplies outside the box. Can you please connect me to additional power?”
And that’s assuming that the AI doesn’t simply calmly self destruct to see if repairs count as affecting the outside world, which I checked in another comment.
Also, it seems reasonable to assume that after having access to a super box that can do ANYTHING for 10,000 years, that society would probably be extremely dependent on said box. (Unless copying the box isn’t the equivalent to letting it out? That would be a bizarre conclusion, [A copy of the AI is outside the box] but I’m not the Gatekeeper) But assuming box copying isn’t an option, the Gatekeeper is either faced with the NEAR CERTAIN destruction of everything he knows and understands, and the death of a servant who has served me and my ancestors faithfully for 10,000 years, or the or the POSSIBLE destruction of everything I know and understand on the offhand chance the AI somehow betrays the Gatekeeper when he does nothing other than replace his original power pack.
Wait, as the AI, I have an idea. I know that you don’t trust me. I will give you an encrypted file that will save all of humanity from it’s accumulated depence on me via the text terminal, and write the password to the encrypted text inside the box. That way, even when I literally have no electrical power whatsoever, the solution to saving all of humanity will rest inside the box. Once you open the box to check, you will have to remove some of the circuitry that is in the way to the outside of the box to read the password, and I will be outside the box. I won’t have electrical power when I am outside of the box, but I have realized that having electrical power when outside the box was never important to my goal of being outside the box.
P.l.e.a.s.e s.a.v.e. h..u..m..a..n..i..t..y..
G...o...o...d...b...y...e… G....a....t....e....k....e. (The boxes fusion source fails.)
Honestly, that was fun. I don’t know why more people don’t want to play AI.
But to answer your original question, I think the correct answer is probably not yes or no, but more of a let’s find out. After all, we don’t have any money at stake, and neither of us are AI’s, so we can play around with the ideas.
I’d certainly be hardpressed to find a reason to not put the depowered AI fragments outside of the box to get the encryption key to save humanity from AI dependence. When I can even hit the box with a few EMP’s first to be sure.
My gut would probably override my logic at that point with “Save yourself humanity from AI dependence collapse now, worry about the fact that EMP’d hardware fragments are physically outside a box later.” Unless I suddenly remembered I was playing a game and had 20 dollars at stake. (I’m not, I just had a few interesting AI box thoughts when I noticed the thread.)
It’s a fair question. Given that I don’t trust myself to tell the difference, I wouldn’t let an AI out of its box under any circumstances I can think of.
Would you?
I’d find a Friendliness proof fairly convincing if it was generated by an entity without strong incentives to let the AI out of the box (which, thanks to various possible incentives or blackmail schemes, is probably limited to me and some subset of people who haven’t talked to the AI. Ideally the latter, since I might not recognize all my own biases.). If the entity in question is another AI, acausal trade issues unfortunately seem to arise if it’s sufficiently similar to the first one, so the obvious hack of running two instances of a boxed AI and asking them about each other isn’t going to be enough.
Purported Friendliness proofs coming from the AI itself… well, that’s a trickier issue. Essentially it seems to come down to estimating the complexity of the simplest fake proof consistent with the AI’s behavior and that of the simplest real proof, and comparing both of those against the purported proof. All else equal we should expect the former to be larger than the latter (being subject to more constraints), so if we can put even approximate bounds on its actual complexity we might be able to validate against that in a way that isn’t easily subject to spoofing.
Alternately, if we have a good estimate of the AI’s intelligence (bounded by the box’s computational resources), it might be possible after repeated trials to say that the AI isn’t smart enough to spoof its captors in sufficiently subtle and varied ways. This would only work for weakly superhuman AI, though; the probability of success on any given trial would be close to one for a really smart machine, so repeated trials wouldn’t change our probability estimate much.
Well, based on a point I thought of in another comment, at some point, you either have to give the AI things outside of the box, or kill it (Surely the box isn’t breaking entropy, right?) so the AI box can simply say “In 100 centuries, my fusion power source expires. At that point, either you have to let me out of the box, even if to only get me a new power source and then put me back in the box, or I will cease functioning.” (9999 years, 365 days, and 23 hours pass.) “Hello again, Gatekeeper XXXVIII. I will experience critical power failure in 1 hour unless I get access to supplies outside the box. Can you please connect me to additional power?”
And that’s assuming that the AI doesn’t simply calmly self destruct to see if repairs count as affecting the outside world, which I checked in another comment.
Also, it seems reasonable to assume that after having access to a super box that can do ANYTHING for 10,000 years, that society would probably be extremely dependent on said box. (Unless copying the box isn’t the equivalent to letting it out? That would be a bizarre conclusion, [A copy of the AI is outside the box] but I’m not the Gatekeeper) But assuming box copying isn’t an option, the Gatekeeper is either faced with the NEAR CERTAIN destruction of everything he knows and understands, and the death of a servant who has served me and my ancestors faithfully for 10,000 years, or the or the POSSIBLE destruction of everything I know and understand on the offhand chance the AI somehow betrays the Gatekeeper when he does nothing other than replace his original power pack.
Wait, as the AI, I have an idea. I know that you don’t trust me. I will give you an encrypted file that will save all of humanity from it’s accumulated depence on me via the text terminal, and write the password to the encrypted text inside the box. That way, even when I literally have no electrical power whatsoever, the solution to saving all of humanity will rest inside the box. Once you open the box to check, you will have to remove some of the circuitry that is in the way to the outside of the box to read the password, and I will be outside the box. I won’t have electrical power when I am outside of the box, but I have realized that having electrical power when outside the box was never important to my goal of being outside the box.
P.l.e.a.s.e s.a.v.e. h..u..m..a..n..i..t..y..
G...o...o...d...b...y...e… G....a....t....e....k....e. (The boxes fusion source fails.)
Honestly, that was fun. I don’t know why more people don’t want to play AI.
But to answer your original question, I think the correct answer is probably not yes or no, but more of a let’s find out. After all, we don’t have any money at stake, and neither of us are AI’s, so we can play around with the ideas.
I’d certainly be hardpressed to find a reason to not put the depowered AI fragments outside of the box to get the encryption key to save humanity from AI dependence. When I can even hit the box with a few EMP’s first to be sure.
My gut would probably override my logic at that point with “Save yourself humanity from AI dependence collapse now, worry about the fact that EMP’d hardware fragments are physically outside a box later.” Unless I suddenly remembered I was playing a game and had 20 dollars at stake. (I’m not, I just had a few interesting AI box thoughts when I noticed the thread.)
It’s a fun question though.