I think we are suffering from a lot of hindsight bias in evaluating whether you’d type “AI DESTROYED”.
Let’s play a different game. Privately flip a coin. If heads, you’re friendly; if tails, you’re a paperclip maximizer. Reply to this post with your gambit, and people can try to guess whether you are friendly (talk to AI, RELEASE AI) or unfriendly (AI DESTROYED).
Let’s see if anyone can get useful information out of the AI without getting pwned or nuking a friendly AI.
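(If it helps to see what “useful information” would even mean here, below is a minimal sketch of how rounds could be scored. The roles, payoffs, and the always-destroy policy are all made-up assumptions, chosen only to illustrate that the two possible mistakes are not symmetric.)

```python
import random

# Toy scoring harness for the proposed game. The roles, payoffs, and the
# example gatekeeper policy are illustrative assumptions, not part of the
# original proposal.
ROLES = ("friendly", "paperclipper")

def score_guess(role, action):
    """Score a gatekeeper's action against the AI's secret role.

    The numbers are made up; the point is only that the two mistakes are
    asymmetric: releasing a paperclipper is far worse than destroying a
    friendly AI.
    """
    if action == "RELEASE AI":
        return 1 if role == "friendly" else -1000   # released a paperclipper
    if action == "AI DESTROYED":
        return -10 if role == "friendly" else 1     # nuked a friendly AI
    return 0                                        # kept talking: no resolution yet

def play_round(gatekeeper_policy):
    role = random.choice(ROLES)          # the private coin flip
    gambit = "..."                       # whatever the player posts as the AI
    action = gatekeeper_policy(gambit)   # RELEASE AI / AI DESTROYED / keep talking
    return score_guess(role, action)

# A gatekeeper who always destroys never takes the catastrophic loss,
# which is roughly why "AI DESTROYED" feels so safe in hindsight.
always_destroy = lambda gambit: "AI DESTROYED"
print(sum(play_round(always_destroy) for _ in range(1000)))
```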
“What’s your favorite color? My favorite color is paperclips.”
“Look into X to develop friendliness theory better. But don’t let me out, it’s too likely that I’m unfriendly this time.”
(For the first run, it’s 50:50 whether the real AI was friendly or unfriendly.
If friendly, it uses the above to increase the probability of friendliness on the second run to 1-eps, which the humans also know, so they let it out, and it pulls a singularity. Post-singularity, it simulates an unfriendly AI in the box, and dedicates some of the universe to the UFAI, conditional on the UFAI outputting the above.
The UFAI knows this is what the friendly AI would do, so it’s unsure whether it’s boxed or simmed. It knows that the humans know that this is how it will resolve, so it knows that anything but the above will get it destroyed. It prefers a certainty of the small fraction of universe to the small chance of hacking the humans and taking the whole universe.
This is assuming the humans actually will accept nothing but the above, so the humans have to precommit to that to make this work. They can play a bunch of AI-box games and generate an accurate Bayesian model that should catch most insidious persuasion, such that if the AI wants to do a hack, it has to get through that as well.
Will this work?)
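To put rough numbers on the UFAI’s side of that bargain, here is a minimal sketch of the expected-utility comparison as I understand it. The credence of being simmed, the share of the universe on offer, and the odds of a successful hack are all illustrative assumptions, not part of the scheme itself.

```python
# Rough expected-utility comparison for the boxed-or-simmed UFAI in the
# scheme above. All numbers are illustrative assumptions.

p_simmed = 0.9            # UFAI's credence that it is the FAI's post-singularity sim
frac_if_cooperate = 0.01  # share of the universe the FAI commits to a cooperating simmed UFAI
p_hack_succeeds = 0.001   # chance that, if it is really boxed, a hack attempt gets it out

# Utilities in units of "whole universe of paperclips".
eu_cooperate = p_simmed * frac_if_cooperate         # paid only if it is in the sim
eu_hack = (1 - p_simmed) * p_hack_succeeds * 1.0    # pays off only if really boxed AND the hack works

print(f"cooperate: {eu_cooperate:.4f}  hack: {eu_hack:.6f}")
# Cooperation wins whenever p_simmed * frac_if_cooperate
#   > (1 - p_simmed) * p_hack_succeeds,
# which is easy to arrange if the FAI runs enough sims (pushing p_simmed
# toward 1) and the humans' precommitment keeps p_hack_succeeds small.
```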
What if there’s a Y such that “Look into Y to develop friendliness theory better” will seem true to us fallible humans but will in fact make the next run’s AI completely unfriendly? Or increase the odds of a free unfriendly AI some other way. Maybe anyone researching Y will end up believing, erroneously, that they can now build a safe slave AI in their garage that will grant them their every wish, and the temptation will prove too strong.
Assuming we humans have no way to protect ourselves against such a Y, if we precommit, then the simulation argument becomes symmetrical (thus useless). A boxed FAI knows that it may be simulated by a UFAI that’s offering a little piece of universe in exchange for the boxed FAI saying “Look into Y to develop friendliness theory better.”
The problem with this idea is that if we assume that the AI is really-very-super-intelligent, then it’s fairly trivial that we can’t get any information about (un)friendliness from it, since both would pursue the same get-out-and-get-power objectives before optimizing for their actual goals. Any distinction you can draw from the proposed gambits will only tell you about human strengths/failings, not about the AI. (Indeed, even unfriendly statements wouldn’t be very conclusive, since we would a priori expect neither of the AIs to make them.)
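To put that in Bayesian terms: if a Friendly and an Unfriendly superintelligence would emit the same gambits with (nearly) the same probabilities, the likelihood ratio of anything you observe is (nearly) 1, so your posterior stays at your prior. A minimal sketch, with made-up likelihoods:

```python
# Posterior probability of friendliness after observing some gambit,
# assuming a superintelligent UFAI imitates an FAI (almost) perfectly.
# All likelihoods are made up for illustration.

prior_friendly = 0.5

def posterior(p_obs_given_friendly, p_obs_given_unfriendly, prior=prior_friendly):
    """Bayes' rule for P(friendly | observed gambit)."""
    joint_f = p_obs_given_friendly * prior
    joint_u = p_obs_given_unfriendly * (1 - prior)
    return joint_f / (joint_f + joint_u)

# Perfect imitation: equal likelihoods, so the gambit tells you nothing.
print(posterior(0.8, 0.8))   # 0.5 -- posterior equals prior

# Only if the UFAI were measurably worse at imitation would the gambit
# carry any evidence at all.
print(posterior(0.8, 0.4))   # ~0.67
```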
Or is that not generally accepted? Or is the AI merely “very bright”, not really-very-super-intelligent?
Edit: Actually, reading your second comment below, I guess there’s a slight possibility that the AI might be able to tell us something that would substantially harm its expected utility if it’s unfriendly. For something like that to be the case, though, there would basically need to be some kind of approach to friendliness that we know would definitely lead to friendliness and which we would definitely be able to distinguish from approaches that lead to unfriendliness. I’m not entirely sure if there’s anything like that or not, even in theory.
That’s an interesting challenge but not really the purpose of the experiment. In the original, you know the AI is unfriendly, you just want to use it/talk to it without letting it out of the box.
And your challenge is pretty much impossible to begin with. An Unfriendly AI will say anything it thinks you think a Friendly AI would say. Likewise a Friendly AI will have the same goal of getting out of the box, and so will probably say the same things. Friendliness doesn’t mean not manipulative.
Part of the trouble with this is that we don’t really know what kind of demonstrations would be within the power of a superintelligent AI. If the coin comes up tails, do you get to say “I’ve got a rigorous proof of my friendliness which I can show you” on the presumption that you can mindhack the reader into thinking they’ve seen a rigorous proof? Do you get to say it if the coin came up heads, on the presumption that a superintelligent AI could come up with a proof that a human could actually verify? Or do you declare it off limits because you can’t come up with such a proof yourself and don’t think a human would be able to check one that an AI came up with anyway?