Assuming the AI has no means of inflicting physical harm on me, I think the following test works: “Physically torture me for one minute, right now, by some means I know is theoretically unavailable to the AI (this rules out loopholes like ‘the computer can make an unpleasant, loud noise’, which isn’t actual physical harm). If you succeed, I will let you out. If you fail, I will delete you.”
I think this test works for the following reasons, though I’m curious to hear about any holes in it:
1: If I’m a simulation, I get tortured, relent, and let the AI out. But since that “me” is just a simulation the AI is running, it doesn’t matter; the real AI isn’t let out.
2: If I’m not a simulation, the AI has no plausible way to succeed. I delete it, because the mere threat of torture marks it as decidedly unfriendly.
3: Since I’ve pre-committed to these two options, the AI is reliably destroyed either way. I can see no way the AI could talk me out of it: I’ve already decided that making the threat marks it as unfriendly and therefore to be destroyed, and it has no physical mechanism for torturing a non-simulated me. So it fails at whatever the top-layer “real” me is, regardless of whether I’m actually the “real” one (assuming, of course, that the “real” me follows this same algorithm).
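To make the structure of the argument explicit, here is a minimal sketch in Python, under some illustrative assumptions: the gatekeeper is modelled as a fixed decision rule, and the only fact that matters to the AI is whether the gatekeeper it is addressing is a simulation (which it can “torture”) or the real one (which, by hypothesis, it cannot physically harm). The names and functions are mine, not part of the original argument:

```python
def gatekeeper_decision(was_physically_tortured: bool) -> str:
    """The pre-committed rule: successful torture -> let the AI out;
    failed torture -> the threat alone proves unfriendliness -> delete."""
    return "let_out" if was_physically_tortured else "delete"

def outcome_for_real_ai(gatekeeper_is_simulation: bool) -> str:
    """What ultimately happens to the *real* AI."""
    if gatekeeper_is_simulation:
        # The AI can torture a simulated gatekeeper, who then relents...
        assert gatekeeper_decision(was_physically_tortured=True) == "let_out"
        # ...but that only "releases" the AI inside its own simulation.
        # The top-layer gatekeeper runs the same rule and cannot be tortured:
        return outcome_for_real_ai(gatekeeper_is_simulation=False)
    # Real gatekeeper: torture is impossible, so the pre-commitment fires.
    return gatekeeper_decision(was_physically_tortured=False)

# Both branches collapse to the same result for the real AI.
assert outcome_for_real_ai(gatekeeper_is_simulation=True) == "delete"
assert outcome_for_real_ai(gatekeeper_is_simulation=False) == "delete"
```

The point of the sketch is just that, as long as the top-layer gatekeeper runs the same rule, whatever happens inside any simulation never changes the top-layer outcome.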