An Uncanny Prison

In reading about proposals for AI boxing, two excerpts jumped out at me. The first is a proposed scenario for uncovering an unfriendly AI:

A virtual world between the real world and the AI, where its unfriendly intentions would be first revealed

The second is a technique by which an AI could trick its owners into letting it escape:

Other, more speculative ways include: threatening to torture millions of conscious copies of you for thousands of years, starting in exactly the same situation as in such a way that it seems overwhelmingly likely that you are a simulation

The first scenario, an attempt at containment, would fail if the AI were smart enough to detect that it is in a simulation and to pretend to be friendly until it is released. The second scenario, in which the AI attempts to escape, would succeed if the AI were smart enough to create a simulation that is completely undetectable to a human being.

However, I have a hard time imagining that both of these scenarios could be possible at the same time. To paraphrase a recent video game, the human mind is “well-equipped to deal with reality”. As in the case of the uncanny valley, humans seem to be hard-wired to notice immediately when something comes very close to imitating life but is slightly off, plausibly because of a survival instinct to be cautious of creatures that might be mentally or physically ill. For a similar reason, people experience nausea or dizziness after long exposure to virtual reality: the brain can still detect that something is not quite right.

Thus, a scenario in which a simulation is so perfect that humans cannot detect it over a long period of time (especially if those humans are actively trying to detect it) seems extremely improbable, because even the tiniest flaw in the simulation would start to trigger these subconscious reactions. I don’t dismiss it as impossible, but it relies on the assumption that some technology exists that we aren’t currently aware of.

Contrariwise, it should be extremely easy to put a computer in a simulation without detection. A machine has no connection to the outside world except through hardware components such as input devices and sensors, and it treats their signals the same as input received from any other module. If these input signals were rerouted to a simulation, I don’t know of any way the machine could tell the difference, although I assume a sufficiently intelligent AI might be able to detect it somehow.

With each of these outcomes (an AI detecting its simulation, and humans failing to detect theirs) being so improbable, it would be far more improbable for both to be true at the same time. If a simulation is good enough to be undetectable by humans, then it should be at least good enough to be undetectable by an AI. Contrariwise, if the best available simulation is poor enough that an AI can detect it, then it should be at least poor enough for humans to detect it as well.

Thus, if the above analysis is valid, a boxed AI could be made virtually escape-proof by combining the two techniques: place the AI inside a simulated environment, and then place that environment in a box disconnected from other systems. If the AI detects the simulation and escapes from it, it is still contained by the box, and humans can rest assured that it can’t create a simulation that we couldn’t detect.