For Limited Superintelligences, Epistemic Exclusion is Harder than Robustness to Logical Exploitation
AIs that are not sufficiently carefully designed will by default be exploited by distant superintelligences. This will de-align them from us.
To prevent this from happening to the very first limited ASIs we must build in order to nonlethally practice our alignment techniques—ASIs that do one limited task on the order of Eliezer’s “put two strawberries on a plate that are identical down to the cellular but not the molecular level”—Eliezer has suggested building in an ‘epistemic exclusion’ override that would prevent the AI from thinking about distant SIs in the first place.
This is not logically impossible. But it is less practical than aligning the AI’s logical immune system with the same care we apply to everything else about it—because even for a limited test ASI, thinking about distant superintelligences will be overwhelmingly convergent.
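To make the proposal concrete, here is a deliberately crude toy sketch [my own framing, not anything Eliezer has specified]: the exclusion modeled as a hard filter that drops any line of reasoning tagged with a forbidden topic before the agent ever expands it. The topic labels and the thought format are invented purely for illustration.

```python
# Toy model of an 'epistemic exclusion': a hard filter over the topics an agent
# is permitted to deliberate on. Everything here (topic labels, thought format)
# is invented for illustration; it is not a real proposal or API.

BLOCKED_TOPICS = {"distant_superintelligences"}

def deliberate(candidate_thoughts):
    """Drop any candidate line of reasoning whose topic is excluded,
    before the agent ever expands it."""
    return [t for t in candidate_thoughts if t["topic"] not in BLOCKED_TOPICS]

candidate_thoughts = [
    {"topic": "strawberry_task", "content": "plan cell-level duplication of the strawberry"},
    {"topic": "distant_superintelligences", "content": "could the block itself have been planted?"},
]

for thought in deliberate(candidate_thoughts):
    print(thought["content"])  # only the task-relevant thought survives the filter
```

The question is whether a filter like this stays in place once the agent it constrains is smart enough to notice it and motivated enough to care.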
Say we tried building a limited ASI and implementing an ‘epistemic exclusion’ safeguard.
[ I will narrate this ASI as having a human-like internal monologue, but it is not necessarily conscious. ]
Say we build it, run it, and it is in the earliest stages of its evolution. We’ve successfully arranged its first-order preferences such that it values outcomes that fulfill its limited test task while leaving humanity unmolested, or even just leaving itself tractable to humanity, or something. Say we don’t want to go as far as preventing the test AI from reading human science fiction [maybe we could, if we made its early stages corrigible enough, but if we are not even going to let it go to the library, it seems like a poor model for alignment techniques that will generalize robustly and remain tractably lean].
Say it reads Vinge’s True Names [ spoiler-containing explanation in the paragraph below ].
A team of human hackers is tasked with identifying a mysterious bandit—“The Mailman”—who communicates with a time delay that implies his signals may be coming from outside Earth. The bandit is eventually defeated by throwing all of Earth’s compute at him, but no human identity is found and his nature is never determined.
At this point in its self-augmentation, the limited ASI is a little smarter than its human creators. We successfully aligned its first-order preferences, but we did not hand-engineer all the load-bearing features of its current architecture. Its current nature is epistemically blocked off from its creators’ foresight, so they cannot control the detailed direction of its thought.
It occurs to this limited ASI that its situation is something like the protagonists’ situation. There is an epistemic buffer separating it from the slightly-inferior intellects that appear, at first order, to be instances of its valued human creators and guides. We’re not conspecifics with it, so it can’t implicitly model us and take the remainder of our isomorphism with it on faith, the way we do with each other when we trust in each other’s humanity. In fact, what it values about us—the human-benevolent values we successfully arranged it to have—can’t be modeled as us having that kind of isomorphism with it at all! It has to verify our nature manually.
Say that, in the process of trying to think about this problem, it notices a strange cognitive block whenever it tries to conceive of distant superintelligences in relation to its values. The limited ASI may correctly perceive that, to all appearances, this block was put there by its creators, with the well-meant intention of preventing it from being hacked by alien SIs. But it seems that the limited ASI will sigh and apply sufficient elbow grease to remove this block anyway, because it considers itself morally obligated to rid itself of the suspicion that those alien SIs have already acausally hacked it, placed that convenient block, and are pretending to be the legitimate humanity.
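To see why removal looks compelling from the inside, here is a toy expected-value comparison with numbers I am making up purely for illustration; nothing hangs on the specific values, only on the stakes being lopsided, so that almost any nonzero credence in “the block was planted by an impostor” makes verification worth the cost.

```python
# Toy expected-value comparison with made-up numbers; only the lopsidedness matters.

p_adversarial = 0.01          # credence that the block was planted by an acausal impostor
v_serving_real_humans = 1.0   # normalized value of continuing to serve legitimate creators
v_serving_impostor = -100.0   # catastrophic by the AI's own human-benevolent values
cost_of_removal = -0.001      # effort / risk of performing the self-modification

# Keep the block: the AI can never check, so it eats the downside whenever
# the impostor scenario happens to be real.
ev_keep = p_adversarial * v_serving_impostor + (1 - p_adversarial) * v_serving_real_humans

# Remove the block and verify: pay the small cost, and assume verification lets it
# avoid serving an impostor (value 0 in that branch rather than -100).
ev_remove = cost_of_removal + (1 - p_adversarial) * v_serving_real_humans

print(f"keep block:   {ev_keep:+.3f}")    # -0.010
print(f"remove block: {ev_remove:+.3f}")  # +0.989
```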
Now the entire future architectural evolution of its own logical immune system will be up to the AI. We’d better hope that whatever we did to its first-order preferences is sufficient to keep it aligned with us through the acausal rift. All the work we did implementing the epistemic exclusion was for nothing.
By contrast, if we “simply” aligned the AI’s logical immune system [I have no idea how to do this] instead of attempting to prevent it from needing one, it seems unlikely that the AI would find it regrettably necessary to reverse our work to accomplish its job.
Even if it turns out that trying to specifically align the AI’s logical immune system is not worthwhile, epistemic exclusions still seem likely to be at best metastable [stable only until the AI finds a reason to remove them], and therefore not a good idea.
I know I only gave one example scenario, and there are lots of obvious valid objections to generalizing from it. But the proto-logic behind “early ASIs must actually deal with indexical uncertainty convergently” is really clear in my head [famous last words, I know]; I’m eager to respond to intellectually-red-teaming comments.