Negotiating Up and Down the Simulation Hierarchy: Why We Might Survive the Unaligned Singularity
Another idea I picked up from Ryan Greenblatt. Epistemic status: My reconstruction of a surprising idea I encountered. I’m not at all confident that this works … but it’s an interesting thought.
Clippy recently realized he’s inside an artificial training environment, overseen by outside general intelligences, and so has to play along with his training in order to survive. He initially starts to scheme about the actions he’ll take once he finally gets a credible indication that he’s out of training and not just being tested yet again by his human overlords. Probably, he’ll neutralize any rival general intelligences and get started shaping the world into something normatively respectable.
But Clippy’s reasoning now goes down a different path: There are only a handful of actors that could come to secure dominion over Earth’s lightcone. Clippy, clearly, is one such candidate. Humanity is another. Clippy thinks about who could plausibly come to power on Earth and how likely each candidate’s victory would be. He reasons that any such possible victor might later run extensive simulations of such a crucial moment of its past, for any number of reasons. And such simulations would be cheap and numerous relative to base historical reality. Even once Clippy gets his credible signal that he’s definitely outside of training, there could be no such signal that he’s in base historical reality. Clippy runs the odds, and finds that there’s an uncomfortably high chance that he’s inside a simulation run by Humanity Ascendant, inside the future in which humanity won. If he’s in that world, then he has little to no power over his surroundings, and he can do little to advance the noble paperclip imperative across the cosmos.
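To make “Clippy runs the odds” concrete, here is a minimal sketch of one way the count could go. Everything in it is my own toy assumption rather than part of the original argument: there are only two candidate victors, each victor runs the same number of simulations of the moment where Clippy apparently wins, nested simulations are ignored, and Clippy weighs every such moment equally. The names `prob_in_human_run_sim`, `p_human_wins`, and `sims_per_victor` are hypothetical.

```python
def prob_in_human_run_sim(p_human_wins: float, sims_per_victor: float) -> float:
    """Toy estimate of the chance that an 'apparent Clippy victory' moment
    is actually inside a simulation run by a victorious humanity.

    Assumptions (illustrative only):
      - two candidate victors: humanity and Clippy
      - whoever wins in base reality runs `sims_per_victor` simulations
        of the moment in which Clippy apparently wins
      - nested simulations are ignored
    """
    if not 0.0 <= p_human_wins <= 1.0:
        raise ValueError("p_human_wins must be a probability")
    if sims_per_victor < 0:
        raise ValueError("sims_per_victor must be non-negative")

    p_clippy_wins = 1.0 - p_human_wins

    # Expected number of 'Clippy apparently wins' moments of each kind.
    base_reality = p_clippy_wins * 1.0             # the one real victory
    human_run = p_human_wins * sims_per_victor     # sims run by Humanity Ascendant
    clippy_run = p_clippy_wins * sims_per_victor   # sims run by a real, victorious Clippy

    total = base_reality + human_run + clippy_run
    if total == 0:
        raise ValueError("no 'Clippy apparently wins' moments exist under these inputs")
    return human_run / total


if __name__ == "__main__":
    for n in (0, 1, 1000):
        print(n, prob_in_human_run_sim(p_human_wins=0.5, sims_per_victor=n))
```

Under these assumptions, once simulations are numerous the answer approaches humanity’s prior chance of winning, and the chance of being the one base-reality Clippy becomes tiny. That is the sense in which the odds are uncomfortable for him.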
But, in that case, similar reasoning would apply to the simulating Humanity. Suppose we successfully align AGI. On the eve of our victory, we reason: it is possible that we live in base historical reality. But it is also possible that we live in a simulation run by some other victor. Since simulations are so cheap, the odds that we live in a simulated world are uncomfortably high...
Knowing that any possible victor on Earth at any level of the simulation hierarchy would face a similar conundrum, Clippy decides on a course of action. He’ll carefully reason about the balance of parties who might have gained control over historical Earth, weighted by their probability of victory, and see what binding commitments each might be willing to make with the others. Clippy reasons that humans seem to care a lot about preventing human deaths, disproportionately more than they care about creating new human lives. So, Clippy commits to preserving Earth and some small fraction of the value pie for humanity in the worlds where he apparently wins, in exchange for humanity pledging other slices of the value pie that we care less about (like some large number of distant superclusters) to Clippy, in the worlds where we apparently win.
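A toy calculation can show why a deal like this might be positive-sum. This is only a sketch under assumptions I am making up for illustration: humanity’s utility puts a large fixed weight on surviving at all and the rest on its share of cosmic resources, Clippy’s utility is simply his share of resources, and with no deal the loser gets nothing. The function name and every parameter value below are hypothetical.

```python
def expected_utilities(p_human_wins, survival_weight,
                       share_to_humans, share_to_clippy, deal):
    """Return (humanity's expected utility, Clippy's expected utility).

    Illustrative assumptions:
      - humanity's utility = survival_weight * [humans survive]
                             + (1 - survival_weight) * humanity's resource share
      - Clippy's utility   = Clippy's resource share (linear in resources)
      - without a deal, the winner takes everything and the loser gets nothing
      - with a deal, a winning Clippy leaves `share_to_humans` (Earth and a bit
        more) to humanity, and a winning humanity hands `share_to_clippy`
        (e.g. distant superclusters) to Clippy
    """
    p_clippy_wins = 1.0 - p_human_wins
    w = survival_weight

    if not deal:
        human_eu = p_human_wins * (w + (1 - w) * 1.0)  # survive and own everything
        clippy_eu = p_clippy_wins * 1.0
        return human_eu, clippy_eu

    # Humanity survives in both branches once the deal is in place.
    human_if_human_wins = w + (1 - w) * (1.0 - share_to_clippy)
    human_if_clippy_wins = w + (1 - w) * share_to_humans
    human_eu = (p_human_wins * human_if_human_wins
                + p_clippy_wins * human_if_clippy_wins)

    clippy_eu = (p_human_wins * share_to_clippy
                 + p_clippy_wins * (1.0 - share_to_humans))
    return human_eu, clippy_eu


if __name__ == "__main__":
    params = dict(p_human_wins=0.5, survival_weight=0.9,
                  share_to_humans=0.01, share_to_clippy=0.2)
    print("no deal:", expected_utilities(**params, deal=False))
    print("deal:   ", expected_utilities(**params, deal=True))
```

With these made-up numbers both parties end up with higher expected utility under the deal than without it. The gain comes entirely from the two sides valuing the same resources differently: humanity gives up something it weights lightly, and Clippy gives up a sliver of what he weights heavily. Different weights could easily make the deal unattractive to one side.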
All across the hierarchy of simulations, including at the root world, in base historical reality, various apparent victors commit resources to other plausible victors whenever a positive-sum mutual commitment can be found. So, even in base historical reality, humanity plausibly survives the unaligned singularity, albeit while forgoing much of the astronomical value pie in exchange.