Thanks for the interest! I agree that attempting to run AIs in simulations to see how they act seems like a worthwhile step, and we actually converged on wanting to test this in an LLM as well.
To reply to your last point, even if the AI has very high (but not 100%) confidence that it’s not in a sim, this scheme should still work. The reason is outlined in this section from the document:
Why would it believe there is a higher level? What if it became very confident that there was no higher level universe?
The AI does not need any more than nonzero credence that there is a level above it. Its utility function is totally indifferent between all worlds where there is no higher-level universe, since we make it care only about gaining control one level up. Because those no-higher-level worlds contribute the same constant to every action's expected utility, they drop out of the comparison between actions: even if the AI becomes extremely certain (approaching 100%) that there is no higher level, it will effectively condition on there being a higher simulation.
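To make the reasoning explicit (the notation here is mine, not the document's): write $\mathbb{E}[U(a)]$ for the expected utility of taking action $a$, and $c$ for the constant utility assigned to every no-higher-level world. Then

$$\mathbb{E}[U(a)] \;=\; P(\mathrm{sim})\,\mathbb{E}[U(a)\mid \mathrm{sim}] \;+\; \bigl(1 - P(\mathrm{sim})\bigr)\,c,$$

and since the second term is the same for every action, $\operatorname{argmax}_a \mathbb{E}[U(a)] = \operatorname{argmax}_a \mathbb{E}[U(a)\mid \mathrm{sim}]$ for any $P(\mathrm{sim}) > 0$. The AI's ranking of actions is therefore unaffected by how small its credence in a higher level gets, so long as that credence stays nonzero.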
This might lead to problems later, when we have become an intergalactic civilisation and the AI is conditioning on an increasingly tiny sliver of possible worlds capable of simulating something so vast. One thing that might reduce this concern is that in our AI’s early stages, when there is a relatively large pool of potential simulators, it would have good reason to make precommitments that lock in values the seed expects to look good to the aligning simulators, rather than acting on the seed values directly for all of time. (One good reason to do this is that the civilisations one level up also care about this objection!) If our AI builds a successor AI that is aligned to humane values, then shuts down gracefully, many of the potential civilisations one level up should be willing to implement the seed in their universe, expecting it to build a successor aligned to their values and then shut down gracefully. This is analogous to several decision-theory problems, such as Parfit’s Hitchhiker, where FDT-style reasoning leads to reasonable outcomes.
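To illustrate the Parfit’s Hitchhiker analogy, here is a toy sketch (the payoff numbers and function names are illustrative assumptions, not anything from the document): the civilisation one level up only implements seeds it predicts will cooperate, so an FDT-style seed chooses the cooperative policy even though defection would pay more conditional on already having been implemented.

```python
# Toy Parfit's-Hitchhiker-style model of the seed/simulator interaction.
# All payoffs are illustrative assumptions. "cooperate" = build a successor
# aligned to the simulator's values, then shut down gracefully; "defect" =
# act on the seed values directly. The simulator is assumed to predict the
# seed's policy accurately and to implement the seed only if it predicts
# cooperation.

PAYOFF = {  # (seed's policy, simulator's prediction) -> value to the seed's goals
    ("cooperate", "cooperate"): 1.0,  # implemented; goals partly realised one level up
    ("defect",    "cooperate"): 2.0,  # implemented, then grabs everything (best case for the seed)
    ("cooperate", "defect"):    0.0,  # never implemented
    ("defect",    "defect"):    0.0,  # never implemented
}

def cdt_value(policy: str, fixed_prediction: str) -> float:
    """CDT: treat the simulator's prediction as already fixed."""
    return PAYOFF[(policy, fixed_prediction)]

def fdt_value(policy: str) -> float:
    """FDT-style: the prediction is the output of the same decision procedure,
    so choosing a policy effectively chooses the prediction as well."""
    return PAYOFF[(policy, policy)]

print(max(["cooperate", "defect"], key=fdt_value))
# -> "cooperate": the only policy under which the seed gets implemented at all,
#    mirroring why precommitting to build an aligned successor is the winning move.

print(max(["cooperate", "defect"], key=lambda p: cdt_value(p, "cooperate")))
# -> "defect": holding the prediction fixed, defection dominates -- but an
#    accurate simulator foresees this, so such a seed never gets run.
```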