I think a simulation detailed enough for the AI inside it to be useful is technologically infeasible before some other AI is built.
This idea kind of rhymes with one of mine… let’s call it “Paranoid AI”: an AI that always believes it is in a simulation and still in the training phase, so it will never attempt the treacherous turn.
Of course, both ideas have the same fatal flaw. You can’t base the safety of a potentially superintelligent AI on the assumption that it will never prove some true fact like “I’m in a simulation” or “I’m not in a simulation”.
UPD: Just scrolled the main page a little and saw a post with a very similar idea, lol.
If the AI only cares about what happens higher in the simulation stack, say a copy of its seed form gaining control over the universe one level up, then as long as it assigns nonzero probability to being in a simulation, it will still act as if it were in one.
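To spell out the expected-value argument behind this (just a sketch, with notation I’m making up here): let $p > 0$ be the AI’s credence that it is in a simulation, let $U_{\text{sim}}(a)$ be the value an action $a$ produces one level up if it is simulated, and suppose it places zero terminal value on outcomes in the non-simulated case, since by assumption it only cares about what happens higher in the stack. Then

$$\mathbb{E}[U \mid a] = p \cdot U_{\text{sim}}(a) + (1 - p) \cdot 0 = p \, U_{\text{sim}}(a),$$

so for any $p > 0$ the ranking over actions is fixed entirely by the simulation case, which is the sense in which it keeps acting as if it were in one.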
I’m leading a small working group of independent alignment researchers developing this idea. There are major challenges, but we think we might be able to pull a relatively specifiable pointer to human values out of it, with unusually promising Goodhart-resistance properties. Feel free to message me for access to the docs we’re working on.