Cool, seems reasonable. Here are some minor responses (perhaps unwisely, given that we’re in a semantics labyrinth):
Evan’s footnote-definition doesn’t rule out malign priors unless we assume that the real world isn’t a simulation
Idk, if the real world is a simulation made by malign simulators, I wouldn’t say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I’m in even if it’s simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.
you are changing the definition of outer alignment if you think it assumes we aren’t in a simulation
Fwiw, I think this is true for a definition that always assumes that we’re outside a simulation, but I think it’s in line with previous definitions to say that the AI should think we’re not in a simulation iff we’re not in a simulation. That’s just stipulating unrealistically competent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we’re in the real world or in a simulation that’s identical to the real world; but it would be able to tell whether we’re in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that’s the type of simulator that we care about. So definitions of outer alignment that appeal to infinite data automatically assume that AIs would be able to tell the difference between worlds that are functionally like the real world and worlds with intervening simulators.
And then, yeah, in practice I agree we won’t be able to learn whether we’re in a simulation or not, because we can’t guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it’s not individually being simulated is already an inner alignment issue, for malign-prior reasons, and this is very similar.