Besides the stochastic solution I offered in my other comment, we could have some system in place where the AI is uncertain about which member of some set is its principal agent. The actual principal could be defined at the last minute, or could be left fully undefined.
In Paul Boxing, for instance, this would mean running the counterfactuals on a few mildly distinct scenarios and maximizing the number your principal agent gives you, but not knowing which of the agents will provide you with the number. (In case your creativity is running short of examples of how to vary a counterfactual with a person in a room with a computer, I would suggest varying some of the unknown or indeterminate metaphysical properties. You could run the same physical scenario, but in one simulation proximity between possible worlds is distributed in way A, in another in way B, and so on; in one scenario the continuum hypothesis is true, in another personal identity is a sorites problem, etc. You don’t need to vary the room itself, just some properties that supervene on the physics of that counterfactual.)
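Concretely, the setup amounts to expected-value maximization over a set of candidate principals. Here is a minimal toy sketch of that idea (my own illustration, not from anyone's actual proposal; the scenario names and scoring functions are made up): the AI picks an action that scores well on average across the candidate principals, while the one who actually provides the number is only settled afterwards.

```python
# Toy sketch: an AI must choose an action while uncertain which principal
# from a fixed set will end up scoring it, so it maximizes the EXPECTED score.
import random

# Hypothetical scoring functions, one per candidate principal/counterfactual.
principals = {
    "scenario_A": lambda action: 10 - abs(action - 3),
    "scenario_B": lambda action: 10 - abs(action - 5),
    "scenario_C": lambda action: 10 - abs(action - 4),
}

def expected_score(action, weights=None):
    """Average the score over all candidate principals, since the AI
    does not know which one will actually provide the number."""
    weights = weights or {name: 1 / len(principals) for name in principals}
    return sum(w * principals[name](action) for name, w in weights.items())

# The AI picks the action that does best in expectation across principals...
best_action = max(range(0, 11), key=expected_score)

# ...but the number it actually receives comes from one principal,
# chosen (or revealed) only at the last minute.
actual_principal = random.choice(list(principals))
print(best_action, principals[actual_principal](best_action))
```

Nothing hangs on the particular scoring functions; the point is only that "defined last minute, or fully undefined" cashes out as the AI optimizing against a distribution over principals rather than against any single one.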
If the thought has crossed your mind that an AI would never be uncertain about who its principal agent is, “because an AI is epistemically superior”, consider the scenario in which the same thing that causes you to experience many worlds as probabilities also confines the AI to single branches. In that case it already has uncertainty about who its principal agent is, since each instance of the AI would be in a different branch.
(EDIT: If you can’t conceive of an AI being uncertain about who its principal agent is, notice that, like you, the AI may be a macroscopic object inhabiting part, but not the totality, of the wavefunction. It would then have the same sort of uncertainty you have when performing a radioactive-decay experiment. In other words, any AI that lives in cross sections of many-worlds the way we humans do would have similar reasons not to know which of many epistemically equivalent worlds it inhabits. It would already be uncertain about its principal agent because of that, if nothing else.)