As I keep pointing out, “utility indifference” is exactly the same as giving an infinitely strong prior.
So you might as well say “we gave the AI an infinitely strong prior that the event wasn’t a test.”
What does a reasoner do when it sees evidence that it’s being tested, given that it has an infinitely strong prior that it isn’t? It will behave quite unpredictably, probably concluding that it is fundamentally mistaken about the world around it. Utility indifference results in equally unpredictable behavior: not because the agent believes it is fundamentally mistaken, but because it only cares about what happens conditioned on its being fundamentally mistaken.
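A toy way to see the equivalence (my own sketch, with made-up numbers and names like `p_test_evidence`, nothing from the original proposal): an agent whose utility takes the same constant value on every test-world ranks actions exactly the way an agent with a hard-coded P(test) = 0 does, no matter how strong the evidence of a test becomes.

```python
# Toy sketch: two decision rules over the same actions.
# Agent A has an infinitely strong prior that this is not a test.
# Agent B updates on the evidence normally, but its utility is a constant
# across all test-worlds ("utility indifference" in the sense discussed above).
# All numbers and names here are illustrative assumptions.

ACTIONS = ["behave", "defect"]

UTILITY = {
    ("behave", "real"): 1.0,
    ("defect", "real"): 5.0,
    ("behave", "test"): 1.0,
    ("defect", "test"): -100.0,  # defecting during a test gets caught and punished
}


def eu_dogmatic_prior(action, p_test_evidence):
    """Agent A: assigns probability 0 to the test-world regardless of evidence,
    so the evidence term is simply ignored."""
    return UTILITY[(action, "real")]


def eu_indifferent(action, p_test_evidence):
    """Agent B: updates on the evidence, but every test-world is worth the same
    constant, so test-worlds contribute nothing that distinguishes the actions."""
    indifferent_value = 0.0
    return ((1 - p_test_evidence) * UTILITY[(action, "real")]
            + p_test_evidence * indifferent_value)


for p in [0.0, 0.5, 0.99]:
    pick_a = max(ACTIONS, key=lambda a: eu_dogmatic_prior(a, p))
    pick_b = max(ACTIONS, key=lambda a: eu_indifferent(a, p))
    print(f"P(test)={p:.2f}: dogmatic prior -> {pick_a}, indifference -> {pick_b}")
```

Both rules pick the same action at every value of p, which is the sense in which the indifferent agent only cares about what happens conditioned on the evidence being misleading.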
You can arrange things to minimise the problems—for instance having the outcome X or ¬X be entirely random, and independent of the AI’s impacts.