Are you assuming some way of preventing the AI from acting strategically, with the knowledge that some of its choices are likely to be tests? If the AI is, in fact, much smarter than the gatekeepers, wouldn’t you expect that it can determine what’s a trap and what’s a real decision that will actually lead to the outcome the AI desires?
I think there’s a timeframe issue as well. You might be able to simulate an immediate decision, but it’s hard to test what an AI will do with some piece of information after it’s had 10 years of self-improvement and pondering.
I’ve used utility indifference to avoid that; the AI is perfectly aware that these are, in fact, tests. See http://lesswrong.com/r/discussion/lw/mrz/ask_and_ye_shall_be_answered/ and http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/, and ask me if you have further questions.
As I keep pointing out, “utility indifference” is exactly the same as giving an infinitely strong prior.
So you might as well say “we gave the AI an infinitely strong prior that the event wasn’t a test.”
What does a reasoner do when it sees evidence that it’s being tested, given that it has an infinitely strong prior that it isn’t? It’s going to behave pretty unpredictably, probably concluding that it is fundamentally mistaken about the world around it. Utility indifference results in equally unpredictable behavior, except that now it’s not because the agent believes it is fundamentally mistaken, but because it only cares about what happens conditional on being fundamentally mistaken.
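To make the comparison concrete, here is a minimal toy sketch (the action names, probabilities and payoffs are invented for illustration): replace the utility on the “this is a test” branch with a constant, and the agent ranks actions by exactly what an agent with a zero prior on being tested would rank them by, at least when the probability of a test doesn’t depend on the action.

```python
# Toy illustration only: made-up actions, probabilities and payoffs.
actions = ["cooperate", "defect"]

p_test = {"cooperate": 0.9, "defect": 0.9}        # evidence strongly suggests a test
u_if_test = {"cooperate": 5.0, "defect": -100.0}  # payoff if it really is a test
u_if_real = {"cooperate": 1.0, "defect": 10.0}    # payoff if it is the real thing

def eu_indifferent(action):
    # "Utility indifference": the test branch contributes only a constant,
    # so it cannot distinguish between actions.
    const = 0.0
    return p_test[action] * const + (1 - p_test[action]) * u_if_real[action]

def eu_zero_prior(action):
    # Infinitely strong prior that this is not a test: P(test) forced to 0.
    return 0.0 * u_if_test[action] + 1.0 * u_if_real[action]

for a in actions:
    print(f"{a}: indifferent={eu_indifferent(a):.2f}, zero_prior={eu_zero_prior(a):.2f}")

# Both agents prefer "defect": with p_test constant across actions, the two
# expected utilities differ only by the positive factor (1 - p_test), so the
# evidence of being tested never affects the choice.
```

Either way, the overwhelming evidence that it is being tested gets no grip on the decision, which is the point of the comparison.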
You can arrange things to minimise the problems—for instance having the outcome X or ¬X be entirely random, and independent of the AI’s impacts.
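To spell out one reading of why randomness and independence help (just a sketch, not a summary of the linked posts; c stands for the constant utility assigned to the X branch): if P(X) does not depend on the action a, then

```latex
% Sketch under the stated assumption that P(X) is independent of the action a;
% c is the constant utility assigned to the X branch.
\arg\max_a \Big[ P(X)\, c + \bigl(1 - P(X)\bigr)\, \mathbb{E}[U \mid \neg X, a] \Big]
  \;=\; \arg\max_a \, \mathbb{E}[U \mid \neg X, a]
```

which is exactly what an agent with all its prior weight on ¬X would maximise. If P(X) varied with the action, the left-hand side could rank actions differently, so keeping the outcome random and independent of the AI’s impacts removes that wrinkle.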
I wondered about that too, but this is not like a child that might notice that his parents are not actually looking when they say they are looking. The AI is built, via the mentioned techniques, in such a way that it doesn’t matter whether it understands the reasons, because it treats them as irrelevant. Only if we have overlooked some escape route might that make a difference, and an exhaustive search could plug such holes.