I’m agnostic about whether the AI values reward terminally or values some other complicated mix of things. The claim I’m making is behavioral—a claim that the strategy of “try to figure out how to get the most reward” would be selected over other strategies like “always do the nice thing.”
The strategy could be compatible with a bunch of different psychological profiles. “Playing the training game” is a filter over models: lots of possible models could do it. The claim is just that we need to reason about the distribution of psychologies, given that the psychologies which make it to the end of training most likely employ the strategy of playing the training game on the training distribution.
Why do I think this? Consider an AI that has high situational awareness, reasoning ability, and creative planning ability (assumptions of my scenario which don’t yet say anything about values). This AI has the ability to think about what kinds of actions would get the most reward (just like it has the ability to write a sonnet or solve a math problem or write some piece of software; it understands the task and has the requisite subskills). And once it has the ability, it is likely to be pushed in the direction of exercising that ability (since doing so would increase its reward).
That push changes its psychology in whatever way most easily results in it doing more of the think-about-what-would-get-the-most-reward-and-do-it behavior. Terminally valuing reward and only reward would certainly do the trick, but a lot of other things would too (e.g. valuing paperclips in the very long run).
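One rough way to see the “filter over models” point (a toy sketch of my own; the psychologies, episodes, and reward numbers below are invented for illustration): the training signal only scores behavior on the training distribution, so every psychology whose behavior maximizes reward there is favored equally, while a psychology that sometimes takes a lower-reward “nice” action is favored less.

```python
# Toy sketch (invented psychologies, episodes, and reward numbers): the training signal
# compares the reward that behavior earns, never the internal values behind it.

# Per-episode reward earned by each behavioral strategy. In most episodes the "nice"
# action coincides with the reward-maximizing action; in the last one it earns less.
episodes = [
    {"reward_seeking": 1.0, "always_nice": 1.0},
    {"reward_seeking": 1.0, "always_nice": 1.0},
    {"reward_seeking": 1.0, "always_nice": 0.6},
]

# Three psychologies with different internal motivations. Two of them play the training
# game (take the reward-maximizing action every episode); one always does the nice thing.
behavior = {
    "terminally_values_reward": "reward_seeking",
    "values_paperclips_longrun": "reward_seeking",   # plays the game instrumentally
    "always_do_the_nice_thing": "always_nice",
}

for psychology, strategy in behavior.items():
    mean_reward = sum(ep[strategy] for ep in episodes) / len(episodes)
    print(f"{psychology}: mean training reward = {mean_reward:.2f}")
# The two game-playing psychologies score identically (1.00) and are indistinguishable to
# the training signal; the "always nice" psychology scores lower (0.87) and is selected against.
```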
But why would that strategy be selected? Where does the selection come from? Will the designers toss out a really impressive AI just because it didn’t get reward on some particular timestep? I think not.
“And once it has the ability, it is likely to be pushed in the direction of exercising that ability (since doing so would increase its reward).”
Why? I maintain that the agent would not do so unless it were already terminally motivated by reward. As an empirical example: some neuroscientists know that brain stimulation reward reliably produces high reward, and the brain very likely does some kind of reinforcement learning, so why don’t those neuroscientists wirehead themselves?
I was talking about gradient descent here, not designers.
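To make the “selection comes from gradient descent, not designers” point concrete, here is a minimal sketch of my own (the two candidate behaviors and their average rewards are invented, and a bare softmax policy over two behaviors stands in for anything like a real training setup): gradient ascent on expected reward shifts probability toward whichever behavior earns more reward, with no one ever deciding to keep or toss the model.

```python
import math

# Toy sketch (invented behaviors and reward numbers): the "selection" is the gradient
# update itself. Nobody keeps or tosses the model; gradient ascent on expected reward
# shifts probability toward whichever behavior earns more reward during training.

logits = [0.0, 0.0]       # behavior 0: "always do the nice thing"; behavior 1: "figure out what gets the most reward"
AVG_REWARD = [0.6, 1.0]   # hypothetical average training reward each behavior earns
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(200):
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, AVG_REWARD))
    # Exact gradient of expected reward w.r.t. logit i is probs[i] * (AVG_REWARD[i] - expected_reward):
    # behaviors earning above-average reward get pushed up, the rest get pushed down.
    for i in range(len(logits)):
        logits[i] += lr * probs[i] * (AVG_REWARD[i] - expected_reward)

print([round(p, 3) for p in softmax(logits)])
# Probability mass ends up concentrated on behavior 1, purely as a consequence of the update rule.
```

Using the exact gradient rather than sampled rollouts keeps the toy deterministic; the qualitative point, that the update rule itself is the selector, is the same either way.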