The assumption says “will do”, not “will be able to do”. And the dynamics of the unknown environment include the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that is why I engage with the objection that reward is not the optimization target under this section.
In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward,
There is no need to recruit the concept of “terminal” here to follow the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success, but it does all this because of some “terminal” reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.
If I want to analyze what would probably happen if Edward Snowden tried to enter the White House, there’s lots I can say without needing to understand what deep reason he had for trying to do this. I can just look at the implications of his attempt to enter the White House: he’d probably get caught and go to jail for a long time. Likewise, if an RL agent is trying to maximize its reward, there’s plenty of analysis we can do that is independent of whether there’s some other terminal reason for this.
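For concreteness, one standard reading of “performs well according to the RL objective” is high expected discounted return (a generic formalization, not something quoted from the assumptions under discussion):

$$J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right], \qquad \gamma \in [0,1)$$

A policy performs well in this sense to the extent that it attains a high $J(\pi)$, regardless of why it does so, which is the point the Snowden analogy above is making.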
The assumption says “will do”, not “will be able to do”. And the dynamics of the unknown environment include the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that is why I engage with the objection that reward is not the optimization target under this section.
That’s fine. Let’s say the agent “will do at least human-level hypothesis generation regarding the dynamics of the unknown environment”. That still does not imply that reward is its optimization target. The monk in my analogy in fact does such hypothesis generation, deliberately modeling the origin of reward, and yet reward is not what they are seeking.
There is no need to recruit the concept of “terminal” here to follow the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success, but it does all this because of some “terminal” reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.
The reason we are talking about terminal vs. instrumental stuff is that your claims seem to be about how agents will learn what their goals are from the environment. But if advanced agents can use their observations & interactions with the environment to learn information about instrumental means (e.g. how do my actions create expected outcomes, how do I build a better world model, etc.) while holding their terminal goals (i.e. which outcomes I intrinsically care about) fixed, then that should change our conclusions about how advanced agents will behave, because competent agents reshape their instrumental behavior to subserve their terminal values, rather than inferring their terminal values from the environment or what have you. This is what “goal-content integrity”, a dimension of instrumental convergence, is all about.
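To make the instrumental/terminal split concrete, here is a minimal toy sketch in Python (hypothetical names; an illustration of the distinction only, not a claim about how any actual advanced agent is implemented): the agent’s interactions with the environment only ever update its world model, while the utility function it optimizes stays fixed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Tuple[int, ...]
Action = str

@dataclass
class FixedGoalAgent:
    # Terminal goal: a fixed mapping from outcomes to how much the agent cares about them.
    utility: Callable[[State], float]
    # Instrumental knowledge: learned predictions of where each action leads.
    model: Dict[Tuple[State, Action], State] = field(default_factory=dict)

    def observe(self, state: State, action: Action, next_state: State) -> None:
        """Update the world model from experience; the utility function is never touched."""
        self.model[(state, action)] = next_state

    def act(self, state: State, actions: List[Action]) -> Action:
        """Pick the action whose predicted outcome the fixed utility ranks highest."""
        def predicted_value(a: Action) -> float:
            predicted = self.model.get((state, a), state)  # unknown action: assume no change
            return self.utility(predicted)
        return max(actions, key=predicted_value)

# Toy usage: the agent gets better at achieving its goal as the model improves,
# but nothing about the interaction changes what the goal is.
agent = FixedGoalAgent(utility=lambda s: float(s[0]))  # cares only about the first coordinate
agent.observe(state=(0,), action="step", next_state=(1,))
print(agent.act(state=(0,), actions=["noop", "step"]))  # prints "step"
```

Nothing in `observe` rewrites `utility`; improving the model only changes how effectively the fixed goal gets pursued, which is the goal-content-integrity point.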
If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success
The video game player doesn’t want high reward that comes from cheating. It is not behaviourally identical to a reward maximiser unless you take the reward to be the quantity “what I would’ve received if I hadn’t cheated”.
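One way to make the cheating point precise (a generic formalization; reading “cheating” as tampering with the reward channel is an assumption about what the phrase means here): let $r_t$ be the reward actually received and $\tilde{r}_t$ the reward that would have been emitted had the agent not cheated. The objectives

$$\mathbb{E}\left[\sum_t \gamma^{t} r_t\right] \quad \text{and} \quad \mathbb{E}\left[\sum_t \gamma^{t} \tilde{r}_t\right]$$

come apart exactly on policies that cheat, which is the sense in which the player described here is not behaviourally identical to a maximiser of the first quantity.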