The assumption says “will do”, not “will be able to do”, and the dynamics of the unknown environment include the way it outputs rewards. The assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. That is why I engage with the objection that reward is not the optimization target under this section.
That’s fine. Let’s say the agent “will do at least human-level hypothesis generation regarding the dynamics of the unknown environment”. That still does not imply that reward is their optimization target. The monk in my analogy in fact does such hypothesis generation, deliberately modeling the origin of reward, and yet reward is not what they are seeking.
There is no need to recruit the concept of “terminal” here to follow the argument about the behavior of a policy that performs well according to the RL objective. If the video-game-playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success, but does all of this for some “terminal” reason X, that still amounts to deliberate reward optimization, and such a policy still satisfies Assumptions 1-4.
The reason we are talking about terminal vs. instrumental goals is that your claims seem to be about how agents will learn what their goals are from the environment. But if advanced agents can use their observations and interactions with the environment to learn about instrumental means (e.g., how their actions produce expected outcomes, or how to build a better world model) while holding their terminal goals (i.e., which outcomes they intrinsically care about) fixed, then that should change our conclusions about how advanced agents will behave: competent agents reshape their instrumental behavior to subserve their terminal values, rather than inferring their terminal values from the environment. This is what “goal-content integrity” (a dimension of instrumental convergence) is all about.
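The terminal/instrumental split above can be made concrete with a toy sketch (hypothetical, not from either commenter; all names here are illustrative): an agent whose terminal utility over outcomes is fixed at construction, and whose learning only ever updates its instrumental knowledge of which actions produce which outcomes.

```python
class Agent:
    def __init__(self, utility):
        # Terminal values: which outcomes the agent intrinsically cares about.
        # Learning never modifies these.
        self.utility = dict(utility)
        # Instrumental knowledge: observed outcome counts per action,
        # refined from experience (a crude world model).
        self.model = {}

    def observe(self, action, outcome):
        # Learn about means: update outcome counts for this action.
        counts = self.model.setdefault(action, {})
        counts[outcome] = counts.get(outcome, 0) + 1

    def expected_utility(self, action):
        # Score an action by the fixed terminal utility of its
        # empirically observed outcomes.
        counts = self.model.get(action, {})
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return sum(self.utility.get(o, 0.0) * n / total
                   for o, n in counts.items())

    def act(self, actions):
        # Choose the action the learned model says best serves the
        # fixed terminal values.
        return max(actions, key=self.expected_utility)

agent = Agent(utility={"win": 1.0, "lose": 0.0})
# Experience teaches the agent which action tends to produce "win"...
for _ in range(10):
    agent.observe("a", "win")
    agent.observe("b", "lose")
# ...so its behavior changes, but its terminal utility has not.
print(agent.act(["a", "b"]))   # -> a
print(agent.utility)           # unchanged: {'win': 1.0, 'lose': 0.0}
```

The point of the sketch is only that nothing in the learning loop touches `self.utility`: the environment shapes the agent’s beliefs about means, not its ends.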