Not the OP, so I’ll try to explain how I understood the post based on past discussions. [And pray that I’m not misrepresenting TurnTrout’s model.]
(Though I’m not clear on how much you’re talking about the suboptimality of SGD vs. the fact that optimal policies themselves need not explicitly represent or pursue reward, given that complex stews of heuristics may be faster or simpler. It also seems plausible you’re talking about something else entirely.)
As I read it, the post is not focused on some generally applicable suboptimality of SGD, nor is it saying that policies that would maximize reward in training need to explicitly represent reward.
It is mainly talking about an identifiability gap within certain forms of reinforcement learning: there is a range of cognition compatible with the same reward performance. Computations that increment reward as a side effect (because, for instance, the agent is competently doing the rewarded task) get reinforced just as strongly as computations that act *in order to* increment reward, if the agent adopts them. Given that, we need some rationale beyond reward performance to expect the particular pattern of reward optimization (“reward but no task completion”) from RL agents.
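To make the identifiability point concrete, here is a toy sketch (the environment, policies, and update rule are all hypothetical, not anything from the post): two policies with very different internals produce the same behavior, and a learning rule that only sees (state, action, reward) reinforces them identically.

```python
# Toy sketch of the identifiability gap (all details hypothetical):
# the update rule sees only (state, action, reward), so it cannot
# distinguish a task heuristic from explicit reward pursuit.

def task_heuristic_policy(state):
    # Competently does the rewarded task; never computes reward.
    return "move_right" if state < 5 else "stop"

def reward_seeking_policy(state):
    # Explicitly predicts reward for each action and picks the argmax.
    predicted = {"move_right": 1.0 if state < 5 else 0.0,
                 "stop": 1.0 if state >= 5 else 0.0}
    return max(predicted, key=predicted.get)

def env_reward(state, action):
    # Rewards competent task behavior.
    return 1.0 if action == ("move_right" if state < 5 else "stop") else 0.0

def reinforce(prefs, state, action, reward, lr=0.1):
    # Strengthens whatever (state, action) preceded reward, regardless
    # of the computation that produced the action.
    prefs[(state, action)] = prefs.get((state, action), 0.0) + lr * reward

prefs_heuristic, prefs_seeker = {}, {}
for state in range(10):
    a, b = task_heuristic_policy(state), reward_seeking_policy(state)
    reinforce(prefs_heuristic, state, a, env_reward(state, a))
    reinforce(prefs_seeker, state, b, env_reward(state, b))

assert prefs_heuristic == prefs_seeker  # identical updates, different internals
```

Since the update signal is a function of behavior and reward alone, nothing in it privileges the reward-seeking internals over the heuristic ones; any argument that reward-seeking cognition gets selected would have to come from somewhere else, e.g. inductive biases.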
In addition to the identifiability issue, the post (as well as Steve Byrnes in a sister thread) notes a kind of inner alignment issue. Because an RL agent influences its own training process, it can steer itself towards futures where its existing motivations are preserved instead of being modified (for example, modified into reward-optimizing ones). This seems increasingly likely as the agent approaches strategic awareness, since it could then model how its behavior might lead to its goals being changed. This second issue depends on the fact that we are doing local search: the current agent can sway which policies are available for selection.
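A minimal sketch of the local-search point (again, the gridworld and all numbers are made up, not from the post): because updates are computed from on-policy experience, a policy whose current values keep it away from the state where reward would reshape it simply never receives those updates.

```python
import numpy as np

n_states = 3                 # state 2 holds the reward that would reshape values
alpha, gamma = 0.5, 0.9

def run(initial_q, steps=50):
    q = initial_q.astype(float)
    visits = np.zeros(n_states, dtype=int)
    state = 0
    for _ in range(steps):
        action = int(np.argmax(q[state]))               # greedy on current values
        next_state = min(state + action, n_states - 1)  # action 1 advances
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # On-policy TD update: touches only the visited (state, action) pair.
        q[state, action] += alpha * (reward + gamma * q[next_state].max()
                                     - q[state, action])
        visits[next_state] += 1
        state = 0 if next_state == n_states - 1 else next_state  # episode reset
    return q, visits

# Values that favor staying put keep the agent out of state 2 entirely,
# so the reward there never touches its parameters:
_, visits_avoider = run(np.array([[0.1, 0.0]] * n_states))
_, visits_seeker = run(np.array([[0.0, 0.1]] * n_states))
print(visits_avoider)  # state 2 never visited: no reshaping update ever occurs
print(visits_seeker)   # state 2 visited repeatedly: values get reshaped
```

The analogous point for policy-gradient methods: gradients are expectations over the current policy’s trajectory distribution, so the current policy weighs which directions of parameter space the search can even see.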
Together these point towards a certain way of reasoning about agents under RL: modeling their current cognition (including their motivations, values, etc.) as downstream of past reinforcement and punishment events. I think this kind of reasoning should constrain our expectations about how reinforcement schedules + training environments + inductive biases lead to particular patterns of behavior, in a way that is more specific than if we were only reasoning about reward-optimal policies. Though I am less certain at the moment about how to flesh that out.