This is a great post! Thank you for writing it.

There’s a huge amount of ontological confusion about how to think of “objectives” for optimization processes. I think people tend to take an inappropriate intentional stance and treat something like “deliberately steering towards certain abstract notions” as a simple primitive (because it feels introspectively simple to them). This background assumption casts a shadow over all subsequent analysis, since people try to summarize the dynamics of optimization processes in terms of their “true objectives”, when there really isn’t any such thing.
Optimization processes (or at least, evolution and RL) are better thought of in terms of what sorts of behavioral patterns were actually selected for in the history of the process. E.g., @Kaj_Sotala’s point here about tracking the effects of evolution by thinking about what sorts of specific adaptations were actually historically selected for, rather than thinking about some abstract notion of inclusive genetic fitness, and how the difference between modern and ancestral humans seems much smaller from this perspective.
I want to make a similar point about reward in the context of RL: reward is a measure of update strength, not the selection target. We can see as much by just looking at the update equations for REINFORCE (from page 328 of Reinforcement Learning: An Introduction):
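(Sketching the per-step update from the boxed REINFORCE pseudocode there, with $G_t$ the return from step $t$, $\alpha$ the step size, and $\pi$ the parameterized policy; the discounted version also carries a $\gamma^t$ factor:)

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$$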
The reward[1] is literally a (per-step) multiplier on the learning rate. You can also think of it as providing the weights of a linear combination of parameter gradients, which means it’s the historical action trajectories that determine which subspaces of parameter space can potentially be explored. And because gradients are highly correlated (at least relative to the full volume of parameter space), it’s the action trajectories, not the reward function, that provide most of the information relevant to the NN’s learning process.
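To make the “multiplier” framing concrete, here’s a minimal toy sketch (a stateless softmax policy in plain numpy; the setup and numbers are purely illustrative, not taken from any of the papers cited here). The direction of the parameter update is determined entirely by which action was actually sampled; the return only rescales that update:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, action, G, alpha=0.1):
    """One REINFORCE-style update for a stateless softmax policy.

    grad_logp depends only on the sampled action (the trajectory);
    the return G enters purely as a scalar multiplier on that gradient.
    """
    probs = softmax(theta)
    grad_logp = -probs                 # d/d(theta_j) log pi(action) = 1{j=action} - pi_j
    grad_logp[action] += 1.0
    return theta + alpha * G * grad_logp

# Same sampled action, two different returns (hypothetical numbers).
theta = np.zeros(3)
step_small = reinforce_update(theta, action=2, G=0.1) - theta
step_big = reinforce_update(theta, action=2, G=10.0) - theta
print(step_big / step_small)           # ~[100. 100. 100.]
```

The two updates point in exactly the same direction and differ only in length, which is the sense in which the return behaves like a per-step learning-rate multiplier rather than a target being optimized towards.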
From Survival Instinct in Offline Reinforcement Learning:

on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.
Trying to preempt possible confusion:
I expect some people to object that the point of the evolutionary analogy is precisely to show that the high-level abstract objective of the optimization process isn’t incorporated into the goals of the optimized product, and that this is a reason for concern because it suggests an unpredictable/uncontrollable mapping between outer and inner optimization objectives.
My point here is that, if you want to judge an optimization process’s predictability/controllability, you should not be comparing some abstract notion of the process’s “true outer objective” to the result’s “true inner objective”. Instead, you should consider the historical trajectory of how the optimization process actually adjusted the behaviors of the thing being optimized, and consider how predictable that thing’s future behaviors are, given past behaviors / updates.
@Kaj_Sotala argues above that this perspective implies greater consistency in human goals between the ancestral and modern environments, since the goals evolution actually historically selected for in the ancestral environment are ~the same goals humans pursue in the modern environment.
For RL agents, I am also arguing that thinking in terms of the historical action trajectories that were actually reinforced during training implies greater consistency, as compared to thinking in terms of some “true goal” of the training process. E.g., Goal Misgeneralization in Deep Reinforcement Learning trained a mouse agent to navigate to cheese that was always placed in the upper-right corner of a maze, and found that it kept going to the upper right even when the cheese was moved.
This is actually a high degree of consistency from the perspective of the historical action trajectories. During training, the mouse repeatedly executed action trajectories that took it to the upper right of the maze, and it continued to do exactly the same thing in the modified testing environment.
[1] Technically it’s the future return in this formulation, and current SOTA RL algorithms can be different / more complex, but I think this perspective is still a more accurate intuition pump than notions of “reward as objective”, even for setups where “reward as a learning rate multiplier” isn’t literally true.