Many people seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents. [...] Reward chisels cognitive grooves into an agent.
I do think that sufficiently sophisticated[1] RL policies trained on a predictable environment with a simple and consistent reward scheme will probably develop an internal model of the thing being rewarded, as a single salient thing, and, separately, that some learned models will learn to make longer-term-than-immediate predictions about the future. As such, I do expect “iterate through some likely actions, and choose one where the reward proxy is high” to emerge at some point as an available strategy for RL policies[2].
My impression is that it’s an open question to what extent that available strategy actually performs better than a more sphexish pile of “if the environment looks like this, execute that behavior” heuristics, given a fixed amount of available computational power. In the limit as the system’s computational power approaches infinity and the accuracy of its predictions about future world states approaches perfection, the argmax(EU) strategy gets reinforced more strongly than any other strategy, and so that is what ends up getting chiseled into the model’s cognition. But of course in that limit “just brute-force sha256 bro” is an optimal strategy in certain situations, so the extent to which the “in the limit” behavior resembles the “in the regimes we actually care about” behavior is debatable.
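To make the contrast concrete, here’s a deliberately toy sketch (the gridworld, the function names, and the hard-coded “learned” world model and reward proxy are all just illustrative inventions, not anything from the post): one policy iterates over candidate actions and scores them with an internal world model plus reward proxy; the other is a flat pile of reactive heuristics. In this toy setting they behave identically, which is sort of the point: the open question is which one training reinforces in messier environments under a fixed compute budget.

```python
import random  # unused here, but typical of the messier settings being gestured at

# Toy illustration only: a tiny deterministic line-world where the reward is
# highest at position GOAL. Both "policies" below see only the current state.
GOAL = 7
ACTIONS = [-1, 0, +1]

def true_reward(state: int) -> float:
    return -abs(state - GOAL)

# Style 1: "iterate through some likely actions, and choose one where the
# reward proxy is high" -- candidate actions are scored with an internal
# (here hard-coded, hypothetically learned) world model and reward proxy.
def planner_policy(state: int,
                   world_model=lambda s, a: s + a,         # guessed transition
                   reward_proxy=lambda s: -abs(s - GOAL)   # proxy for reward
                   ) -> int:
    return max(ACTIONS, key=lambda a: reward_proxy(world_model(state, a)))

# Style 2: a "sphexish" pile of reactive heuristics -- no lookahead, just
# "if the environment looks like this, execute that behavior".
def heuristic_policy(state: int) -> int:
    if state < GOAL:
        return +1
    if state > GOAL:
        return -1
    return 0

def rollout(policy, state: int = 0, steps: int = 10) -> float:
    total = 0.0
    for _ in range(steps):
        state += policy(state)
        total += true_reward(state)
    return total

if __name__ == "__main__":
    # Identical returns here; the interesting question is which style wins
    # when the environment is messy and the planner has to pay for lookahead.
    print("planner  :", rollout(planner_policy))
    print("heuristic:", rollout(heuristic_policy))
```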
If I’m reading Ability to Solve Long Horizon Tasks Correlates With Wanting correctly, that post argues that you can’t get good performance on any task where the reward is distant in time from the actions unless your system is doing something like this.

[1] And “sufficiently” is likely a pretty low bar.
Interesting! I had thought this was already your take, based on posts like Reward is not the Optimization Target.