Often people talk about policies getting “selected for” on the basis of maximizing reward. Then, inductive biases serve as “tie breakers” among the reward-maximizing policies. This perspective (a) makes it harder to understand and describe what the trained network is actually implementing, and (b) mispredicts what happens.
Consider the setting where the cheese (the goal) was randomly spawned somewhere in the top-right 5x5 region. If reward were really lexicographically important (taking first priority over inductive biases), then this setting would train agents that always go to the cheese, because going to the top-right corner often doesn’t lead to reward.
But that’s not what happens! This post repeatedly demonstrates that the mouse doesn’t reliably go to the cheese or the top-right corner.
The original goal misgeneralization paper argued that if multiple “goals” lead to reward maximization on the training distribution, then we don’t know which one will be learned. This much was true in the 1x1 setting, where the cheese always spawned in the top-right square, and so the policy simply learned to go to that square (not to the cheese).
However, it’s not true that “go to the top-right 5x5” is a goal that maximizes training reward in the 5x5 setting! Go to the top-right 5x5… and then what? Reaching that region doesn’t mean the mouse has reached the cheese. What happens next?[1]
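To make the arithmetic behind the last two paragraphs concrete, here is a toy calculation. It is a sketch under simplified assumptions, not the actual training environment: the grid size, the coordinates, the all-or-nothing reward, and both hand-written policies are illustrative stand-ins.

```python
import itertools

# Toy calculation, not the actual environment: an episode counts as rewarded
# only if the policy's final square is the cheese square. Walls, timeouts,
# and the chance of stumbling onto the cheese en route are ignored, and the
# grid size, coordinates, and both hand-written "policies" are assumptions
# made for illustration.

MAZE_SIZE = 13                             # hypothetical grid size
CORNER = (MAZE_SIZE - 1, MAZE_SIZE - 1)    # the top-right square

def spawn_region(side):
    """All squares in the top-right side x side region where the cheese can spawn."""
    lo = MAZE_SIZE - side
    return [(x, y) for x, y in itertools.product(range(lo, MAZE_SIZE), repeat=2)]

def expected_training_reward(policy, side):
    """Average reward over a uniform distribution of cheese positions."""
    region = spawn_region(side)
    return sum(1.0 for cheese in region if policy(cheese) == cheese) / len(region)

def go_to_cheese(cheese):
    return cheese   # this behavior always ends the episode on the cheese square

def go_to_corner(cheese):
    return CORNER   # this behavior always ends the episode on the top-right square

for side in (1, 5):
    print(side,
          expected_training_reward(go_to_cheese, side),   # 1.0 in both settings
          expected_training_reward(go_to_corner, side))   # 1.0 for 1x1, 0.04 for 5x5
```

Under these simplified assumptions, the two behaviors earn identical training reward when the cheese always spawns in the single corner square, but “go to the corner” forfeits most of the reward (24 out of 25 spawn positions) once the spawn region widens to 5x5.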
If you demand precision and don’t let yourself say “it’s basically just going to the corner during training”, and instead ask yourself, “what goal, precisely, has this policy learned?”, you’ll be forced to conclude that the network didn’t learn a single goal that was “compatible with training.” The network learned multiple goals (“shards”) that activate more strongly in different situations (e.g. near the cheese vs. near the corner), and these learned goals do not all individually maximize reward (e.g. going to the corner does not).
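As a cartoon of what “multiple goals which activate more strongly in different situations” could look like, here is a toy sketch. It is purely illustrative: the real agent is a learned convolutional network, and the coordinates, the inverse-distance activation rule, and the hand-coded subgoals are assumptions, not anything measured from that network.

```python
import math

# Cartoon of the "multiple shards" description, purely for illustration:
# two hand-coded subgoals, each with a context-dependent activation strength.

TOP_RIGHT_CORNER = (12, 12)   # hypothetical top-right square of a 13x13 grid

def shard_weights(mouse, cheese, corner=TOP_RIGHT_CORNER):
    """Hypothetical activation strengths: each shard fires harder the closer its target is."""
    w_cheese = 1.0 / (1.0 + math.dist(mouse, cheese))
    w_corner = 1.0 / (1.0 + math.dist(mouse, corner))
    return w_cheese, w_corner

def most_active_shard(mouse, cheese, corner=TOP_RIGHT_CORNER):
    """Which subgoal most strongly steers behavior in this situation."""
    w_cheese, w_corner = shard_weights(mouse, cheese, corner)
    return "cheese-shard" if w_cheese >= w_corner else "corner-shard"

print(most_active_shard(mouse=(8, 9), cheese=(8, 10)))   # cheese-shard: the cheese is adjacent
print(most_active_shard(mouse=(11, 12), cheese=(2, 0)))  # corner-shard: the cheese is far away
```

The hard switch is itself a simplification (a weighted blend of influences would be closer to the description above); the structural point is only that which “goal” steers behavior depends on the situation, and that neither subgoal on its own maximizes training reward.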
In this way, shard theory offers a unified and principled perspective that makes more accurate predictions.[2] This work shows strong mechanistic and behavioral evidence for that perspective.
This result falsifies the extremely confident versions of “RL is well-understood as selecting super hard for goals which maximize reward during training.”
This post explains why shard theory moderately strongly (but not perfectly) predicts these outcomes.
> Often people talk about policies getting “selected for” on the basis of maximizing reward. Then, inductive biases serve as “tie breakers” among the reward-maximizing policies.

Does anyone do this? Under this model of selection, a policy that simply memorizes the training data would basically always win out, which I’ve never really seen anyone predict. It seems clear that inductive biases do more than tie-breaking.
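A minimal sketch of the kind of policy this objection has in mind (the class and its interface are hypothetical, invented for illustration): a lookup table over training states attains maximal training reward, so a “reward first, inductive biases only break ties” account would have to count it among the winners.

```python
from typing import Dict, Hashable

class MemorizingPolicy:
    """A hypothetical lookup-table policy: it reproduces the reward-maximizing
    action on every training state and behaves arbitrarily everywhere else,
    so it ties with any other reward-maximizing policy on training reward."""

    def __init__(self, best_actions: Dict[Hashable, int], default_action: int = 0):
        self.best_actions = best_actions      # memorized state -> best-action pairs
        self.default_action = default_action  # unconstrained behavior off the training set

    def act(self, state: Hashable) -> int:
        return self.best_actions.get(state, self.default_action)
```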