Erik Jenner comments on Catastrophic Goodhart in RL with KL penalty

Erik Jenner 15 May 2024 16:54 UTC
6 points
The manner in which these pathological policies $π$ achieve high $E[U]$ is also concerning: most of the time they match the reference policy $π_{0}$, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy $π$, it could be impossible to tell whether $π$ is Goodharting or identical to the base policy.
I’m confused; to learn this policy $π$, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they’re very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.
One way we wouldn’t see the trajectories is if the model becomes aligned with “maximize whatever my reward signal is,” figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they’re too rare to occur even once during training under the optimal policy.) But that’s a much more specific and speculative story.
I haven’t thought much about how this affects the overall takeaways but I’d guess that similar things apply to heavy-tailed rewards in general (i.e. if they’re rare but big enough to still have an important effect, we can probably catch them pretty easily—though how much that helps will of course depend on your threat model for what these errors $X$ are).
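To make the outlier argument above concrete, here is a toy sketch. It is not from the post: it assumes ordinary trajectories have standard normal reward, that rare trajectories get a fixed, very large reward, and that every training reward is logged. The names `simulate_rewards` and `flag_outliers` are hypothetical, and the median/MAD rule is just one simple detector.

```python
import random
import statistics


def simulate_rewards(n, p_rare, rare_reward, seed=0):
    """Toy reward log (assumed setup): N(0, 1) rewards, plus a fixed huge
    reward with probability p_rare per trajectory."""
    rng = random.Random(seed)
    return [
        rare_reward if rng.random() < p_rare else rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


def flag_outliers(rewards, threshold=10.0):
    """Return indices whose reward is more than `threshold` robust standard
    deviations from the median (median absolute deviation rule)."""
    med = statistics.median(rewards)
    mad = statistics.median([abs(r - med) for r in rewards])
    scale = 1.4826 * mad  # MAD -> std-dev estimate for Gaussian data
    if scale == 0:
        return []
    return [i for i, r in enumerate(rewards) if abs(r - med) > threshold * scale]


if __name__ == "__main__":
    rewards = simulate_rewards(n=100_000, p_rare=1e-4, rare_reward=1e5)
    flagged = set(flag_outliers(rewards))
    kept = [r for i, r in enumerate(rewards) if i not in flagged]
    print("flagged trajectories:", len(flagged))
    print("mean reward, all:    ", sum(rewards) / len(rewards))
    print("mean reward, kept:   ", sum(kept) / len(kept))
```

In this setup the rare trajectories only dominate the mean reward because each one is enormous, and that is exactly what makes them stand out in the logged reward distribution. The sketch says nothing about the zero-shot case, where such trajectories never appear in training at all.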
- Thomas Kwa 15 May 2024 18:28 UTC
  2 points
  This is a fair criticism. I changed “impossible” to “difficult”.
  
  My main concern is with future forms of RL that are some combination of better at optimization (thus making the model more inner aligned even in situations it never directly sees in training) and possibly opaque to humans such that we cannot just observe outliers in the reward distribution. It is not difficult to imagine that some future kind of internal reinforcement could have these properties; maybe the agent simulates various situations it could be in without stringing them together into a trajectory or something. This seems worth worrying about even though I do not have a particular sense that the field is going in this direction.