This is a really cool insight, at least from an MDP theory point of view. During my PhD, I proved a bunch of results on the state visitation distribution perspective and used a few of those results to prove the power-seeking theorems. MDP theorists have, IMO, massively overlooked this perspective, and I think it’s pretty silly.
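For context, a minimal statement of the perspective I mean (my notation; the post/paper may use a different convention): the discounted state visitation distribution of a policy $\pi$ from start state $s_0$ is
$$d^{\pi}(s) \;=\; (1-\gamma)\sum_{t \ge 0} \gamma^{t}\,\Pr(s_t = s \mid \pi, s_0),$$
and policy evaluation becomes linear in the reward, $V^{\pi}_r(s_0) = \tfrac{1}{1-\gamma}\langle d^{\pi}, r\rangle$. The achievable $d^{\pi}$ form a convex polytope, so optimizing any fixed reward is a linear program over that polytope, which is presumably the geometric picture at play below.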
I haven’t read the paper yet, but I doubt this is the full picture on Goodhart / the problems we actually care about (like getting good policies out of RL, which do what we want).
Under the geometric picture, Goodharting occurs in the general case because, when reward functions are selected uniformly at random (an extremely dubious assumption!), pairs of high-dimensional reward vectors are generically nearly orthogonal to each other (i.e., the angle between them is large).
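A quick numerical sanity check of that near-orthogonality claim (a sketch only, assuming “uniformly random” means directions drawn uniformly on the sphere, i.e. normalized i.i.d. Gaussians):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_angle_deg(dim: int, n_pairs: int = 2000) -> float:
    """Mean angle (degrees) between random pairs of directions in R^dim.

    Directions are i.i.d. standard Gaussians, which after normalization is
    the same as sampling uniformly from the sphere.
    """
    u = rng.normal(size=(n_pairs, dim))
    v = rng.normal(size=(n_pairs, dim))
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

for dim in (3, 30, 300, 3000):
    print(f"dim={dim:5d}  mean angle ~ {mean_angle_deg(dim):5.1f} deg")
```

The angles concentrate around 90° as the dimension grows; whether this sampling model says anything about realistic proxy rewards is exactly the dubious assumption flagged above.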
Even practically, it’s super hard to get regret bounds for following the optimal policy (an extremely dubious assumption!) of an approximate reward function; the best bounds are, like, $\frac{2}{1-\gamma}\cdot \sup_s |R_0(s)-R_1(s)|$, which is horrible.
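To spell out the bound I have in mind (the standard reward-perturbation argument; exact constants vary by statement): for any fixed policy, $|V^{\pi}_{R_0} - V^{\pi}_{R_1}| \le \frac{1}{1-\gamma}\sup_s|R_0(s)-R_1(s)|$, and applying this twice plus the optimality of $\pi^*_{R_1}$ under $R_1$ gives
$$V^{\pi^*_{R_0}}_{R_0} - V^{\pi^*_{R_1}}_{R_0} \;\le\; \frac{2}{1-\gamma}\,\sup_s |R_0(s)-R_1(s)|.$$
At $\gamma = 0.99$ the prefactor is already $200$ and it blows up as $\gamma \to 1$: with rewards in $[0,1]$, even a uniform reward error of $0.05$ only bounds the regret by $10$ out of a maximum possible return of $\tfrac{1}{1-\gamma} = 100$.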
But since you aren’t trying to address the optimal case, maybe there’s hope (although I personally don’t think that reward Goodharting is a key problem, rather than a facet of a deeper issue).
I want to note that LLM-RL is probably extremely dissimilar from doing direct gradient ascent in the convex hull of state visitation distributions / value iteration. This means—whatever this work’s other contributions—we can’t directly conclude that this causes training-distribution Goodharting in the settings we care about.
I think this is a bit misleading; the angle is larger than visually implied, since the vectors aren’t actually normal to the level sets.
Also, couldn’t I tell a story where you first move parallel to the projection of $MR_1$ onto the boundary for a while, and then move parallel to $MR_1$? Maybe the point is “eventually (under this toy model of how DRL works) you hit the boundary, and have thereafter spent your ability to increase both of the returns at the same time.” (EDIT: maybe this is ruled out by your theoretical results?)
Actually, this claim in the original post is false for state-based occupancy measures, but it might be true for state-action measures. From p163 of On Avoiding Power-Seeking by Artificial Intelligence:
(Edited to qualify the counterexample as only applying to the state-based case)
Thanks for the comment! Note that we use the state-action visitation distribution, so we consider trajectories that contain actions as well. This makes it possible to invert η (as long as all states are visited). Using only state trajectories, it would indeed be impossible to recover the policy.
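For concreteness, the inversion is just per-state normalization; here is a minimal sketch (my own illustration, not code from the paper), assuming a tabular occupancy measure eta[s, a]:

```python
import numpy as np

def policy_from_occupancy(eta: np.ndarray) -> np.ndarray:
    """Recover the policy from a tabular state-action occupancy measure.

    eta[s, a] is the (discounted) visitation mass of the pair (s, a).
    The policy is the per-state conditional distribution over actions:
        pi[s, a] = eta[s, a] / sum_a' eta[s, a'].
    This only identifies the policy at states with positive mass,
    i.e. it requires that every state is actually visited.
    """
    state_mass = eta.sum(axis=1, keepdims=True)
    if np.any(state_mass == 0):
        raise ValueError("unvisited state: its action distribution is unidentified")
    return eta / state_mass

# Note: a state-only visitation distribution d(s) drops the action marginal,
# so different policies can induce the same d -- hence the state-based
# non-invertibility discussed above.
```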
Thanks, this was an oversight on my part.