I agree there’s something to the intuition that trajectories/world states in which reward-hacking has occurred are “sharp” in some sense, and I think it could be interesting to think more along these lines. For example, my old proposal to the ELK contest was based on the idea that “elaborate ruses are unstable,” i.e. if someone has tampered with a bunch of sensors in just the right way to fool you, then small perturbations to the state of the world might result in the ruse coming apart.
I think this demo is a cool proof-of-concept, but it’s still far from convincing enough to merit further investment. If I were working on this, I would try to come up with an example setting that (a) is more realistic, (b) is plausibly analogous to future cases of catastrophic reward hacking, and (c) seems especially leveraged for this technique (i.e., it seems like this technique will really dramatically outperform baselines). Other things I would do:
Think more about what the baselines are here—are there other techniques you could have used to fix the problem in this setting? (If there are but you don’t think they’ll work in all settings, then think about what properties you need a setting to have to rule out the baselines, and make sure you pick a next setting that satisfies those properties.)
The technique here seems a bit hacky—just flipping the sign of the gradient update on abnormally high-reward episodes, IIUC. I’d think more about whether there’s something more principled to aim for here. E.g., just spitballing, maybe what you want to do is take the original reward function R(τ), where τ is a trajectory, and instead optimize a “smoothed” reward function R′(τ) which is produced by averaging R(τ′) over a bunch of small perturbations τ′ of τ (produced e.g. by modifying τ by changing a small number of tokens); see the rough sketch below.
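To make the spitballing a bit more concrete, here’s a minimal sketch of that smoothing step, assuming the reward function takes a token sequence and the perturbations are random token substitutions. All of the names and parameters here (`reward_fn`, `vocab`, `n_perturbations`, `n_token_edits`) are illustrative placeholders, not anything from the actual demo:

```python
import random

def smoothed_reward(reward_fn, trajectory, vocab,
                    n_perturbations=16, n_token_edits=3, rng=None):
    """Monte-Carlo estimate of a 'smoothed' reward R'(tau).

    Averages reward_fn over small random perturbations of `trajectory`,
    where each perturbation replaces `n_token_edits` tokens with random
    tokens drawn from `vocab`. Names and defaults are hypothetical.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_perturbations):
        perturbed = list(trajectory)
        # Pick a few positions and overwrite them with random tokens.
        positions = rng.sample(range(len(perturbed)),
                               k=min(n_token_edits, len(perturbed)))
        for idx in positions:
            perturbed[idx] = rng.choice(vocab)
        total += reward_fn(perturbed)
    return total / n_perturbations
```

The hope would be that a reward-hacked trajectory scores high on R but low on this smoothed R′, because the hack breaks under small edits; the choice of perturbation distribution is doing most of the work here and would itself need thought.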
Cool stuff!
Thanks for the feedback; these suggestions are definitely helpful as I’m thinking about how (and whether) to advance the project.