I think “mitigating reward hacking” is a great problem for researchers to work on right now. Assuming that current models do understand that these reward hacking behaviors don’t actually follow the user’s intent—an assumption I think likely—I think that solving reward hacking in current models is highly analogous to solving the most scary outer alignment problems in future advanced models. The main disanalogies are:
Current models are not wildly superhuman at coding, so humans can still spot most of these reward hacks with enough effort.
Current models probably aren’t being deliberately dishonest, e.g. they’ll likely admit to having reward hacked if pressed.
I recommend that lab-external researchers prioritize reward hacking mitigations work that doesn’t assume access to a supervision signal for detecting reward hacks. I.e. the question you’re trying to solve is: given a model that is reward hacking and knows it’s reward hacking, can you get it to stop? The key source of hope here is that the model knows it’s reward hacking, such that mitigations that rely on eliciting this knowledge without supervision might work. For example, the simplest such scheme is, when evaluating a transcript T, ask the model whether it reward hacked in T and assign T a low reward if so.
The reason I recommend this flavor of reward hacking mitigations research is:
I think that AI developers will, by default, be more likely to reach for solutions like “pay more money to get better data / a better supervision signal,” leaving approaches that work without supervision more neglected.
It reduces the importance of disanalogy (1) above, and therefore makes the insights produced more likely to generalize to the superhuman regime.
People interested in approaches to solving reward hacking which don’t depend on supervision (and thus might scale to superintelligence) should consider looking at our earlier work on measurement tampering. I don’t particularly recommend the benchmark atm, but the discussion in the blog post might be interesting. This type of setting makes some assumptions about the type and structure of the reward hacking, but I think it will be hard to resolve reward hacking in a principled way without supervision without assumptions similar to these.
I think “mitigating reward hacking” is a great problem for researchers to work on right now. Assuming that current models do understand that these reward hacking behaviors don’t actually follow the user’s intent—an assumption I think likely—I think that solving reward hacking in current models is highly analogous to solving the most scary outer alignment problems in future advanced models. The main disanalogies are:
Current models are not wildly superhuman at coding, so humans can still spot most of these reward hacks with enough effort.
Current models probably aren’t being deliberately dishonest, e.g. they’ll likely admit to having reward hacked if pressed.
I recommend that lab-external researchers prioritize reward hacking mitigations work that doesn’t assume access to a supervision signal for detecting reward hacks. I.e. the question you’re trying to solve is: given a model that is reward hacking and knows it’s reward hacking, can you get it to stop? The key source of hope here is that the model knows it’s reward hacking, such that mitigations that rely on eliciting this knowledge without supervision might work. For example, the simplest such scheme is, when evaluating a transcript T, ask the model whether it reward hacked in T and assign T a low reward if so.
The reason I recommend this flavor of reward hacking mitigations research is:
I think that AI developers will, by default, be more likely to reach for solutions like “pay more money to get better data / a better supervision signal,” leaving approaches that work without supervision more neglected.
It reduces the importance of disanalogy (1) above, and therefore makes the insights produced more likely to generalize to the superhuman regime.
People interested in approaches to solving reward hacking which don’t depend on supervision (and thus might scale to superintelligence) should consider looking at our earlier work on measurement tampering. I don’t particularly recommend the benchmark atm, but the discussion in the blog post might be interesting. This type of setting makes some assumptions about the type and structure of the reward hacking, but I think it will be hard to resolve reward hacking in a principled way without supervision without assumptions similar to these.