“Early stopping on a separate stopping criterion which we don’t run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models using different samples from the same distribution and use one for training, the other for early stopping. (This is essentially the train / val split used in typical ML settings with data constraints.) You could measure the best ways to maximize gold reward on a limited data budget. Alternatively your early stopping RM could be trained on a gold RM containing distribution shift, data augmentations, adversarial perturbations, or a challenge set of particularly challenging cases.
Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?
“Early stopping on a separate stopping criterion which we don’t run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models using different samples from the same distribution and use one for training, the other for early stopping. (This is essentially the train / val split used in typical ML settings with data constraints.) You could measure the best ways to maximize gold reward on a limited data budget. Alternatively your early stopping RM could be trained on a gold RM containing distribution shift, data augmentations, adversarial perturbations, or a challenge set of particularly challenging cases.
Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?