Excellent work!
I had previously expected that training with KL-regularization would not be equivalent to early stopping in terms of its effects on RM overfitting, so I’m quite interested to see evidence otherwise. Two questions related to this:
In figure 9 of the paper (copied below), it looks like, from the perspective of the proxy RM, training with KL-regularization does not have the same effect as early stopping. This is pretty shocking to me: looking only at the gold RM lines of figure 9, you’d be tempted to guess that RL with a KL penalty learns approximately the same policies as RL with early stopping; but in fact these different training processes learn different policies, which the gold RM merely happens to be indifferent between! Do you have any thoughts about what’s going on here? (I’ll note this seems to suggest that the equivalence between RL with a KL penalty and early stopping should break down in the limit of proxy RM size and data: if the proxy RM is very large and trained on a lot of data, it should closely approximate the gold RM, in which case better KL-efficiency for the proxy RM should imply better KL-efficiency for the gold RM as well. But of course, in practice the more realistic assumption is that the proxy RM is much smaller than the gold RM.) For concreteness, I spell out the comparison I have in mind just after these two questions.
You mention some evidence that your results on the effects of KL penalties are especially sensitive to hyperparameters—would you be willing to go into more detail?
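By “RL with KL penalty” I mean, roughly and in my own notation rather than the paper’s, optimizing

\[
\pi_\beta \;=\; \arg\max_{\pi}\; \mathbb{E}_{x \sim \pi}\!\left[R_{\text{proxy}}(x)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\text{init}}\right),
\]

and by “early stopping” I mean the unpenalized run (\(\beta = 0\)) truncated at the first step \(t\) where \(D_{\mathrm{KL}}(\pi_t \,\|\, \pi_{\text{init}})\) matches the KL reached by the penalized run. The question is whether these two procedures end up with the same proxy and gold RM scores at that matched KL; figure 9 seems to say yes for the gold RM but no for the proxy RM.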
Thanks!
On the first question: I don’t really have a great answer to this. My guess is that it’s related to the fact that the higher-penalty runs take a bunch more steps to get to the same KL, and those extra steps do something weird. Also, it’s possible that, rather than the gold RM scores coming to look more like the proxy RM scores as the proxy RM gets more params/data, the changed frontier is solely due to some kind of additional exploitation in those extra steps, and evaporates once the RMs become sufficiently good.
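To be concrete about what “get to the same KL” means: here’s a rough sketch of a matched-KL comparison between a penalized run and an early-stopped unpenalized run. The array names and structure are hypothetical (not taken from any actual training code), and it assumes the KL from the initial policy grows roughly monotonically over training.

```python
import numpy as np

def matched_kl_comparison(penalized_run, unpenalized_run, kl_targets):
    """Compare a KL-penalized run against an early-stopped unpenalized run
    at matched KL-from-init.

    Each run is a dict of 1-D numpy arrays indexed by training step:
      'kl'    -- KL(pi_t || pi_init), assumed roughly monotonically increasing
      'proxy' -- proxy RM score at step t
      'gold'  -- gold RM score at step t
    """
    rows = []
    for kl in kl_targets:
        row = {"kl_target": float(kl)}
        for name, run in [("penalized", penalized_run), ("early_stopped", unpenalized_run)]:
            # First step whose KL reaches the target (relies on 'kl' being sorted).
            idx = int(np.searchsorted(run["kl"], kl))
            if idx >= len(run["kl"]):
                row[name] = None  # this run never reached the target KL
                continue
            row[name] = {
                "steps_to_reach_kl": idx,
                "proxy_score": float(run["proxy"][idx]),
                "gold_score": float(run["gold"][idx]),
            }
        rows.append(row)
    return rows
```

The point is that "steps_to_reach_kl" is typically much larger for the penalized run, so even where the gold scores agree at matched KL, the penalized run has taken many more optimization steps to get there.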
On the hyperparameter sensitivity: mostly some evidence that other hyperparameter settings don’t lead to this behaviour, but also having observed this behaviour replicate in other environments.