Jacob_Hilton comments on Scaling Laws for Reward Model Overoptimization

Jacob_Hilton 29 Oct 2022 5:01 UTC
LW: 2 AF: 2
1. We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned, though, we’ve found this result to be sensitive to hyperparameters, so we are less confident in it than in the other results in the paper.
2. I don’t have this data to hand, unfortunately.
3. I don’t have this data to hand, but entropy typically falls roughly linearly over the course of training, sometimes slightly faster towards the start, and it typically fluctuates more than KL. So I’d expect the graph to look somewhat similar, but noisier, and I’d expect the functional form to fit less well.
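
To make the early-stopping point in item 1 concrete, here is a minimal toy sketch. It assumes the overlap holds exactly, i.e. that gold RM score depends only on KL, so the conclusion is built into the setup rather than demonstrated. The curve shape borrows the paper's RL functional form, R(d) = d(α − β log d) with d = sqrt(KL), but the coefficients, the linear KL growth, and the helper names (`gold_score`, `simulate_run`, `early_stop`) are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical coefficients, for illustration only (not fitted values).
ALPHA, BETA = 1.0, 0.3


def gold_score(kl):
    """Gold RM score as a function of KL alone: R(d) = d * (ALPHA - BETA * log d), d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (ALPHA - BETA * math.log(d)) if d > 0 else 0.0


def simulate_run(kl_per_step, n_steps):
    """Toy training run: KL grows by a fixed amount per step.

    A larger KL penalty is modeled (as an assumption) purely as slower KL growth.
    Returns a list of (kl, gold_score) checkpoints.
    """
    return [(step * kl_per_step, gold_score(step * kl_per_step))
            for step in range(n_steps + 1)]


def early_stop(run, target_kl):
    """Return the first checkpoint whose KL reaches target_kl, or None if never reached."""
    for kl, score in run:
        if kl >= target_kl:
            return kl, score
    return None


no_penalty = simulate_run(kl_per_step=2.0, n_steps=50)    # reaches KL = 100
with_penalty = simulate_run(kl_per_step=0.5, n_steps=50)  # reaches KL = 25

# Both runs pass through the same (KL, gold score) point; the unpenalized run
# simply gets there in fewer steps, so stopping it early recovers that point.
print(early_stop(no_penalty, 20.0))
print(early_stop(with_penalty, 20.0))
```

In this toy setup, the penalized run ends at a point the unpenalized run already passed through, which is the sense in which early stopping alone covers the same frontier. Real runs would be noisy and, as noted above, the overlap itself is sensitive to hyperparameters.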
- Adam Jermyn 29 Oct 2022 15:46 UTC
  LW: 1 AF: 1
  Got it, thanks!