We are just observing that the gold RM score curves in Figure 9 overlap. In other words, the KL penalty did not affect the relationship between KL and gold RM score in this experiment, meaning that any point on the Pareto frontier could be reached using only early stopping, without the KL penalty. As mentioned, though, we’ve observed this result to be sensitive to hyperparameters, so we are less confident in it than in other results in the paper.
I don’t have this data to hand, unfortunately.
I don’t have this data to hand, but entropy typically falls roughly linearly over the course of training, sometimes slightly faster towards the start, and typically moving around more than KL. So I’d expect the graph to look somewhat similar, but for it to be noisier and for the functional form to not fit as well.
Got it, thanks!