The original paper & codebase definitely had KL penalties on the PPO policy. I spent a fair bit of time fiddling with the coefficient and letting the KL go high to see what adversarial ABC music examples the policy found, in the hopes that labeling those would train the reward model better. It didn't seem to work; it would just find similar, only slightly different examples.
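For anyone who hasn't seen it, the penalty in question is the per-sample term R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)) from the fine-tuning paper, where rho is the frozen pretrained model. Roughly something like this sketch (variable and function names are mine for illustration, not from the actual codebase):

```python
import numpy as np

def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Penalize the reward-model score by the policy's divergence from
    the reference model: R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)).

    `logp_policy` and `logp_ref` are per-token log-probs of the sampled
    completion y under the policy and the frozen pretrained model.
    """
    kl_estimate = np.sum(logp_policy - logp_ref)  # log pi(y|x) - log rho(y|x)
    return reward - beta * kl_estimate
```

Shrinking beta (or otherwise letting the KL run high) frees the policy to drift far from the reference model, which is how I went looking for samples that exploit the reward model.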
By “original paper” do you mean Deep RL from Human Preferences or Fine-Tuning Language Models from Human Preferences? The latter did have a KL penalty, but OP linked to the former. I just skimmed the former again and saw no mention of a KL penalty (but I easily could have missed it).
The latter. I didn't notice it was a link to a different paper, but I think my point stands: the better results in this paper relative to the earlier fine-tuning paper can't be due to adding a KL constraint, because the earlier work already had one. It has to be something else they changed, like more/better labels or bigger models.
Yeah, I definitely agree with that; I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I'd guess the increased performance is primarily due to label quality and a larger model.