By “original paper” do you mean Deep RL from Human Preferences or Fine-Tuning Language Models from Human Preferences? The latter did have a KL penalty, but OP linked to the former. I just skimmed the former again and saw no mention of a KL penalty (but I easily could have missed it).
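For what it’s worth, the KL term in the fine-tuning paper enters as reward shaping: the RL reward is the preference model’s score minus a coefficient times the log-ratio between the fine-tuned policy and the original LM. Here’s a minimal sketch of that shaping (the function name is mine, and I use a fixed beta where the paper adapts it online to hit a target KL):

```python
def kl_shaped_reward(preference_reward: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float) -> float:
    """KL-shaped reward in the style of Fine-Tuning Language Models from
    Human Preferences: R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)).

    logp_policy and logp_ref are the log-probabilities of the sampled
    continuation under the fine-tuned policy pi and the original LM rho;
    their difference is a single-sample estimate of the KL term, which
    penalizes the policy for drifting away from the original model.
    """
    return preference_reward - beta * (logp_policy - logp_ref)
```

With beta = 0 this collapses to pure reward maximization, which is exactly the regime where the policy can game the preference model, so both papers keeping some version of this penalty makes sense.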
The latter. I didn’t notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous fine-tuning paper can’t be due to adding the KL constraint, because that paper already had one. It has to be something else they changed, like more or better labels or bigger models.
Yeah, I definitely agree with that; I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I’d guess the increased performance is primarily due to label quality and the larger model.