The latter. I didn’t notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous finetuning paper can’t be due to adding the KL constraint because they already had one. It has to be something else they changed, like more/better labels or bigger models.
Yeah, I definitely agree with that, I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I’d guess the increased performance is primarily due to label quality and a larger model.