Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”?
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is the same scale as the base model, so as both are scaled up, the relative capacity of the policy and the reward model stays fixed.
This also contradicts results from an earlier paper by Leo Gao ("Scaling Laws for Reward Model Overoptimization"), which found that larger reward models are harder, not easier, to overoptimize. I think this paper is considerably more reliable than the comment overall, so I'm inclined to believe the paper, or to think that I'm misunderstanding the comment.
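For reference, that paper fits the gold (true) reward of an RL-trained policy as a function of its KL distance d from the initial policy, using the form R(d) = d(α − β·log d), and finds that β shrinks as the reward model grows. The sketch below evaluates this form with purely illustrative coefficients (not the paper's fitted values) to show the qualitative effect: a larger β, corresponding to a smaller reward model, makes the gold reward peak earlier and then collapse, which is the overoptimization/hacking regime.

```python
import math

def gold_reward_rl(d, alpha, beta):
    """Gao et al.'s fitted functional form for RL: R(d) = d * (alpha - beta * log d),
    where d is the KL distance from the initial policy.
    The coefficients passed in below are illustrative, not fitted values."""
    if d <= 0:
        return 0.0
    return d * (alpha - beta * math.log(d))

kl_distances = [1, 5, 20, 80]

# Illustrative: a small reward model (larger beta) vs. a large one (smaller beta).
small_rm = [gold_reward_rl(d, alpha=1.0, beta=0.3) for d in kl_distances]
large_rm = [gold_reward_rl(d, alpha=1.0, beta=0.1) for d in kl_distances]

# With the smaller RM, gold reward peaks and then goes negative (overoptimization);
# with the larger RM, it is still improving over the same range.
```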
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.