Tomek Korbak comments on Pretraining Language Models with Human Preferences

Tomek Korbak 27 Mar 2023 17:13 UTC
1 point
In practice, I think using a trained reward model (as in RLHF), rather than fixed labels, is the way forward. The cost of acquiring the reward model is then the same as in RLHF; the main difference is that PHF typically needs many more calls to the reward model than RLHF does.
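
To make the cost point concrete, here is a minimal sketch of where those calls come from, assuming a conditional-training setup in which every segment of the pretraining corpus is scored once and tagged with a control token. The `toy_reward` function, the `CountingReward` wrapper, the token strings, and the threshold are all hypothetical stand-ins for illustration, not the actual PHF code.

```python
from typing import Callable, List

# Illustrative control tokens; the exact strings are an assumption.
GOOD, BAD = "<|good|>", "<|bad|>"


def toy_reward(segment: str) -> float:
    """Hypothetical stand-in for a trained reward model."""
    return -1.0 if "darn" in segment.lower() else 1.0


class CountingReward:
    """Wraps a reward function and counts how often it is called."""

    def __init__(self, fn: Callable[[str], float]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, segment: str) -> float:
        self.calls += 1
        return self.fn(segment)


def annotate(
    segments: List[str],
    reward: Callable[[str], float],
    threshold: float = 0.0,  # assumed cutoff between "good" and "bad"
) -> List[str]:
    """Prefix each segment with a control token chosen by its reward score."""
    return [(GOOD if reward(s) >= threshold else BAD) + s for s in segments]


corpus = ["The weather is nice.", "Darn, it broke again.", "Here is the fix."]
rm = CountingReward(toy_reward)
annotated = annotate(corpus, rm)
```

In this sketch the reward model is called once per segment of the corpus, so the number of calls grows with the size of the pretraining data rather than with the number of samples drawn during finetuning.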