That’s a good point. But if you’re using a distilled, inference-bandwidth-optimised RM, annotating your training data might require only a fraction of the compute needed for pretraining.
Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): feedback annotations can be reused across experiments. If you already have an annotated dataset, running a hyperparameter sweep on it is as cheap as standard pretraining, and in contrast with RLHF you don’t need to recompute rewards.
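To make the amortization point concrete, here is a minimal sketch, assuming a HuggingFace-style distilled reward model; the checkpoint name, file paths, and batch size are placeholders, not details from the discussion. The corpus is scored once and the rewards are cached to disk, so later PHF runs and hyperparameter sweeps reload them instead of querying the RM again:

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder identifiers -- substitute your own distilled RM and corpus.
RM_NAME = "my-org/distilled-reward-model"  # hypothetical checkpoint
CORPUS_PATH = "pretraining_corpus.jsonl"   # one {"text": ...} object per line
SCORES_PATH = "reward_annotations.jsonl"   # cached scores, reused across runs

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
reward_model.eval()


@torch.no_grad()
def score(texts):
    """Score a batch of documents with the distilled reward model."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Assumes a single-output (num_labels=1) reward head.
    return reward_model(**inputs).logits.squeeze(-1).tolist()


def flush(batch, dst):
    for text, reward in zip(batch, score(batch)):
        dst.write(json.dumps({"text": text, "reward": reward}) + "\n")


# One-off annotation pass: its cost is paid once and amortized over every
# later training run or hyperparameter sweep that reads SCORES_PATH.
with open(CORPUS_PATH) as src, open(SCORES_PATH, "w") as dst:
    batch = []
    for line in src:
        batch.append(json.loads(line)["text"])
        if len(batch) == 32:
            flush(batch, dst)
            batch = []
    if batch:
        flush(batch, dst)
```

Any subsequent training run just streams the cached annotations; unlike RLHF, no reward-model calls happen during training.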