RLHF with humans might also soon be obsoleted by methods like online DPO, where another chatbot produces preference data for on-policy responses of the tuned model, and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even the weak-to-strong setting won't necessarily be important: tuning of a stronger model gets bootstrapped by a weaker one (currently SFT on an obviously off-policy instruct dataset seems to suffice), and then the stronger model re-does the tuning of its equally strong successor, which starts from the same base model (as in the self-rewarding paper), using some labeling instructions (a "constitution"). All that then remains of human oversight that actually contributes to the outcome is the labeling instructions written in English, and possibly some feedback on them from spot-checking what happens as a result of choosing particular instructions.
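A minimal sketch of what such a loop might look like, under loose assumptions: `sample_response` and `judge_prefers` are hypothetical stand-ins for the tuned model's on-policy sampler and the annotator chatbot applying its labeling instructions (the placeholder judging rule here is arbitrary, not a real constitution), and only the standard DPO loss itself is taken from the literature.

```python
import math
import random

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair of response log-probs:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid(margin))

# --- hypothetical stand-ins, not any real API ---
def sample_response(prompt):
    # The tuned model's on-policy sample (placeholder string).
    return f"response-to-{prompt}-{random.random():.3f}"

def judge_prefers(prompt, a, b):
    # Annotator chatbot deciding a preference from its labeling instructions;
    # here an arbitrary placeholder rule (prefer the shorter response).
    return a if len(a) <= len(b) else b

def online_dpo_step(prompt, logp_fn, ref_logp_fn, beta=0.1):
    """One online step: sample two on-policy responses, have the judge
    label the pair, and score it with the DPO loss (no reward model)."""
    a, b = sample_response(prompt), sample_response(prompt)
    w = judge_prefers(prompt, a, b)
    l = b if w is a else a
    return dpo_loss(logp_fn(prompt, w), logp_fn(prompt, l),
                    ref_logp_fn(prompt, w), ref_logp_fn(prompt, l), beta)
```

Note the only place a human enters this loop is upstream, in whatever instructions shaped `judge_prefers` — which is the point of the paragraph above.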