As far as I understand, in RLHF, PPO/DPO don't train directly on preferences from human raters, but on synthetic preference data generated by a reward model. The reward model, in turn, is trained on preference data from actual human raters. The reward model may misgeneralize from that data, in which case the DPO input could include preferences that humans wouldn't actually give, which might change your conclusion. A rough sketch of what I mean is below.
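To make the pipeline I'm describing concrete, here is a minimal, hypothetical sketch (not anyone's actual implementation): human preference pairs train a reward model, and the reward model then labels fresh response pairs, producing the synthetic preferences that a DPO trainer would consume. All names here (`toy_embed`, `RewardModel`, `make_synthetic_pairs`, the toy data) are my own stand-ins, and the encoder is a deterministic toy rather than a real language model.

```python
# Sketch of: human prefs -> reward model -> RM-labeled ("synthetic") prefs -> DPO input.
# Everything below is illustrative; names and data are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


def toy_embed(text: str, dim: int = 16) -> torch.Tensor:
    """Deterministic stand-in for a real encoder: hash characters into a vector."""
    vec = torch.zeros(dim)
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


class RewardModel(nn.Module):
    """Scores a (prompt, response) pair; trained on human preference pairs."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, prompt: str, response: str) -> torch.Tensor:
        return self.score(toy_embed(prompt + response)).squeeze(-1)


def train_reward_model(rm, human_prefs, epochs=50, lr=1e-2):
    """human_prefs: list of (prompt, chosen, rejected) triples from human raters."""
    opt = torch.optim.Adam(rm.parameters(), lr=lr)
    for _ in range(epochs):
        for prompt, chosen, rejected in human_prefs:
            # Standard pairwise (Bradley-Terry style) objective:
            # maximize log sigmoid(r(chosen) - r(rejected)).
            loss = -F.logsigmoid(rm(prompt, chosen) - rm(prompt, rejected))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return rm


def make_synthetic_pairs(rm, prompts, sample_responses):
    """Label new response pairs with the reward model's preference.
    This is the indirection step: if the RM misgeneralizes, these labels
    can disagree with what human raters would have said."""
    synthetic = []
    for prompt in prompts:
        a, b = sample_responses(prompt)
        if rm(prompt, a) > rm(prompt, b):
            synthetic.append((prompt, a, b))  # (prompt, chosen, rejected)
        else:
            synthetic.append((prompt, b, a))
    return synthetic


if __name__ == "__main__":
    # Human-labeled preferences (toy example).
    human_prefs = [("Explain RLHF.", "A careful, sourced answer.", "lol idk")]
    rm = train_reward_model(RewardModel(), human_prefs)

    def sample_responses(prompt):
        # In practice these would be sampled from the policy being trained.
        return "A confident but wrong answer.", "An honest 'I am not sure'."

    dpo_input = make_synthetic_pairs(rm, ["Explain RLHF."], sample_responses)
    print(dpo_input)  # These RM-labeled pairs are what a DPO trainer would see.
```

The point of the sketch is just that the DPO step only ever sees the reward model's labels, so any misgeneralization by the RM flows straight into the preference data.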
I’d say it adds an extra step of indirection where the causal structure of reality gets “blurred out” by an agent’s judgement, and so a reward model strengthens rather than weakens this dynamic?