tailcalled comments on DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

tailcalled 11 Jun 2024 8:37 UTC
3 points
0
I’d say it adds an extra step of indirection, where the causal structure of reality gets “blurred out” by an agent’s judgement, so a reward model strengthens rather than weakens this dynamic?