Seems like at some point we’ll need to train on outputs too complex for humans to evaluate, and then we’ll end up using training methods based on outcomes in some simulation.
I agree. Personally, my main takeaway is that it’s unwise to extrapolate alignment dynamics from the empirical results of current methods. But that’s a somewhat different line of argument, which I made in “Where do you get your capabilities from?”.