Signal boosted! This seems significantly more plausible as a path to robust alignment than trying to constrain a fundamentally unaligned model using something like RLHF.
I agree. The challenge of getting RL to do what you want, rather than some reward hack it came up with, gets replaced with building good classifiers for human-created content: not a trivial problem, but a less challenging, less adversarial, and better understood one.