Signal boosted! This seems significantly more plausible as a path to robust alignment than trying to constrain a fundamentally unaligned model using something like RLHF.
I agree. The challenge of getting RL to do what you want, rather than some reward hack it came up with, gets replaced with building good classifiers for human-created content: not a trivial problem, but a less challenging, less adversarial, and better understood one.