Do you have any thoughts on the safety implications of reinforcement learning from human feedback (RLHF)? The HFDT failure mode discussed here seems closely related to the kind of human feedback training that Paul and others have worked on at OpenAI, Anthropic, and elsewhere. Some have criticized this line of research as only teaching brittle, task-specific preferences in a manner that's open to deception, thereby advancing us toward more dangerous capabilities. If we achieve transformative AI within the next decade, it seems plausible that large language models and RLHF will play an important role in those systems, so why do safety-minded folks work on it?
According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):
1. Human feedback is better than even-worse alternatives, such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc.) and waiting for it to get smart enough to generalize well enough to be helpful and follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback (see the sketch after this list for what that shift looks like mechanically).
2. You need human feedback working pretty well before you can start testing many other alignment strategies, such as debate, recursive reward modeling, and training-for-interpretability, which tend to build on a foundation of human feedback.
3. Human feedback provides a more realistic baseline to compare other strategies to: you want to be able to tell clearly whether your alignment scheme actually works better than human feedback.
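To make the contrast in reason 1 concrete, here is a minimal sketch of the step that distinguishes human feedback from a fully automated reward: fitting a reward model to pairwise human preference labels (the standard Bradley-Terry objective), which the RL stage then optimizes instead of a hand-coded signal. The names `reward_model`, `pair_loader`, and the helper functions are assumptions for illustration, not any lab's actual code.

```python
import torch
import torch.nn.functional as F

# Sketch of reward-model training from pairwise human preferences.
# Assumptions: `reward_model` is an nn.Module mapping a batch of token ids to
# one scalar score per sequence, and `pair_loader` yields (chosen, rejected)
# pairs where human labelers preferred `chosen`.

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry loss: push r(chosen) above r(rejected)."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def train_reward_model(reward_model, pair_loader, epochs=1, lr=1e-5):
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for _ in range(epochs):
        for chosen_ids, rejected_ids in pair_loader:
            loss = preference_loss(reward_model, chosen_ids, rejected_ids)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return reward_model
```

The point of the sketch is simply that the reward here comes from human judgments rather than from an automatic, easily-computed signal like next-token log-likelihood, and that debate, recursive reward modeling, etc. (reason 2) are layered on top of this same preference-learning machinery.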
With that said, my guess is that on the current margin, people focused on safety shouldn't be spending much more time refining pure human feedback (and the ML alignment practitioners I've talked to largely agree; e.g., the OpenAI safety team recently released this critiques work, one step in the direction of debate).
By “refining pure human feedback”, do you mean refining RLHF ML techniques?
I assume you still view enhancing human feedback as valuable? And also, more straightforwardly, just increasing the quality of the best human feedback?
I mean things like tricks to improve the sample efficiency of human feedback, doing more projects that use un-enhanced RLHF in order to learn how un-enhanced RLHF works in practice, etc.
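As one purely hypothetical illustration of a "sample efficiency" trick (not a reference to any specific project): actively choosing which comparison pairs to send to human labelers by looking at where a small ensemble of reward models disagrees most, so that scarce human labels go where they are most informative. The function and variable names below are made up for the sketch.

```python
import torch

# Hypothetical sketch: active selection of comparison pairs for human labeling.
# `ensemble` is a list of reward models mapping token ids -> scalar score;
# `candidate_pairs` is a list of (ids_a, ids_b) tensors. We rank pairs by how
# much the ensemble disagrees about which completion is better and return the
# `budget` most contested pairs to show to human labelers first.

def select_pairs_for_labeling(ensemble, candidate_pairs, budget):
    disagreement = []
    with torch.no_grad():
        for ids_a, ids_b in candidate_pairs:
            # Each model's preference margin; variance across models
            # measures how much they disagree on this pair.
            margins = torch.stack([m(ids_a) - m(ids_b) for m in ensemble])
            disagreement.append(margins.var().item())
    ranked = sorted(range(len(candidate_pairs)),
                    key=lambda i: disagreement[i], reverse=True)
    return [candidate_pairs[i] for i in ranked[:budget]]
```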