What would you say is the main benefit from the RL from Human Feedback research so far? What would have happened if the authors had instead worked on a different project?
I feel like these questions are a little tricky to answer, so instead I’ll attempt to answer the questions “What is the case for RL from human feedback (RLHF) helping with alignment?” and “What have we learned from RLHF research so far?”
What is the case for RLHF helping with alignment?
(The answer will mainly be me repeating the stuff I said in my OP, but at more length.)
The most naive case for RLHF is “you train some RL agent to assist you, giving it positive feedback when it does stuff you like and negative feedback for stuff you don’t like. Eventually it learns to model your preferences well and only does stuff you like.”
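For concreteness, here’s a minimal toy sketch of that naive loop: a reward model gets nudged toward occasional human feedback, and the policy is trained against the learned reward. Everything here (the bandit setting, the update rules, the names) is an illustrative stand-in of my own, not any particular system’s implementation.

```python
# Toy sketch of the naive RLHF loop in a tiny bandit setting (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 5
TRUE_PREFS = rng.normal(size=N_ACTIONS)   # how much the human "really" likes each action

def human_feedback(action):
    """Noisy thumbs-up/thumbs-down from the human (an imperfect proxy for TRUE_PREFS)."""
    return 1.0 if rng.normal(TRUE_PREFS[action], 0.5) > 0 else -1.0

reward_model = np.zeros(N_ACTIONS)    # learned estimate of the human's preferences
policy_logits = np.zeros(N_ACTIONS)   # the agent's action preferences

for step in range(2000):
    probs = np.exp(policy_logits) / np.exp(policy_logits).sum()
    action = rng.choice(N_ACTIONS, p=probs)

    # Occasionally ask the human for feedback and nudge the reward model toward it.
    if step % 10 == 0:
        reward_model[action] += 0.1 * (human_feedback(action) - reward_model[action])

    # Train the policy against the *learned* reward, REINFORCE-style.
    grad = -probs
    grad[action] += 1.0                 # = d/d(logits) of log pi(action)
    policy_logits += 0.05 * reward_model[action] * grad

print("human's true preferences:", TRUE_PREFS)
print("learned reward model:    ", reward_model)
print("action the agent favors: ", policy_logits.argmax())
```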
The immediate objections that come to mind are:
(1) The RL agent is really learning to do stuff that leads you to give it positive feedback (which is an imperfect proxy for “stuff you like”). Won’t this lead to the RL agent manipulating us/replacing us with robots that always report they’re happy/otherwise Goodharting their reward function?
(2) This can only train an RL agent to do tasks that we can evaluate. What about tasks we can’t evaluate? For example, if you tell your RL agent to write a macroeconomic policy proposal, we might not be able to give it feedback on whether its proposal is good or not (because we’re not smart enough to evaluate macroeconomic policy), which sinks the entire RLHF method.
(3) A bunch of other less central concerns that I’ll relegate to footnotes.[1][2][3]
My response to objection (1) is … well at this point I’m really getting into “repeat myself from the OP” territory. Basically, I think this is a valid objection, but
(a) if the RL agent’s reward model is very accurate, it’s not obviously true that the easiest way for it to optimize for its reward is to do deceptive/Goodhart-y stuff; this feels like it should depend on empirical facts like the ones I mentioned in the OP.
(b) even if the naive approach doesn’t work because of this objection, we might be able to do other stuff on top of RLHF (e.g. interpretability, something else we haven’t thought of yet) to penalize Goodhart-y behavior or prevent it from arising in the first place.
The obvious counterargument here is “Look, Sam, you clearly are just not appreciating how much smarter than you a superintelligence will be. Inevitably there will be some way to Goodhart the reward function to get more reward than ‘just do what we want’ would give, and no technique you come up with for penalizing this behavior will stop the AI from finding and exploiting that strategy.” To which I have further responses, but I think I’ll resist going further down the conversational tree.
Objection (2) above is a good one, but seems potentially surmountable to me. Namely, it seems that there might be ways to use AI to improve our ability to evaluate things. The simplest form of this is recursive reward modelling: suppose you want to use RLHF to train an AI to do task X, but task X is difficult/expensive to evaluate; instead you break “evaluate X” into a bunch of easy-to-evaluate subtasks, and train RL agents to help with those; now you’re able to more cheaply evaluate X.
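Here’s a schematic sketch of what I mean, under the assumption that we can decompose “evaluate X” into sub-questions and have (separately trained) helper models answer them. All of the names (`evaluate_hard_task`, `SubEvaluation`, etc.) are hypothetical, chosen just to make the structure visible.

```python
# Schematic sketch of recursive reward modelling (illustrative names only).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SubEvaluation:
    question: str       # an easy-to-evaluate sub-question about the output
    helper_answer: str  # answered by a helper model (itself trained with ordinary RLHF)

def evaluate_hard_task(output: str,
                       decompose: Callable[[str], List[str]],
                       helper: Callable[[str, str], str],
                       human_judge: Callable[[List[SubEvaluation]], float]) -> float:
    """Reward for `output` on a task the human can't evaluate directly."""
    sub_questions = decompose(output)                      # break "evaluate X" into pieces
    sub_evals = [SubEvaluation(q, helper(output, q)) for q in sub_questions]
    return human_judge(sub_evals)                          # cheap for the human to check

# Toy usage: "evaluate a policy proposal" via checkable sub-questions.
reward = evaluate_hard_task(
    output="A long macroeconomic policy proposal...",
    decompose=lambda out: ["Is the spending arithmetic internally consistent?",
                           "Are the cited statistics real?"],
    helper=lambda out, q: "yes",                           # stand-in for a trained helper model
    human_judge=lambda evals: sum(e.helper_answer == "yes" for e in evals) / len(evals),
)
print(reward)   # 1.0 in this toy case
```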
In summary, the story about how RLHF helps with alignment is “if we’re very lucky, naive RLHF might produce aligned agents; if we’re less lucky, RLHF + another alignment technique might still suffice.”
What have we learned from RLHF research so far?
Here’s some stuff that I’m aware of; probably there’s a bunch of takeaways that I’m not aware of yet.
(1) The book-summarization follow-up to Learning to Summarize from Human Feedback didn’t do recursive reward modelling as I described it above, but it did a close cousin: instead of breaking “evaluate X” up into subtasks, it broke the original task X up into a bunch of subtasks that were easier to evaluate. In this case X = “summarize a book” and the subtasks were “summarize small chunks of text,” whose outputs humans can evaluate directly (the first sketch after this list gives the flavor). I’m not sure how to feel about the result: the summaries were merely okay. But if you believe RLHF could be useful as one ingredient in alignment, then further research on whether you could get this to work would seem valuable to me.
(2) Redwood’s original research project used RLHF (at least, I think so[4]) to train an RL agent to generate text completions in which no human was portrayed as being injured. [EDIT: since I wrote this comment Redwood’s report came out. It doesn’t look like they did the RLHF part? Rather it seems like they just did the classifier part, and generated non-injurious text completions by generating a bunch of completions and filtering out the injurious ones (the second sketch after this list shows that generate-and-filter shape).] Their goal was to make the RL agent very rarely (like 10^-30 of the time) generate injurious completions. I heard through the grapevine that they were not able to get such a low error rate, which is some evidence that … something? That modelling the way humans classify things with ML is hard? That distributional shift is a big deal? I’m not sure, but whatever it is, it’s probably weak evidence against the usefulness of RLHF.
(3) On the other hand, some of the original work showed that RLHF seems to have really good sample efficiency, e.g. the agent at the top of this page learned to do a backflip with just 900 bits of human feedback (the third sketch after this list gives a toy version of that preference-comparison setup). That seems good to know, and makes me think that if value learning is going to happen at all, it will happen via RLHF.
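First, to make item (1)’s decomposition concrete, here’s a minimal sketch of summarizing a long text by summarizing chunks a human could evaluate directly. This is my own toy illustration, not the paper’s actual pipeline; `summarize_chunk` is a hypothetical stand-in for a model trained with RLHF on short passages.

```python
# Toy sketch of task decomposition for long-text summarization (illustrative only).
from typing import Callable

def summarize_long_text(text: str,
                        summarize_chunk: Callable[[str], str],
                        chunk_size: int = 2000,
                        max_len: int = 2000) -> str:
    """Recursively compress `text` until it fits in a single chunk."""
    if len(text) <= max_len:
        return summarize_chunk(text)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial_summaries = [summarize_chunk(c) for c in chunks]   # each one is human-evaluable
    return summarize_long_text(" ".join(partial_summaries), summarize_chunk,
                               chunk_size, max_len)

# Toy usage with a trivial "model" that just keeps the first sentence of a chunk.
book = ("Chapter one. " * 300) + ("Chapter two. " * 300)
print(summarize_long_text(book, summarize_chunk=lambda c: c.split(". ")[0] + "."))
```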
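Second, here’s the generate-and-filter shape from item (2)’s EDIT, as I understand it. Again this is a toy sketch with hypothetical stand-ins (`generate`, `injury_classifier`), not Redwood’s actual code.

```python
# Toy sketch of rejection sampling against a safety classifier (illustrative only).
import random
from typing import Callable, Optional

def filtered_completion(prompt: str,
                        generate: Callable[[str], str],
                        injury_classifier: Callable[[str, str], float],
                        threshold: float = 0.01,
                        max_tries: int = 100) -> Optional[str]:
    """Sample completions until one's estimated injury probability is below threshold."""
    for _ in range(max_tries):
        completion = generate(prompt)
        if injury_classifier(prompt, completion) < threshold:
            return completion
    return None  # every sample was rejected; the classifier's error rate bounds how safe this can be

# Toy usage with stand-in functions.
print(filtered_completion(
    prompt="He tripped on the stairs and",
    generate=lambda p: random.choice(["fell hard.", "caught the railing."]),
    injury_classifier=lambda p, c: 0.9 if "fell" in c else 0.001,
))
```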
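Third, a toy version of the preference-comparison setup behind item (3): the human only supplies binary “which of these two was better?” judgments, and a reward model is fit to them with a Bradley-Terry-style loss. The tiny linear setting here is purely illustrative, not the original implementation.

```python
# Toy sketch of learning a reward model from ~900 pairwise human comparisons.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
true_w = rng.normal(size=DIM)                  # hidden "true" reward the human judges by
w_hat = np.zeros(DIM)                          # learned reward-model weights

def segment():
    """A random trajectory segment, represented by a feature vector."""
    return rng.normal(size=DIM)

def human_prefers_first(seg_a, seg_b):
    """One bit of human feedback: which segment looks better?"""
    return true_w @ seg_a > true_w @ seg_b

for _ in range(900):                           # ~900 comparisons, i.e. ~900 bits of feedback
    a, b = segment(), segment()
    label = 1.0 if human_prefers_first(a, b) else 0.0
    # Bradley-Terry: P(a preferred) = sigmoid(r_hat(a) - r_hat(b))
    logit = w_hat @ a - w_hat @ b
    p = 1.0 / (1.0 + np.exp(-logit))
    w_hat += 0.05 * (label - p) * (a - b)      # gradient of the log-likelihood

cosine = (true_w @ w_hat) / (np.linalg.norm(true_w) * np.linalg.norm(w_hat))
print(f"alignment between learned and true reward directions: {cosine:.2f}")
```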
From your original question, it seems like what you really want to know is “how does the usefulness of this research compare to the usefulness of other alignment research?” Probably that largely depends on whether you believe the basic story for how RLHF could be useful (as well as how valuable you think other threads of alignment research are).
[1] Q: When we first turn on the RL agent—when it hasn’t yet received much human feedback and therefore has a very inaccurate model of human preferences—won’t the agent potentially do lots of really bad things? A: Yeah, this seems plausible, but it might not be an insurmountable challenge. For instance, we could pre-train the agent’s reward model from a bunch of training runs controlled by a human operator or a less intelligent RL agent. Or maybe the people who are studying safe exploration will come up with something useful here.
[2] Q: What about robustness to distributional shift? That is, even if our RL agent learns a good model of human preferences under ordinary circumstances, its model might be trash once things start to get weird, e.g. once we start colonizing space. A: One thing about RLHF is that you generally shouldn’t take the reward model offline, i.e. you should always continue giving the RL agent some amount of feedback on which the reward model continuously trains (see the sketch after these footnotes). So maybe if things get continuously weirder, our RL agents’ model of human preferences will continuously adapt and we’ll be fine? Otherwise, I mainly want to ignore robustness to distributional shift because it’s an issue shared by all potential outer alignment solutions that I know of. No matter what approach to alignment you take, you need to hope that either someone else solves this issue or that it ends up not being a big deal for some reason.
[3] Q: What about mesa-optimizers? A: Like in footnote 2, this is an issue for every potential alignment solution, and I’m mainly hoping that either someone solves it or it ends up not being a big deal.
[4] Their write-up of the project, consisting of step 1 (train a classifier for text that portrays injury to humans) and step 2 (use the classifier to get an RL agent to generate non-injurious text completions), makes it sound like they stop training the classifier once they start training the RL agent. This is like doing RLHF where you take the reward model offline, which on my understanding tends to produce bad results. So I’m guessing that actually they never took the classifier offline, in which case what they did is just vanilla RLHF.
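To illustrate the “never take the reward model offline” point from footnote [2]: the idea is that a trickle of fresh human labels keeps updating the reward model during deployment, so it has a chance of tracking distributional shift. A minimal sketch, with all names being illustrative stand-ins of my own:

```python
# Toy sketch of an online (never-frozen) reward model (illustrative only).
class OnlineRewardModel:
    """Keeps a running estimate of human approval for each kind of situation."""
    def __init__(self, lr: float = 0.1):
        self.estimates = {}          # situation -> estimated human approval
        self.lr = lr

    def reward(self, situation: str) -> float:
        return self.estimates.get(situation, 0.0)

    def update(self, situation: str, human_label: float) -> None:
        old = self.estimates.get(situation, 0.0)
        self.estimates[situation] = old + self.lr * (human_label - old)

rm = OnlineRewardModel()
for step in range(1000):
    # The world drifts: after step 500 the agent starts seeing novel situations.
    situation = "familiar" if step < 500 else "weird-new-regime"
    if step % 20 == 0:                            # occasional fresh human feedback
        human_label = 1.0 if situation == "familiar" else -1.0
        rm.update(situation, human_label)
    _ = rm.reward(situation)                      # used by the (omitted) policy update

print(rm.estimates)
```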
Thanks for the detailed answer; I’m sheepish to have prompted so much effort on your part!
I guess what I was and am thinking was something like “Of course we’ll be using human feedback in our reward signal. Big AI companies will do this by default. Obviously they’ll train it to do what they want it to do and not what they don’t want it to do. The reason we are worried about AI risk is because we think that this won’t be enough.”
To which someone might respond “But still it’s good to practice doing it now. The experience might come in handy later when we are trying to align really powerful systems.”
To which I might respond “OK, but I feel like it’s a better use of our limited research time to try to anticipate ways in which RL from human feedback could turn out to be insufficient and then do research aimed at overcoming those ways. E.g. think about inner alignment problems, think about it possibly learning to do what makes us give positive feedback rather than what we actually want, etc. Let the capabilities researchers figure out how to do RL from human feedback, since they need to figure that out anyway on the path to deploying the products they are building. Safety researchers should focus on solving the problems that we anticipate RLHF doesn’t solve by itself.”
I don’t actually think this, because I haven’t thought about this much, so I’m uncertain and mostly deferring to others’ judgment. But I’d be interested to hear your thoughts! (You’ve written so much already, no need to actually reply)
Ah cool, I see—your concern is that RLHF is perhaps better left to the capabilities people, freeing up AI safety researchers to work on more neglected approaches.
That seems right to me, and I agree with it as a general heuristic! Some caveats:
I’m a random person who’s been learning a lot about this stuff lately, definitely not an active researcher. So my opinions about heuristics for what to work on probably aren’t worth much.
If you think RLHF research could be very impactful for alignment, that could make up for it being less neglected than other areas.
Distinctive approaches to RLHF (like Redwood’s attempts to get their reward model’s error extremely low) might be the sorts of things that capabilities people wouldn’t try.
Finally, as a historical note, it’s hard to believe that a decade ago the state of alignment was like “holy shit, how could we possibly hard-code human values into a reward function? This is gonna be impossible.” The fact that now we’re like “obviously the big AI labs will, by default, build their AGIs with something like RLHF” is progress! And Paul’s comment elsethread is heartwarming—it implies that AI safety researchers helped accelerate the adoption of this safer-looking paradigm. In other words, if you believe RLHF helps improve our odds, then contra some recent pessimistic takes, you believe that progress is being made :)