Thanks for the detailed answer; I'm sheepish to have prompted so much effort on your part!
I guess what I was and am thinking was something like "Of course we'll be using human feedback in our reward signal. Big AI companies will do this by default. Obviously they'll train it to do what they want it to do and not what they don't want it to do. The reason we are worried about AI risk is that we think this won't be enough."
To which someone might respond “But still it’s good to practice doing it now. The experience might come in handy later when we are trying to align really powerful systems.”
To which I might respond “OK, but I feel like it’s a better use of our limited research time to try to anticipate ways in which RL from human feedback could turn out to be insufficient and then do research aimed at overcoming those ways. E.g. think about inner alignment problems, think about it possibly learning to do what makes us give positive feedback rather than what we actually want, etc. Let the capabilities researchers figure out how to do RL from human feedback, since they need to figure that out anyway on the path to deploying the products they are building. Safety researchers should focus on solving the problems that we anticipate RLHF doesn’t solve by itself.”
I don't actually think this, because I haven't thought about this much, so I'm uncertain and mostly deferring to others' judgment. But I'd be interested to hear your thoughts! (You've written so much already, no need to actually reply.)
Ah cool, I see—your concern is that RLHF is perhaps better left to the capabilities people, freeing up AI safety researchers to work on more neglected approaches.
That seems right to me, and I agree with it as a general heuristic! Some caveats:
I'm a random person who's been learning a lot about this stuff lately, definitely not an active researcher. So my opinions about heuristics for what to work on probably aren't worth much.
If you think RLHF research could be very impactful for alignment, that could make up for it being less neglected than other areas.
Distinctive approaches to RLHF (like Redwood’s attempts to get their reward model’s error extremely low) might be the sorts of things that capabilities people wouldn’t try.
Finally, as a historical note, it's hard to believe that a decade ago the state of alignment was like "holy shit, how could we possibly hard-code human values into a reward function? This is gonna be impossible." The fact that now we're like "obviously big AI companies will, by default, build their AGIs with something like RLHF" is progress! And Paul's comment elsethread is heartwarming—it implies that AI safety researchers helped accelerate the adoption of this safer-looking paradigm. In other words, if you believe RLHF helps improve our odds, then contra some recent pessimistic takes, you believe that progress is being made :)