This was a nice post! I appreciate the effort you’re making to get your inside view out there.
A correction:
The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is also known as Inverse Reinforcement Learning.
Based on this sentence, you might be conflating value learning (the broad class of approaches to outer alignment that involve learning reward models) with IRL, which is the particular sub-type of value learning in which the ML model tries to infer a reward function by observing the behavior of some agent whose behavior is assumed (approximately) optimal for said reward function. So, for example, IRL includes learning how to fly a helicopter by watching an expert, but not the approach used in “Learning to summarize from human feedback,” in which a reward model was trained via supervised learning from pairwise comparisons.
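To make the contrast concrete, here is a rough sketch of the comparison-based setup in PyTorch, with made-up feature encodings and dimensions standing in for the real pipeline (the actual paper fine-tunes a language-model backbone, which I’ve abstracted away). The training signal is a Bradley-Terry-style loss on pairs of completions; nothing here assumes a human demonstrated the task optimally:

```python
# A rough sketch (not the paper's actual code) of reward-model training from
# pairwise human comparisons, as in "Learning to summarize from human feedback".
# The loss is Bradley-Terry-style: push the preferred completion's reward above
# the rejected one's. Feature encodings and sizes here are placeholders.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an encoded (task, completion) pair to a scalar reward."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def pairwise_loss(model, preferred, rejected):
    """-log sigmoid(r(preferred) - r(rejected)), averaged over the batch."""
    return -nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

preferred = torch.randn(32, 128)  # encodings of completions the labeler preferred
rejected = torch.randn(32, 128)   # encodings of the completions they rejected

loss = pairwise_loss(model, preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The only human input here is the pairwise judgment of completions; no demonstration of the task, optimal or otherwise, ever enters the training signal, which is the difference from IRL I’m pointing at.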
Relatedly, I’ll note that much (though not all) of the skepticism about value learning linked in the “Outer alignment concerns” section is IRL-specific. In more detail, many of the linked posts revolve around the IRL-specific issue of “How do you correct for your ‘expert demonstration’ actually being performed by a suboptimal human?[1]” But this concern doesn’t seem to apply to all types of value learning; for example, RLHF doesn’t require humans to be approximately optimal at the task, only that we are able to judge completions of the task. (That said, I haven’t read the “Value Learning” sequence in detail, so it’s possible I’m misunderstanding and they actually explain how this concern generalizes to all value learning approaches?[2])
Unrelated to the point about IRL, my inside view agrees with yours that an important next step in RLHF is making it possible for humans to give richer feedback, e.g. natural language feedback, trajectory corrections, etc. I, too, was excited by the Reward-rational choice paper (if not the particular formalism proposed there, then the general thrust that we should have a framework for giving lots of different types of feedback to our AI systems). Conversely, my inside view finds CIRL less promising than yours does.
Human irrationality is one example of suboptimal human behavior, but there are others too. For example, a perfect IRL agent watching a human playing a video game in which perfect play requires super-human reflexes would infer that the human wanted to react slowly. So suboptimal behavior is an obstruction to both correctly inferring human values and producing a super-human agent via IRL.
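To make this footnote concrete, here is a toy Bayesian-IRL-style calculation (all numbers, names, and the Boltzmann-rationality model are made up for illustration) showing how an optimality assumption converts a physical limitation into an inferred preference:

```python
# Toy illustration of the footnote: an IRL procedure that assumes the
# demonstrator is (Boltzmann-)rational reads a physically forced slow reaction
# as evidence that the human *wants* to react slowly.
import numpy as np

actions = ["react_fast", "react_slow"]
# Two competing reward hypotheses over the two actions.
reward_hypotheses = {
    "wants_fast": np.array([1.0, 0.0]),
    "wants_slow": np.array([0.0, 1.0]),
}
beta = 5.0  # assumed rationality; higher = "demonstrator is nearly optimal"

def boltzmann_likelihood(action_idx, rewards):
    """P(action | rewards) under the assumed noisily-rational demonstrator."""
    probs = np.exp(beta * rewards) / np.exp(beta * rewards).sum()
    return probs[action_idx]

# The human's reflexes only ever produce "react_slow" (index 1).
observed_actions = [1] * 10

unnormalized = {
    name: np.prod([boltzmann_likelihood(a, rewards) for a in observed_actions])
    for name, rewards in reward_hypotheses.items()
}  # uniform prior over the two hypotheses
total = sum(unnormalized.values())
posterior = {name: p / total for name, p in unnormalized.items()}
print(posterior)  # essentially all mass on "wants_slow"
```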
According to me, the generalized version of this concern is “How do you get an agent whose reward function was learned via some value learning approach to have super-human performance?” The avatar of this for RLHF is scalable oversight, as you address in your post.
Thanks for the feedback and corrections! You’re right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you’re also right that most of the “Outer alignment concerns” section doesn’t really apply to RLHF as it’s currently written, or at least it’s not immediately clear how it does. Here’s another attempt:
RLHF attempts to infer a reward function from human comparisons of task completions. But it’s possible that a reward function learned from these stated preferences might not be the “actual” reward function. Even if we could perfectly predict the human preference ordering on the training set of task completions, it’s hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational: they could be intransitive or cyclical, for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.

How’s that for a start?
(I should clarify that I’m not an expert. In fact, you might even call me “an amateur who’s just learning about this stuff myself”! That said...)
RLHF attempts to infer a reward function from human comparisons of task completions.
I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think RLHF would include learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather than just pairwise comparisons of summaries.
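A minimal sketch of that scalar-feedback variant (with hypothetical names, and random tensors standing in for encoded summaries) would just be supervised regression onto the ratings:

```python
# Minimal sketch, assuming scalar 1-10 ratings instead of pairwise comparisons:
# the reward model is fit by ordinary supervised regression on human scores.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

summary_features = torch.randn(32, 128)                 # stand-in for encoded summaries
human_ratings = torch.randint(1, 11, (32, 1)).float()   # 1-10 scores from labelers

loss = nn.functional.mse_loss(reward_model(summary_features), human_ratings)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```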
On the topic of whether human biases present an issue for RLHF, I think it might be somewhat subtle. To tease apart a few different concerns you might have:
What if human preferences aren’t representable by a utility function (e.g. because they’re intransitive)? This doesn’t seem like an essential obstruction to RLHF, since whatever sort of data type human preferences are (e.g. mappings from triples (history of the world state, option 1, option 2) to {0,1}), I would expect them to still be learnable in principle via supervised learning; there’s a rough sketch of what I mean below this list. Of course, the more unconstrained our assumptions on human preferences, the harder it is to learn them (utility functions are harder to learn than mappings like the above), so we might run into practical issues. But I guess I don’t strongly expect that to happen: I feel like human preferences shouldn’t be so unconstrained as to sink RLHF.
What if RLHF can never learn our true values because the feedback we give it is biased? In this case, I would expect RLHF to learn “biased human values,” which … I guess I’m okay with? Like, if we get an AI which is aligned with human values as revealed by our stated preferences, rather than with the reflective equilibrium of human values we’d reach after correcting for our biases, I still expect that to keep us safe and buy us time to figure out our true values / build an aligned AI that can figure out our values for us. So if this is the biggest issue with RLHF, then I feel like we’ve averted the worst outcomes.
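Here’s the rough sketch I promised for the first concern: instead of assuming preferences factor through a scalar utility, you can learn the comparison mapping directly as a classifier over (state, option A, option B). Because the model scores the pair jointly rather than scoring each option separately, it can in principle represent intransitive preferences that no utility function could. (Everything here, names, dimensions, architecture, is an illustrative assumption, not a claim about how any existing RLHF system works.)

```python
# Sketch: learn the preference mapping (state, option A, option B) -> {0, 1}
# directly, with no underlying scalar utility assumed.
import torch
import torch.nn as nn

class PreferenceClassifier(nn.Module):
    def __init__(self, state_dim: int = 16, option_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * option_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, option_a, option_b):
        x = torch.cat([state, option_a, option_b], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)  # P(A preferred over B | state)

model = PreferenceClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

state = torch.randn(32, 16)
option_a, option_b = torch.randn(32, 8), torch.randn(32, 8)
labels = torch.randint(0, 2, (32,)).float()  # human choices, possibly intransitive

loss = nn.functional.binary_cross_entropy(model(state, option_a, option_b), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The trade-off, as in the bullet above, is that a more unconstrained object like this is harder to learn than a scalar reward model.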
It’s possible I’ve misunderstood the central concern about how RLHF interacts with human irrationality, so feel free to say if there’s a consideration I’ve missed!
What if human preferences aren’t representable by a utility function
I’m responding to this specifically, rather than the question of RLHF and ‘human irrationality’.
I’m not saying this is the case, but what if ‘human preferences’ are only representable by something more complicated? Perhaps an array or a vector? Can RLHF learn something like that?