Thanks for the feedback and corrections! You’re right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you’re also right that most of the “Outer alignment concerns” section doesn’t really apply to RLHF as it’s currently written, or at least it’s not immediately clear how it does. Here’s another attempt:
RLHF attempts to infer a reward function from human comparisons of task completions. But it’s possible that a reward function learned from these stated preferences might not be the “actual” reward function—even if we could perfectly predict the human preference ordering on the training set of task completions, it’s hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational: they could be intransitive or cyclical, for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.

How’s that for a start?
(I should clarify that I’m not an expert. In fact, you might even call me “an amateur who’s just learning about this stuff myself”! That said...)
RLHF attempts to infer a reward function from human comparisons of task completions.
I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think RLHF would include learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather than just pairwise comparisons of summaries.
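To make the “it’s all supervised learning of a reward model” point concrete, here’s a toy sketch of my own (not taken from any particular RLHF implementation). The feature vectors are random stand-ins for real encodings of summaries; the point is just that pairwise comparisons and scalar 1-10 ratings feed the same kind of reward model, only with different losses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for an encoder: each "summary" is just a random feature vector here.
FEATURE_DIM = 32
reward_model = nn.Sequential(nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def pairwise_loss(preferred, rejected):
    """Loss for pairwise comparisons: push r(preferred) above r(rejected)."""
    margin = reward_model(preferred) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()

def scalar_loss(summaries, ratings):
    """Regression loss for scalar 1-10 feedback."""
    return F.mse_loss(reward_model(summaries).squeeze(-1), ratings)

# Fake labelled data; in real RLHF this would come from human labellers.
preferred, rejected = torch.randn(16, FEATURE_DIM), torch.randn(16, FEATURE_DIM)
rated = torch.randn(16, FEATURE_DIM)
ratings = torch.randint(1, 11, (16,)).float()

for _ in range(100):
    optimizer.zero_grad()
    loss = pairwise_loss(preferred, rejected) + scalar_loss(rated, ratings)
    loss.backward()
    optimizer.step()
```

Either way the output is the same kind of object, a learned reward model, which (as I understand it) is what then gets optimized against in the RL step.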
On the topic of whether human biases present an issue for RLHF, I think it might be somewhat subtle. To tease apart a few different concerns you might have:
What if human preferences aren’t representable by a utility function (e.g. because they’re intransitive)? This doesn’t seem like an essential obstruction to RLHF, since whatever sort of data type human preferences are (e.g. mappings from triples (history of the world state, option 1, option 2) to {0,1}), I would expect them to still be learnable in principle via supervised learning; there’s a toy sketch of what I mean just after these two concerns. Of course, the more unconstrained our assumptions on human preferences, the harder it is to learn them (an arbitrary mapping like the above is harder to learn than a utility function), so we might run into practical issues. But I guess I don’t strongly expect that to happen—I feel like human preferences shouldn’t be so unconstrained as to sink RLHF.
What if RLHF can never learn our true values because the feedback we give it is biased? In this case, I would expect RLHF to learn “biased human values” which … I guess I’m okay with? Like if we get an AI which is aligned with human values as revealed by stated preferences instead of the reflective equilibrium of human values that we get after correcting for our biases, I still expect that to keep us safe and buy us time to figure out our true values/build an aligned AI that can figure out our values for us. So if this is the biggest issue with RLHF then I feel like we’ve averted the worst outcomes.
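Here’s the toy sketch I promised under the first concern: a model of my own invention, with placeholder dimensions and fake data, that learns human preferences directly as a mapping from (history of the world state, option 1, option 2) to {0,1}, with no utility function anywhere, so intransitive or cyclic preferences are representable:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder sizes; in reality these would be embeddings of the world-state
# history and of the two options being compared.
HISTORY_DIM, OPTION_DIM = 16, 8

# Model the comparison directly: (history, option 1, option 2) -> probability
# that option 1 is preferred. Nothing here forces transitivity, unlike a model
# that first assigns each option a scalar utility and then compares utilities.
preference_model = nn.Sequential(
    nn.Linear(HISTORY_DIM + 2 * OPTION_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def prob_first_preferred(history, option_a, option_b):
    return torch.sigmoid(preference_model(torch.cat([history, option_a, option_b], dim=-1)))

# One step of ordinary supervised learning on fake human choices.
history = torch.randn(16, HISTORY_DIM)
option_a, option_b = torch.randn(16, OPTION_DIM), torch.randn(16, OPTION_DIM)
chose_a = torch.randint(0, 2, (16, 1)).float()  # which option the human picked
loss = F.binary_cross_entropy(prob_first_preferred(history, option_a, option_b), chose_a)
loss.backward()
```

Nothing in this setup requires the learned preferences to be consistent with any utility function; a cyclic preference pattern is just another pattern in the labels, which is why I don’t see intransitivity as a fundamental obstacle (though it might be a practical one, as above).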
It’s possible I’ve misunderstood the central concern about how RLHF interacts with human irrationality, so feel free to say if there’s a consideration I’ve missed!
What if human preferences aren’t representable by a utility function
I’m responding to this specifically, rather than the question of RLHF and ‘human irrationality’.
I’m not saying this is the case, but what if ‘human preferences’ are only representable by something more complicated? Perhaps an array or vector? Can it learn something like that?