A lot of historical work on alignment seems to address subsets of the problems solved by RLHF, but doesn't actually address the important ways in which RLHF fails. In particular, much of that work is only necessary if RLHF is prohibitively sample-inefficient.
Do you have examples of such historical work that you’re happy to name? I’m really unsure what you’re referring to (probably just because I haven’t been involved in alignment for long enough).
I think a lot of work on IRL and similar techniques has this issue: it's mostly designed to learn from indirect forms of evidence about value, but in many cases the primary upside is data efficiency, and in fact the inferences about preferences are predictably worse than those from RLHF.
(I think you can also do IRL work with a real chance of overcoming limitations of RLHF, but most researchers are not careful about thinking through what should be the central issue.)