I don’t see how RLHF could be framed as some kind of advance on the problem of outer alignment.
RLHF solves the “backflip” alignment problem: if you try to hand-write a reward function for an agent doing a backflip, the agent just ends up crashing to the ground. But with RLHF, you obtain a nice backflip.
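(For concreteness, the mechanism behind the backflip result is roughly: fit a reward model to pairwise human comparisons of clips, then train the policy against that learned reward instead of a hand-coded one. Here is a minimal sketch of the preference-learning step in PyTorch; the network, dimensions, and training loop are illustrative assumptions, not the original implementation.)

```python
# Sketch of Bradley-Terry preference learning for a reward model
# (in the spirit of Christiano et al. 2017). Shapes and architecture are toy.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a flattened trajectory segment to a scalar reward."""
    def __init__(self, segment_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(segment_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        return self.net(segment).squeeze(-1)  # shape: (batch,)

def preference_loss(reward_model, seg_a, seg_b, human_prefers_a):
    """P(human prefers A over B) is modeled as sigmoid(r(A) - r(B))."""
    logits = reward_model(seg_a) - reward_model(seg_b)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, human_prefers_a.float()
    )

# Toy update on a batch of human comparisons; the policy would then be
# optimized (e.g. with PPO) against the learned reward.
segment_dim = 128
model = RewardModel(segment_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

seg_a = torch.randn(32, segment_dim)   # clips shown to the human
seg_b = torch.randn(32, segment_dim)
prefs = torch.randint(0, 2, (32,))     # 1 = human preferred clip A

loss = preference_loss(model, seg_a, seg_b, prefs)
opt.zero_grad()
loss.backward()
opt.step()
```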
I mean, by this definition all capability advances are solving alignment problems. Making your model bigger allows you to solve the “pretend to be Barack Obama” alignment problem. Hooking your model up to robots allows you to solve the “fold my towels” alignment problem. Making your model require less electricity to run solves the “don’t cause climate change” alignment problem.
I agree that there is of course some sense in which one could model all of these as “alignment problems”, but this seems to stretch the definition of “alignment” in a way that greatly reduces its usefulness.