and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment
I am confused. How does RLHF help with outer alignment? Isn’t optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces)
I don’t think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and through eliciting misalignment (i.e. the points about empirical data that you mentioned). I just don’t understand where the inner/outer alignment distinction comes from in this context.
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions used to train agents in the absence of RLHF (including those from the classic specification-gaming examples, but also intrinsic motivation functions like curiosity and empowerment).
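To make that contrast concrete, here is a minimal, hypothetical sketch (in PyTorch; every name, shape, and hyperparameter is illustrative, not anyone’s actual setup) of the two kinds of reward signal: a hard-coded proxy reward versus a reward model fit to pairwise human preference labels with the Bradley-Terry style objective commonly used in the RLHF literature.

```python
import torch
import torch.nn as nn

# Hard-coded reward: a fixed proxy the designer writes down, e.g. "reward smiles".
def hard_coded_reward(observation: torch.Tensor) -> torch.Tensor:
    # Illustrative proxy: treat the first feature as a "smile detector" score.
    return observation[..., 0]

# Learned reward: a small network fit to human pairwise preferences.
class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def preference_loss(rm: RewardModel, preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(preferred beats rejected) = sigmoid(r_pref - r_rej);
    # maximize its log-likelihood over the human comparison data.
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()

# Toy training loop on synthetic stand-ins for human comparison data.
obs_dim = 8
rm = RewardModel(obs_dim)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
for _ in range(200):
    preferred = torch.randn(32, obs_dim)   # trajectories humans preferred
    rejected = torch.randn(32, obs_dim)    # trajectories humans rejected
    loss = preference_loss(rm, preferred, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The hard-coded proxy needs no data at all, which is exactly its weakness:
sample = torch.randn(4, obs_dim)
print("proxy reward:  ", hard_coded_reward(sample))
print("learned reward:", rm(sample).detach())
```

The sketch only shows that the learned reward is fit to human judgments rather than written down in advance; it says nothing about how well those judgments track what humans actually want, which is where the disagreement below picks up.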
The smiley faces example feels confusing as a “classic” outer alignment problem because AGIs won’t be trained on a reward function anywhere near as limited as smiley faces. An alternative like “AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad” feels more realistic, but also lacks the intuitive force of the smiley face example—it’s much less clear in this example why generalization will go badly, given the breadth of the data collected.
I think the smiling example is much more analogous than you are making it out to be here. I think the basic argument for “this just encourages taking control of the reward” or “this just encourages deception” goes through in the same way.
Like, RLHF is not some magical “we have definitely figured out whether a behavior is really good or bad” signal; historically it’s just been some contractors thinking for about a minute about whether a thing is fine. I don’t think people smiling conveys less Bayesian evidence (if anything, the variance in smiling is greater than the variance in RLHF approval, so the amount of information conveyed is actually greater), so I don’t buy that RLHF conveys more about human preferences in any meaningful way.
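As a toy illustration of the Bayesian-evidence point (a sketch only: the latent variable, the channel model, and every probability below are made up for the example and imply nothing about the real noise levels of either signal), one can model “the behavior is actually good” as a binary latent and ask how much information a noisy smiling signal versus a noisy contractor-approval signal carries about it:

```python
import numpy as np

def mutual_information(p_latent_good: float, p_signal_given_good: float,
                       p_signal_given_bad: float) -> float:
    """I(latent; signal) in bits for a binary latent and a binary noisy signal."""
    p_g, p_b = p_latent_good, 1.0 - p_latent_good
    p_s = p_g * p_signal_given_good + p_b * p_signal_given_bad  # P(signal = 1)

    def h(p):  # binary entropy in bits
        return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    # I(X;Y) = H(Y) - H(Y|X)
    return h(p_s) - (p_g * h(p_signal_given_good) + p_b * h(p_signal_given_bad))

# Made-up channel parameters, chosen only to illustrate the argument:
# smiling fires often when things are good and fairly often when they are not;
# contractor approval after ~a minute of thought is assumed slightly noisier here.
print("smiling:       %.3f bits" % mutual_information(0.5, 0.9, 0.30))
print("RLHF approval: %.3f bits" % mutual_information(0.5, 0.8, 0.35))
```

With these arbitrary parameters the smiling channel carries more bits; different parameters would flip it, so the comparison turns on the actual noise levels of the two signals rather than on which one feels more principled.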