I think the smiling example is much more analogous than you are making it out to be here. I think the basic argument for “this just encourages taking control of the reward” or “this just encourages deception” goes through the same way.
Like, RLHF is not some magical “we have definitely figured out whether a behavior is really good or bad” signal; historically it has just been some contractors thinking for like a minute about whether a thing is fine. I don’t think there is less Bayesian evidence conveyed by people smiling (like, the variance in smiling is greater than the variance in RLHF approval, and so the amount of information conveyed is actually more), so I don’t buy that RLHF conveys more about human preferences in any meaningful way.