Can you talk more about why RL4HF is “concealing problems”? Do you mean “attempting alignment” in a way that other people won’t, or something else?
Roughly, “avoid your actions being labelled as bad by humans [or models of humans]” is not quite the same signal as “don’t be bad”.
Ah ok, so you’re saying RL4HF is bad when it’s used to train the action model. But it seems fine if it’s done to the reward model, right?
What do you mean by “RLHF is done to the reward model”, and why would that be fine?
You can ask an LLM what actions to take, or you can ask an LLM “hey, is this a good world state?” The latter seems like it might capture a lot of human semantics about value, given RL4HF.
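To make that action-model / reward-model distinction concrete, here is a minimal sketch of the two uses. The model names, prompts, and the idea of scoring a bare state description with an off-the-shelf preference model are illustrative assumptions, not anything specified in this thread; reward models of this kind are normally trained on (prompt, response) pairs, so using one to rate a world-state description is an off-label illustration.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# (1) LLM as the action model: ask it what to do next.
#     RL4HF applied here shapes the actions the model proposes.
actor = pipeline("text-generation", model="gpt2")  # stand-in; any causal LM
plan = actor("The lab is running low on reagent X. Next step:", max_new_tokens=30)
print(plan[0]["generated_text"])

# (2) LLM as the evaluator: a preference/reward model scores a *description of a
#     world state* rather than proposing an action. The model name below is an
#     illustrative choice of a publicly available RLHF-style reward model.
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)

def score_state(description: str) -> float:
    """Return the reward model's scalar score for how 'good' the described state sounds."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

print(score_state("Everyone in the lab is safe and the experiment succeeded."))
print(score_state("The experiment succeeded, but two technicians were hospitalised."))
```

The sketch is only about the interface difference: in (1) the LLM’s output is an action, so RL4HF is shaping behaviour directly, whereas in (2) its output is a scalar judgement of a state, which some separate planner would then consume.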