What do you mean by “RLHF is done to the reward model”, and why would that be fine?
You can use an LLM to ask what actions to take, or you can use an LLM to ask "hey, is this a good world state?" The latter seems like it might capture a lot of human semantics about value, given RLHF.
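A minimal sketch of the contrast being drawn, assuming a hypothetical text-completion callable `llm` (prompt in, string out); the prompt wording, scoring scale, and parsing here are illustrative assumptions, not any particular API:

```python
from typing import Callable

def llm_as_policy(llm: Callable[[str], str], observation: str) -> str:
    """Ask the LLM directly which action to take (LLM as policy)."""
    prompt = (
        f"Current situation:\n{observation}\n\n"
        "What action should be taken next?"
    )
    return llm(prompt)

def llm_as_state_evaluator(llm: Callable[[str], str], world_state: str) -> float:
    """Ask the LLM whether a described world state is good (LLM as evaluator)."""
    prompt = (
        f"Consider the following world state:\n{world_state}\n\n"
        "On a scale from 0 (very bad) to 10 (very good), how good is this "
        "state for humans? Answer with a single number."
    )
    reply = llm(prompt).strip()
    try:
        return float(reply.split()[0]) / 10.0  # normalize to [0, 1]
    except ValueError:
        return 0.5  # fall back to a neutral score if the reply isn't numeric
```

The second function is the "is this a good world state?" pattern: rather than driving behavior directly, the LLM's RLHF-shaped judgments are used as an evaluation signal over states.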