Not important, but I don’t think RLHF can qualify as model-based RL. We usually use PPO in RLHF, and it’s a model-free RL algorithm.
I just meant that the usual RLHF setup is essentially RL in which the reward is provided by a learned model, but I agree that I was stretching the way the terminology is normally used.
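To make the distinction concrete, here is a minimal toy sketch of what I mean (the whole setup is my own illustration, not any real RLHF codebase, and plain REINFORCE stands in for PPO since both are model-free): the reward signal comes from a frozen learned model, but the policy update itself never learns or queries a dynamics model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "reward model": a frozen network standing in for a model
# trained on human preferences (purely illustrative).
torch.manual_seed(0)
reward_model = nn.Linear(4, 1)
for p in reward_model.parameters():
    p.requires_grad_(False)

# Toy "policy": a distribution over 4 discrete actions.
policy_logits = nn.Parameter(torch.zeros(4))
optimizer = torch.optim.Adam([policy_logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=policy_logits)
    action = dist.sample()

    # The reward is the learned model's score of the action...
    action_vec = F.one_hot(action, num_classes=4).float()
    reward = reward_model(action_vec).squeeze()

    # ...but the policy update is model-free (REINFORCE here, as a
    # simplified stand-in for PPO): no dynamics model is learned or
    # used anywhere in the loop.
    loss = -dist.log_prob(action) * reward.detach()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

So the only learned "model" in the loop is the reward model; nothing predicts state transitions, which is the usual criterion for calling an algorithm model-based.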