I just meant that the usual RLHF setup is essentially RL in which the reward is provided by a learned model, but I agree that I was stretching the way the terminology is normally used.