Yes, it's true that RLHF breaks down if you apply arbitrary optimization pressure; that's why you have to add a KL-divergence penalty, and that penalty is difficult to calibrate.
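For concreteness, the KL-regularized objective is roughly of the form used in InstructGPT (a sketch, omitting their extra pretraining-mix term):

$$\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\!\big[\, \pi_\theta(\cdot\mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot\mid x) \,\big]$$

"Difficult to calibrate" here means choosing the coefficient $\beta$: too small and the policy over-optimizes the learned reward model, too large and it barely moves away from the supervised fine-tuned policy $\pi_{\mathrm{SFT}}$.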
I don't understand why the outer/inner decomposition is relevant to my question; I'm only talking about outer alignment.
Point 3 is wrong: in the InstructGPT paper, the RLHF-trained GPT is rated higher by human evaluators than the supervised fine-tuned GPT.