Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
The first LaMDA only used SL. (I’m not sure whether LaMDA 2 or the current version of LaMDA still uses SL only.) Meanwhile OpenAI switched from pure SL to SL+RL. Anthropic also uses SL+RL (though with RL from AI feedback rather than RLHF specifically). So apparently SL+RL has proven more effective for fine-tuning than pure SL.
Why SL at all, why not pure RL? Apparently because you first have to get the model to answer your questions and instructions, rather than just predicting a continuation of the text, before you can reward good responses via RL. (There should be more details in the InstructGPT paper and the more recent Constitutional AI paper.)
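To make the SL-then-RL ordering concrete, here is a minimal toy sketch (my own illustration, not the actual pipeline from either paper): the model, data, and reward function are made-up stand-ins, and the RL step uses plain REINFORCE where InstructGPT uses PPO. The point is only that stage 2 requires sampling responses from the model, which presupposes stage 1 has already taught it to respond to prompts.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32

# Toy "language model": maps a single prompt token to logits over response tokens.
policy = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- Stage 1: supervised fine-tuning (SL) on (prompt, demonstration) pairs ---
# Cross-entropy on human-written responses gets the model to follow
# instructions instead of merely continuing the text.
sl_pairs = [(torch.tensor([3]), torch.tensor([7])),   # prompt token -> demo response token
            (torch.tensor([5]), torch.tensor([9]))]
for _ in range(100):
    for prompt, response in sl_pairs:
        logits = policy(prompt)
        loss = nn.functional.cross_entropy(logits, response)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# --- Stage 2: RL on top of the SL model ---
# A reward model scores sampled responses; you can only sample and reward
# responses once the model produces responses in the first place.
def reward_model(prompt, response):
    # Stand-in for a learned reward model.
    return 1.0 if response.item() == 7 else -1.0

for _ in range(200):
    prompt = torch.tensor([3])
    logits = policy(prompt)
    dist = torch.distributions.Categorical(logits=logits[0])
    response = dist.sample()
    reward = reward_model(prompt, response)
    loss = -dist.log_prob(response) * reward  # policy-gradient update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```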