The arguments made here about RL apply equally to supervised fine-tuning.
Supervised fine-tuning also risks distributional collapse when its objective maximizes the correctness of the prediction and contains no term that preserves the model’s original distributional properties.
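The collapse described here can be seen in a toy setting. The sketch below is an illustration under stated assumptions, not an experiment from this text: a single categorical distribution over four outputs stands in for the model, and a KL penalty toward the base distribution stands in for "preserving the original distributional properties" (one common choice among several). The names `fine_tune`, `beta`, `base_logits`, and `target` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def fine_tune(base_logits, target, beta, steps=2000, lr=0.5):
    """Gradient ascent on  log p[target] - beta * KL(p || p_base).

    beta = 0 is a pure correctness objective (assumed stand-in for SFT);
    beta > 0 adds an assumed KL penalty toward the base distribution.
    """
    p_base = softmax(base_logits)
    logits = base_logits.copy()
    for _ in range(steps):
        p = softmax(logits)
        # d/dlogits of log p[target] is (one_hot(target) - p)
        grad_ll = -p
        grad_ll[target] += 1.0
        # d/dlogits of KL(p || p_base) is p * (log_ratio - KL)
        log_ratio = np.log(p) - np.log(p_base)
        kl = (p * log_ratio).sum()
        grad_kl = p * (log_ratio - kl)
        logits = logits + lr * (grad_ll - beta * grad_kl)
    return softmax(logits)

# Toy base model: non-uniform distribution over four outputs.
base_logits = np.array([1.0, 0.5, 0.0, -0.5])
target = 2                       # index of the "correct" output

collapsed = fine_tune(base_logits, target, beta=0.0)
regularized = fine_tune(base_logits, target, beta=1.0)

print("base        :", softmax(base_logits).round(3), entropy(softmax(base_logits)))
print("beta = 0    :", collapsed.round(3), entropy(collapsed))
print("beta = 1    :", regularized.round(3), entropy(regularized))
```

With `beta = 0` nearly all probability mass moves onto the target and the entropy falls toward zero. With `beta = 1` the target still gains mass, but the remaining outputs keep their relative proportions from the base distribution. The same contrast holds whether the correctness signal comes from a reward or from a supervised label, which is the point of the paragraph above.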