Fun read!
This seems to highlight that it's vital for current fine-tuned models to change the output distribution only a little (e.g. a small KL divergence between the base model and the fine-tuned model). If they change the distribution a lot, they run into unintended optima; but when the fine-tuned model isn't "spending KL divergence" to change its path, the base distribution serves as a reasonable prior / a reasonable set of underlying dynamics for the text to follow.
Except it's still weird how bad the reward model is: it's not as though the reward model was trained on the behavior it produced (the way humans' genetic code was); it's just supervised learning on human reviews.
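For concreteness, the kind of KL-penalized objective I have in mind looks roughly like the sketch below (PyTorch, with made-up function and argument names; real RLHF implementations differ in detail, though putting the reward model's score on the final token and a per-token KL penalty everywhere is a common convention):

```python
import torch
import torch.nn.functional as F

def kl_shaped_rewards(policy_logits, base_logits, chosen_tokens, sequence_reward, beta=0.1):
    """Sketch of KL-penalized reward shaping for RLHF (illustrative, not any library's API).

    policy_logits, base_logits: [seq_len, vocab] logits from the fine-tuned and base models
    chosen_tokens:              [seq_len] token ids actually sampled
    sequence_reward:            scalar score from the reward model for the whole sequence
    """
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)

    # log pi_finetuned(token) - log pi_base(token) for each sampled token;
    # averaged over rollouts this estimates the KL(finetuned || base) being penalized.
    idx = chosen_tokens.unsqueeze(-1)
    per_token_kl = (policy_logprobs.gather(-1, idx) - base_logprobs.gather(-1, idx)).squeeze(-1)

    # Pay -beta * KL at every step for deviating from the base distribution,
    # and add the reward model's score only at the final token.
    rewards = -beta * per_token_kl
    rewards[-1] = rewards[-1] + sequence_reward
    return rewards
```

The point is just that beta is the knob: with a large beta the policy barely moves off the base distribution, and as beta goes to zero it's free to climb whatever unintended optimum the reward model happens to have.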