A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization. (We found this approach more effective than simply increasing the KL coefficient.) This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.
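A minimal sketch of what that mix-in looks like in a single update, assuming a HuggingFace-style causal LM (`policy(...).logits`); the names (`combined_step`, `pretrain_batch`, `gamma`) are illustrative, not OpenAI's code:

```python
import torch
import torch.nn.functional as F

def combined_step(policy, ppo_loss: torch.Tensor, pretrain_batch: dict,
                  gamma: float = 1.0) -> torch.Tensor:
    """Mix a plain next-token log-likelihood loss on original pretraining data
    into the RL update, as the quoted blog post describes.
    `gamma` is a placeholder mixing coefficient, not the value OpenAI used."""
    input_ids = pretrain_batch["input_ids"]            # (batch, seq_len)
    logits = policy(input_ids).logits                  # (batch, seq_len, vocab)
    # Standard causal-LM objective: position t predicts token t+1.
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    # Total loss = RL (PPO) objective + weighted pretraining term.
    return ppo_loss + gamma * lm_loss
```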
Also, I think you have it subtly wrong: it’s not just a KL constraint each step. (PPO already constrains step size.) It’s a KL constraint for total divergence from the original baseline supervised model: https://arxiv.org/pdf/2009.01325.pdf#page=6 https://arxiv.org/abs/1907.00456 So it does have limits to how much it can shift probabilities in toto.
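Concretely, that means the per-token penalty is computed against a frozen copy of the supervised (SFT) baseline, not merely against the previous PPO iterate. A rough sketch, with `policy_logprobs`/`sft_logprobs` as hypothetical log-probabilities of the sampled tokens under the current policy and the frozen baseline, and `beta` a placeholder coefficient:

```python
import torch

def kl_shaped_rewards(reward: torch.Tensor,
                      policy_logprobs: torch.Tensor,
                      sft_logprobs: torch.Tensor,
                      beta: float = 0.02) -> torch.Tensor:
    """Penalize cumulative divergence from the frozen SFT baseline.
    PPO's clipping only limits each step relative to the previous iterate;
    this term bounds how far the policy can drift from the original model."""
    # Per-token estimate of log(pi_RL / pi_SFT) on the tokens actually sampled.
    per_token_kl = policy_logprobs - sft_logprobs      # (batch, seq_len)
    shaped = -beta * per_token_kl                      # KL penalty at every token
    shaped[:, -1] += reward                            # reward-model score at sequence end
    return shaped
```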
One wrinkle is that (sigh) it’s not just a KL constraint anymore: now it’s a KL constraint and also some regular log-likelihood training on original raw data to maintain generality: https://openai.com/blog/instruction-following/ https://arxiv.org/pdf/2203.02155.pdf#page=15
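For reference, the combined objective on that page of the paper is, up to notation, the reward-model score minus a KL penalty against the frozen SFT policy, plus a weighted pretraining log-likelihood term:

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \Big[ r_\theta(x,y)
      - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \Big]
  + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
    \big[ \log \pi_\phi^{\mathrm{RL}}(x) \big]
```

Here r_θ is the learned reward model, β the KL coefficient against the frozen supervised policy π^SFT, and γ the weight on the pretraining mix (setting γ = 0 recovers plain KL-penalized PPO).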