Assuming that it was fine-tuned with RLHF (which OpenAI has hinted at with much eyebrow wiggling but not, to my knowledge, confirmed), then it does have some special sauce. Roughly:
- if it’s at the beginning of a story,
- and the base model predicts [“Once”: 10%, “It”: 10%, … “Happy”: 5% …],
- and then during RLHF, the 10% of the time it starts with “Once” it writes a generic story and gets lots of reward, but when it outputs “Happy” it tries to write in the style of Tolstoy and bungles it, getting little reward,
=> it will update to output “Once” more often in that situation.
The KL divergence between successive updates is bounded by the PPO algorithm, but over many updates it can shift from [“Once”: 10%, “It”: 10%, … “Happy”: 5% …] to [“Once”: 90%, “It”: 5%, … “Happy”: 1% …] if the final results from starting with “Once” are reliably better.
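To make the "small steps, big cumulative shift" point concrete, here is a toy numerical sketch (my own illustration, not anyone’s actual training code): a categorical distribution over candidate first tokens is repeatedly nudged toward reward, with each update’s KL divergence from the previous distribution capped, and it still ends up concentrated on “Once”.

```python
# Toy illustration (invented by me, not anyone's actual RLHF code):
# a categorical distribution over candidate first tokens, repeatedly
# nudged toward higher reward, with each update's KL divergence from
# the previous distribution capped -- analogous to PPO's step-size limit.
import numpy as np

tokens = ["Once", "It", "Happy", "<everything else>"]
probs  = np.array([0.10, 0.10, 0.05, 0.75])   # base model's first-token slice
reward = np.array([1.0, 0.3, -0.5, 0.0])      # "generic story" pays off (made-up numbers)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

max_step_kl, lr = 0.01, 0.5
for _ in range(200):
    old = probs.copy()
    # multiplicative (policy-gradient-flavored) update toward the advantage
    new = probs * np.exp(lr * (reward - probs @ reward))
    new /= new.sum()
    # shrink the step until this single update stays within the KL budget
    while kl(new, old) > max_step_kl:
        new = 0.5 * new + 0.5 * old
    probs = new

print(dict(zip(tokens, probs.round(3))))
# Each individual update is a tiny KL step from its predecessor, but the
# compounded result is the kind of [10%, 10%, 5%] -> [~90%, ...] shift above.
```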
It’s hard to say whether that means it’s planning to write a generic story out of an agentic desire to become a hack and please the masses, but it’s certainly changing its output distribution based on what happened many tokens in the future.
A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization. (We found this approach more effective than simply increasing the KL coefficient.) This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.
One wrinkle is that (sigh) it’s not just a KL constraint anymore: now it’s a KL constraint and also some regular log-likelihood training on original raw data to maintain generality: https://openai.com/blog/instruction-following/ https://arxiv.org/pdf/2203.02155.pdf#page=15
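For concreteness, here is a minimal sketch of what that mixed objective looks like (the InstructGPT paper calls this variant “PPO-ptx”, if I recall correctly). The toy model, tensor shapes, and mixing coefficient below are placeholders of my own, not values from the paper:

```python
# Minimal sketch of the "mix in pretraining log-likelihood" idea from the
# quoted passage: a clipped PPO term on RLHF rollouts plus plain next-token
# cross-entropy on a slice of the original pretraining data. Everything
# here (toy linear "model", shapes, ptx_coef) is a made-up placeholder.
import torch
import torch.nn.functional as F

vocab, d = 100, 32
policy = torch.nn.Linear(d, vocab)   # stand-in for the language model head

def ppo_ptx_loss(hidden_rl, advantages, old_logprobs, actions,
                 hidden_pre, pretrain_tokens, ptx_coef=0.03, clip=0.2):
    # --- clipped PPO term on RLHF rollouts ---
    logits = policy(hidden_rl)
    logprobs = F.log_softmax(logits, dim=-1).gather(-1, actions[:, None]).squeeze(-1)
    ratio = torch.exp(logprobs - old_logprobs)
    ppo = -torch.min(ratio * advantages,
                     torch.clamp(ratio, 1 - clip, 1 + clip) * advantages).mean()
    # --- ordinary log-likelihood on original pretraining data ---
    ptx = F.cross_entropy(policy(hidden_pre), pretrain_tokens)
    return ppo + ptx_coef * ptx        # mix the two objectives

# Dummy batch just to show the call shape.
loss = ppo_ptx_loss(torch.randn(8, d), torch.randn(8), torch.randn(8),
                    torch.randint(vocab, (8,)),
                    torch.randn(16, d), torch.randint(vocab, (16,)))
loss.backward()
```

In the real setup the extra batch is sampled from GPT-3’s original pretraining distribution, per the quote above; here it is random tensors just to show where the log-likelihood term enters the loss.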
Also, I think you have it subtly wrong: it’s not just a KL constraint each step. (PPO already constrains step size.) It’s a KL constraint for total divergence from the original baseline supervised model: https://arxiv.org/pdf/2009.01325.pdf#page=6 https://arxiv.org/abs/1907.00456 So it does have limits to how much it can shift probabilities in toto.
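And to make the total-divergence point concrete: as I read the linked summarization paper, the per-token reward fed to PPO subtracts a KL penalty measured against the frozen supervised (SFT) baseline, not merely against the previous policy iterate. A minimal sketch, with an invented β and invented numbers:

```python
# Sketch of a reward with a KL penalty against the *frozen* SFT baseline
# (the shape of the term as I understand the linked paper); beta and all
# the numbers below are placeholders, not values from the paper.
import math

def penalized_reward(rm_score, rl_logprob, sft_logprob, beta=0.02):
    # R(x, y) = r(x, y) - beta * [log pi_RL(y|x) - log pi_SFT(y|x)]
    # The penalty grows with total drift from the SFT model, however
    # gradually that drift accumulated across PPO updates.
    return rm_score - beta * (rl_logprob - sft_logprob)

# A completion the reward model likes, which the RL policy now assigns
# far more probability than the original SFT baseline did:
print(penalized_reward(rm_score=1.5,
                       rl_logprob=math.log(0.90),    # policy now says 90%
                       sft_logprob=math.log(0.05)))  # baseline said 5%
```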
Interesting. I’ll have to think about it a bit. But I don’t have to think at all to nix the idea of agentic desire.