Doomimir: No, it wouldn’t! Are you retarded?
Simplicia: [apologetically] Well, actually …
Doomimir: [embarrassed] I’m sorry, Simplicia Optimistovna; I shouldn’t have snapped at you like that.
[diplomatically] But I think you’ve grievously misunderstood what the KL penalty in the RLHF objective is doing. Recall that the Kullback–Leibler divergence $D_{\text{KL}}(P \parallel Q)$ represents how surprised you’d be by data from distribution P when you expected it to come from distribution Q.
It’s asymmetric: it blows up when the data is very unlikely according to Q, which amounts to seeing something happen that you thought was nearly impossible, but not when the data is very unlikely according to P, which amounts to not seeing something that you thought was reasonably likely.
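A minimal numerical sketch of that asymmetry, with two made-up two-outcome distributions (all of the numbers are illustrative assumptions):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats: expected extra surprisal when data drawn
    from p is scored against the model q."""
    return float(np.sum(p * np.log(p / q)))

# P gives 10% probability to an event that Q considers nearly impossible.
P = np.array([0.9, 0.1])
Q = np.array([1 - 1e-9, 1e-9])

# Data from P scored against Q blows up: you keep seeing the "impossible" event.
print(kl(P, Q))  # ~1.75 nats, growing without bound as Q's 1e-9 shrinks toward 0

# Data from Q scored against P stays small: you merely fail to see an event
# that P thought was reasonably likely.
print(kl(Q, P))  # ~0.11 nats
```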
We—I mean, not we, but the maniacs who are hell-bent on destroying this world—include a $D_{\text{KL}}(\pi_{\text{RLHF}} \parallel \pi_{\text{base}})$ penalty term in the RL objective because they don’t want the updated policy to output tokens that would be vanishingly unlikely coming from the base language model.
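In practice the penalty is usually folded into the per-token reward. A minimal sketch, assuming the standard log-ratio shaping (the function name and the β value are hypothetical):

```python
import numpy as np

def kl_shaped_reward(reward, logprob_rlhf, logprob_base, beta=0.1):
    """Per-token reward minus beta * (log pi_RLHF - log pi_base).
    In expectation under pi_RLHF, the subtracted term equals
    beta * D_KL(pi_RLHF || pi_base), so the penalty bites exactly
    on tokens the base model considers nearly impossible."""
    return reward - beta * (logprob_rlhf - logprob_base)

# A token the base model finds unremarkable costs almost nothing ...
print(kl_shaped_reward(1.0, np.log(0.2), np.log(0.1)))    # ~0.93
# ... but a token the base model thinks is ~impossible is crushed.
print(kl_shaped_reward(1.0, np.log(0.2), np.log(1e-12)))  # ~-1.60
```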
But your specific example of threats and promises isn’t vanishingly unlikely according to the base model! Common Crawl webtext is going to contain a lot of natural language reasoning about threats and promises! It’s true, in a sense, that the function of the KL penalty term is to “stay close” to the base policy. But you need to think about what that means mechanistically; you can’t just reason that the webtext prior is somehow “safe” in a way that means staying KL-close to it is safe.
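To put a number on “mechanistically”: the KL cost of amplifying a behavior scales with its negative log-probability under the base model, so anything webtext already contains in bulk is cheap for the optimizer to reach. A back-of-the-envelope sketch with made-up base probabilities:

```python
import numpy as np

# If the tuned policy concentrates nearly all its mass on a behavior,
# its per-step KL contribution is roughly log(1 / p_base).
for p_base in [0.1, 0.01, 1e-6, 1e-12]:
    print(f"base prob {p_base:g}: ~{np.log(1 / p_base):.1f} nats of KL")
```

A behavior at base probability 0.01 costs only about 4.6 nats; it’s the genuinely off-distribution outputs, not webtext-common reasoning about threats and promises, that the penalty makes expensive.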
But you probably won’t understand what I’m talking about for another 70 days.