this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.” I don’t know what to make of this.
I think in the original paper, they don’t have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this. (Also, more speculatively, I’d guess that using bigger models on more realistic tasks probably leads to the reward model generalizing better, so that optimization in batches becomes more feasible.)
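(For concreteness, the KL term I have in mind penalizes the policy for drifting away from the original LM; as I remember the later fine-tuning paper setting it up, the quantity actually optimized is something like

R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)}

where r is the learned reward model, \pi is the policy being fine-tuned, \rho is the original LM, and \beta is the penalty coefficient. Notation mine.)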
After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.
Don’t you still need a model that converts from human preferences over tokens to likelihoods? It sounds to me that the architecture you’re suggesting is like theirs, except using a horizon of 1. Or perhaps you don’t want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?
The original paper & codebase definitely had KL penalties on the PPO policy. I spent a fair bit of time fiddling with it and letting it go high to see what adversarial ABC music examples it found, in the hopes that it would train the reward model better when I labeled them. It didn’t seem to work; it would just find similar, only slightly different examples.
By “original paper” do you mean Deep RL from Human Preferences or Fine-Tuning Language Models from Human Preferences? The latter did have a KL penalty, but OP linked to the former. I just skimmed the former again and saw no mention of a KL penalty (but I easily could have missed it).
The latter. I didn’t notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous finetuning paper can’t be due to adding the KL constraint, because they already had one. It has to be something else they changed, like more/better labels or bigger models.
Yeah, I definitely agree with that; I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I’d guess the increased performance is primarily due to label quality and a larger model.
I think in the original paper, they don’t have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.
Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post—I was reading over the funny samples I quoted at the end and thought “huh, that would qualify as ‘bizarre behavior,’ wouldn’t it?”
Or perhaps you don’t want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?
If I understand you, yes, this is what I want. My intuition here is based on:
at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
when OpenAI (and I) think about what “better probabilities” we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific “made-up” facts, or the kind of synthetic errors they introduce in Table 18
So, it feels like “we” want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.
Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I’m annotating to improve an LM for nonfiction writing, and I see “Paris, the capital of Canada,” what I really want is to make the token ” Canada” less probable in this context.
This is a preference over next-token probabilities, not sequences—if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they’re naturally in LM terms to begin with.
This doesn’t get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked “good” and descend for tokens marked “bad,” but there may be conceptually similar losses that are better-behaved in training.
Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn’t a “horizon” per se because all episodes have a fixed duration and receive rewards only at the end.
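To make the “most obvious thing” concrete, here is a rough sketch of that loss in PyTorch. The function name, the {-1, 0, +1} labeling scheme, and the assumption that the logits are already aligned with the tokens they score are my own choices, not anything from the papers:

import torch.nn.functional as F

def token_feedback_loss(logits, tokens, labels):
    # logits: (batch, seq, vocab) LM scores for each position
    # tokens: (batch, seq) the tokens that actually appeared
    # labels: (batch, seq) annotator judgments: +1 good, -1 bad, 0 unlabeled
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    labels = labels.to(token_logp.dtype)
    # Minimizing -(label * logp) raises the log-probability of "good" tokens
    # and lowers it for "bad" ones; unlabeled tokens contribute nothing.
    denom = labels.abs().sum().clamp(min=1.0)
    return -(labels * token_logp).sum() / denom

The unbounded “descend” term is the part I’d expect to behave badly in training (it pushes log-probabilities toward negative infinity), which is where something like a KL penalty against the original LM, or a clipped variant of the loss, would come in.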
That all makes sense, except for this part: “where there isn’t a ‘horizon’ per se because all episodes have a fixed duration and receive rewards only at the end.”
I’m confused about why this isn’t a horizon. Perhaps we’re using words differently: I’m saying “there’s a hyperparameter that controls the number of timesteps over which credit assignment must be performed; in their setting it’s the sentence length and in your setting it is 1; nothing else would need to change”.
To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there’s only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.
With only final rewards, you can still include it as a variable formally, but there’s no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)
I guess I was using “there isn’t a horizon per se” to mean “the time structure of the rewards determines the horizon for you, it wouldn’t make sense to vary it,” but I can see how that would be confusing.
If you only set the horizon to 1 but changed nothing else in their work, you’d get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
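To spell that out with some notation of my own: say the episode has length T, the reward is 0 at every step except r_T = R, and the horizon is H, so the return credited to step t is the truncated sum

G_t = \sum_{k=0}^{H-1} r_{t+k},

which equals R when t \ge T - H + 1 and 0 otherwise. With H equal to the episode length, every step shares in the final reward; shrinking H by n just zeroes the value at the first n steps; and shrinking it all the way to H = 1 leaves the final token carrying the entire reward.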
Ah got it, that makes sense, I agree with all of that.