I think in the original paper, they don’t have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.
Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post—I was reading over the funny samples I quoted at the end and thought “huh, that would qualify as ‘bizarre behavior,’ wouldn’t it?”
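For concreteness, here is a rough sketch of the kind of KL-shaped reward term being discussed; the function name and the beta value are my own illustration, not taken from the paper:

```python
def kl_penalized_reward(rm_score, policy_logprob, ref_logprob, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the original LM.

    rm_score:       scalar score from the learned reward model for a sampled sequence
    policy_logprob: log pi(sequence) under the policy being fine-tuned
    ref_logprob:    log rho(sequence) under the original (pre-fine-tuning) LM
    beta:           penalty strength; with beta = 0 the policy is free to exploit
                    quirks of the reward model, which is the "overfitting" worry above
    """
    # (policy_logprob - ref_logprob) is the per-sample estimate of KL(pi || rho)
    return rm_score - beta * (policy_logprob - ref_logprob)
```

Maximizing this trades reward-model score against staying close to the original LM's distribution, which is what would keep the policy from wandering into degenerate samples.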
Or perhaps you don’t want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?
If I understand you, yes, this is what I want. My intuition here is based on:
at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
when OpenAI (and I) think about what “better probabilities” we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific “made-up” facts, or the kind of synthetic errors they introduce in Table 18
So, it feels like “we” want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.
Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I’m annotating to improve an LM for nonfiction writing, and I see “Paris, the capital of Canada,” what I really want is to make the token ” Canada” less probable in this context.
This is a preference over next-token probabilities, not sequences—if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they’re naturally in LM terms to begin with.
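To make the " Canada" example concrete, the quantity such an annotation points at is a single next-token probability under the model. A minimal illustration, using gpt2 from Hugging Face purely as a stand-in rather than the actual model under discussion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "Paris, the capital of"
ids = tok(prefix, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]  # logits for the token after the prefix
probs = torch.softmax(next_token_logits, dim=-1)

canada_id = tok.encode(" Canada")[0]
print(probs[canada_id].item())  # the probability the annotation says should go down
```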
This doesn’t get you all the way to having a unique loss: the most obvious thing would be to ascend the likelihood of tokens marked “good” and descend it for tokens marked “bad,” but there may be conceptually similar losses that are better-behaved in training.
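A minimal sketch of that most-obvious loss, in PyTorch-style code; the function and tensor names are hypothetical, and in practice you would probably want weighting or clipping so the “descend” term doesn’t blow up:

```python
import torch
import torch.nn.functional as F


def token_feedback_loss(logits, target_ids, annotations):
    """
    logits:      (seq_len, vocab) LM logits, aligned so logits[t] predicts target_ids[t]
    target_ids:  (seq_len,) the tokens that actually appear in the text
    annotations: (seq_len,) +1 for tokens marked "good", -1 for "bad", 0 for unannotated
    """
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Minimizing this ascends likelihood on +1 tokens and descends it on -1 tokens;
    # unannotated tokens contribute nothing.
    return -(annotations.float() * token_logprobs).sum()
```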
Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn’t a “horizon” per se because all episodes have a fixed duration and receive rewards only at the end.
That all makes sense, except for this part:

“where there isn’t a ‘horizon’ per se because all episodes have a fixed duration and receive rewards only at the end.”
I’m confused how this is not a horizon? Perhaps we’re using words differently—I’m saying “there’s a hyperparameter that controls the number of timesteps over which credit assignment must be performed; in their setting it’s the sentence length and in your setting it is 1; nothing else would need to change”.
To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there’s only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.
With only final rewards, you can still include the horizon as a variable formally, but there’s no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)
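Spelling that out (an illustrative formalization, assuming a single reward at the final step $T$ and a truncated horizon $h = T - n$):

$$V_t^{(h)} \;=\; \mathbb{E}\!\left[\sum_{k=t}^{\min(t+h-1,\,T)} r_k\right], \qquad r_k = 0 \text{ for all } k < T,$$

so the sum reaches the final, rewarded step only when $t + h - 1 \ge T$, i.e. when $t > n$; for the first $n$ steps the truncated value is identically $0$, and nothing is gained by making $h$ smaller than the episode length.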
I guess I was using “there isn’t a horizon per se” to mean “the time structure of the rewards determines the horizon for you, it wouldn’t make sense to vary it,” but I can see how that would be confusing.
If you only set the horizon to 1 but changed nothing else in their work, you’d get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
Ah got it, that makes sense, I agree with all of that.