[edit: this says the same thing as Quintin’s sibling comment]
Important context for those who don’t know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.
(Aside: the terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.)
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there’s no credit assignment over time, and so it doesn’t train the model to actually aim towards high-reward states.
To be clear, I’m not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It’s specifically SFT on highly-rated model outputs—i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating—which I’m calling RL here. Note that this training process does aim the model towards high reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
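To make the procedure concrete, here's a minimal toy sketch of the loop I have in mind (the softmax-over-ten-outputs "policy" and the random reward vector are stand-ins made up for illustration; a real pipeline would sample and fine-tune a language model):

```python
# Toy sketch: sample rollouts, rate them, do a supervised step toward the
# top-rated ones, repeat. Not the actual FeedME pipeline, just the loop shape.
import numpy as np

rng = np.random.default_rng(0)
n_outputs = 10                          # pretend each rollout is one of 10 possible outputs
logits = np.zeros(n_outputs)            # softmax "policy" over those outputs
reward = rng.normal(size=n_outputs)     # stand-in for human ratings
lr, n_samples, top_k = 0.5, 64, 8

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    rollouts = rng.choice(n_outputs, size=n_samples, p=probs)  # sample from current policy
    top = rollouts[np.argsort(reward[rollouts])[-top_k:]]      # keep the highest-rated
    target = np.bincount(top, minlength=n_outputs) / top_k     # empirical dist. of the winners
    logits += lr * (target - probs)     # supervised (cross-entropy) step toward the winners

print("policy mode:", softmax(logits).argmax(), "| highest-rated output:", reward.argmax())
```

Each step is pure supervised fine-tuning, but because the data is generated by the current policy and filtered by reward, the policy drifts toward high-reward outputs over iterations, which is the sense in which I'm calling it RL.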
So I still feel that the way I used the term “RL” was in line with normal usage. But if people still disagree now that I’ve explained myself in more detail, I’d be interested in hearing why.
Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state makes the policy more likely to navigate to that state in general (not just via the specific pathway by which it happened to reach that state), i.e. there's efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually lead to mode collapse to the highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, similarly to online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there's a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it'll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger), which people don't call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes; it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of "online", "state-wise credit assignment", and "converges to sharp optimum" separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′, it might not know to take action a′ (because it's only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
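(For concreteness, here are the two updates being compared, written in my own notation: tabular Q-learning with step size α and discount γ, and REINFORCE using the whole-trajectory return R(τ) with no baseline, where r is the reward received on reaching s∗.)

$$\text{REINFORCE:}\quad \theta \leftarrow \theta + \alpha \, R(\tau)\, \nabla_\theta \log \pi_\theta(a' \mid s')$$

$$\text{Q-learning:}\quad Q(s', a') \leftarrow Q(s', a') + \alpha \left[\, r + \gamma \max_{a''} Q(s^*, a'') - Q(s', a') \,\right]$$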
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then Q-learning will quickly learn that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
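Here's a quick tabular check of that claim; the reward values, discount, and action names below are choices I've made up to match the example:

```python
# Toy version of the example: s3 is high reward, s4 medium, s5 low, and we only
# get to replay the three observed trajectories. Transitions are deterministic:
# action "->x" from a state lands in state x.
T = {("s1", "->s2"): "s2", ("s1", "->s4"): "s4",
     ("s2", "->s3"): "s3", ("s2", "->s5"): "s5"}
R = {"s3": 10.0, "s4": 5.0, "s5": 1.0}
terminal = {"s3", "s4", "s5"}
observed = [[("s2", "->s3")],
            [("s1", "->s4")],
            [("s1", "->s2"), ("s2", "->s5")]]

gamma, alpha = 0.9, 0.5
Q = {sa: 0.0 for sa in T}

# Replay the observed transitions with one-step Q-learning updates.
for _ in range(100):
    for traj in observed:
        for s, a in traj:
            nxt = T[(s, a)]
            boot = 0.0 if nxt in terminal else max(Q[(s_, b)] for (s_, b) in T if s_ == nxt)
            Q[(s, a)] += alpha * (R.get(nxt, 0.0) + gamma * boot - Q[(s, a)])

print({k: round(v, 2) for k, v in Q.items()})
# Q[("s1", "->s2")] converges to ~9.0 by bootstrapping off Q[("s2", "->s3")] = 10.0,
# which beats Q[("s1", "->s4")] = 5.0, so greedy Q-learning goes s1 -> s2 -> s3 even
# though the only observed trajectory that took s1 -> s2 ended in the low-reward s5.
# A return-based update (REINFORCE, or SFT on the top-rated trajectories) only sees
# a low return for s1 -> s2 -> s5 and a medium one for s1 -> s4, so it upweights
# s1 -> s4 until further exploration happens to produce s1 -> s2 -> s3.
```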
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because Q-learning is doing a one-step lookahead, which isn't really scalable. (That also isn't true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
Flagging that I would find that use of the term super confusing.