I don’t think this is right—the main hype effect of chatGPT over previous models feels like it’s just because it was in a convenient chat interface that was easy to use and free.
I don’t have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I’m quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/ChatGPT.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF’s capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.
Our models generally used the best available datasets at the time of training, and so different engines using the same training methodology might be trained on different data.
So I guess 003 could also have different base pretraining data?
[edit: this says the same thing as Quintin’s sibling comment]
Important context for those who don’t know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fune-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.
(Aside: the terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.)
The terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there’s no credit assignment over time, and so it doesn’t train the model to actually aim towards high-reward states.
To be clear, I’m not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It’s specifically SFT on highly-rated model outputs—i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating—which I’m calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
So I still feel that the way I used the term “RL” was in line with normal usage. But if people still disagree now that I’ve explained myself in more detail, I’d be interested in hearing why.
Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state make the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there’s efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually leads to mode collapse to highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, in a similar way as online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don’t call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it’s only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
To throw in another perspective, I’ve been working with the OpenAI API models most days of the week for the past year or so. For my uses, the step-change in quality came from moving from base davinci to text-davinci-002, whereas the improvements moving from that to text-davinci-003 were decidedly less clear.
I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it’s very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.
Seems like you’re implying that davinci is the base model for 002 and 003. That’s not the case; davinci has one base model (GPT-3) and then 002 and 003 share a different base model (GPT-3.5).
Fair. I think the crucial question to Ajeya & Matthew’s discussion of “Why the hype now?” is exactly how much worse the non-RLHF models that had been available since at least last March (davinci, code-davinci-002, text-davinci-002) actually were than the RLHF models made available just recently (text-davinci-003 and ChatGPT’s underlying model). I stand by the opinion that the besides the new chat stuff, most of the improvement happened within the old cohort, rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.
I don’t have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I’m quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
I’ve felt that ChatGPT was roughly on par with text-davinci-003, though much more annoying and with a worse interface.
That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/ChatGPT.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF’s capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.
This page also says that:
So I guess 003 could also have different base pretraining data?
[edit: this says the same thing as Quintin’s sibling comment]
Important context for those who don’t know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fune-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.
(Aside: the terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.)
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there’s no credit assignment over time, and so it doesn’t train the model to actually aim towards high-reward states.
To be clear, I’m not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It’s specifically SFT on highly-rated model outputs—i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating—which I’m calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
So I still feel that the way I used the term “RL” was in line with normal usage. But if people still disagree now that I’ve explained myself in more detail, I’d be interested in hearing why.
Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state make the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there’s efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually leads to mode collapse to highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, in a similar way as online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don’t call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it’s only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
Flagging that I would find that use of the term super confusing.
To throw in another perspective, I’ve been working with the OpenAI API models most days of the week for the past year or so. For my uses, the step-change in quality came from moving from base
davinci
totext-davinci-002
, whereas the improvements moving from that totext-davinci-003
were decidedly less clear.I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it’s very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.
Seems like you’re implying that davinci is the base model for 002 and 003. That’s not the case; davinci has one base model (GPT-3) and then 002 and 003 share a different base model (GPT-3.5).
Fair. I think the crucial question to Ajeya & Matthew’s discussion of “Why the hype now?” is exactly how much worse the non-RLHF models that had been available since at least last March (
davinci
,code-davinci-002
,text-davinci-002
) actually were than the RLHF models made available just recently (text-davinci-003
and ChatGPT’s underlying model). I stand by the opinion that the besides the new chat stuff, most of the improvement happened within the old cohort, rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.