RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.
And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3. We also should not think this was overdetermined, since 1.5 years passed between the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don't think makes sense to use as a datapoint here, since it's in substantial part the same employees).
I have been trying to engage with the actual details here, and indeed have had a bunch of arguments with people over the last 2 years where I have been explicitly saying, based on those details, that RLHF is pushing on commercialization bottlenecks. People's belief that this was not the case was the primary crux in those conversations on whether RLHF was good or bad.
The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it's pretty plausible or likely that the work would otherwise not happen. I've had a few of these conversations with you as well (though in aggregate not a lot), and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument: if that were the case, more capabilities-oriented people would be working on it, and since they weren't, it likely wasn't a commercialization bottleneck). So I do feel a bit like I want to call you out on that, though I might also be misremembering the details (some of this was online, so it might be worth going back through our comment histories).
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be used by developers.
It was deployed in a way that made it much easier for more people to interact with it.
People hadn’t appreciated progress since GPT-3, or even how good GPT-3 was, and this went viral (due to a combination of 1+2).
If there are large capability differences I expect they are mostly orthogonal improvements.
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs.
My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.
ChatGPT was impactful because of a big mismatch between people’s perceptions of LM abilities and reality. That gap was going to get closed sooner or later (if not now then probably at the GPT-4 release). I think it’s reasonable to think that this was a really destructive decision by OpenAI, but I don’t think it’s reasonable to treat it as a counterfactual $10B of investment.
I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake. How impactful was the existence of OpenAI? Leadership decisions at Google? Microsoft’s willingness to invest in OpenAI? The surprising effectiveness of transformers? Google originally deciding not to scale up LMs aggressively? The training of PaLM? The original GPT-3 release decisions? The fact that LM startups are raising at billion dollar valuations? The fact that LM applications are making hundreds of millions of dollars? These sources of variance all add up to 100% of the variance in AI investment, not 100000% of the variance.
I think it’s a persistent difference between us that I tend to think fundamentals matter more and you tend to think things are more contingent and random. I tend to find your causal attribution implausible in other technologies as well as AI.
We also should not think this was overdetermined, since 1.5 years passed between the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones)
There were significant capability increases between GPT-3 and GPT-3.5 (not to mention the introduction of the earlier InstructGPT training).
The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it's pretty plausible or likely that the work would otherwise not happen. I've had a few of these conversations with you as well (though in aggregate not a lot), and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument: if that were the case, more capabilities-oriented people would be working on it, and since they weren't, it likely wasn't a commercialization bottleneck). So I do feel a bit like I want to call you out on that, though I might also be misremembering the details (some of this was online, so it might be worth going back through our comment histories).
My position was and is:
RLHF was definitely going to be done sooner or later. (I’ve definitely never thought that RLHF would never happen.)
It’s valuable to do it earlier to get started on the next thing. It’s also good to push people to something cleaner and more flexible rather than something more hacky or with no knob to change the reward function.
We were doing it before it was a big deal commercially; it would have got done later when it mattered.
To be clear, sample efficiency might be high enough later that you just use the AI’s zero-shot predictions of humans instead of collecting any new specialized data, which we also discussed specifically at the time.
I’m pretty skeptical that no one else would do RLHF. For ChatGPT in particular, I think it was built by John Schulman’s team, and John is: (i) focused on RL, (ii) pivoted to LMs after the success of GPT-3 relative to non-LM models and would have done so without RLHF, (iii) has a similar aesthetic and would pretty obviously do this or something else equally good.
I think the most likely world where people don’t adopt RLHF is one where other hackier alternatives work just as well. And it won’t be from no one trying.
I think the big argument against impact I find most compelling is: most follow-up work to RLHF didn't work that well for GPT-3 and seems to have started working only after that, so you could have just waited until people would do it anyway and in the interim focused on approaches that work better at smaller scale. I think the big miscalculation here was that I expected debate/decomposition stuff would start working interestingly with curie-sized models but was off by about 2 orders of magnitude.
I think the big argument for negative impact comes from safety-motivated folk being involved in training language models, not the RLHF stuff. I also disagree with the rationalists about their evaluations of pretty much everything, but that one feels like a more interesting disagreement.
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs
I don’t currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.
For ChatGPT in particular, I think it was built by John Schulman’s team
I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given John's safety motivation, and given my best guess that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, John would indeed not have worked on it.
Seeing RLHF teams in other organizations not directly downstream of your organizational involvement, or not quite directly entangled with your opinion, would make a bigger difference here.
I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake
I don’t think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn’t that crazy. The product has a user base not that different from other $10B products, and a growth rate to put basically all of them to shame, so I don’t think a $10B effect from Chat-GPT seems that unreasonable. There is only so much variance to go around, but Chat-GPT is absolutely massive in its impact.
I don't currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.
I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It’s just that supervised fine-tuning is worse.
I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5.
I don’t think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn’t that crazy. The product has a user base not that different from other $10B products, and a growth rate to put basically all of them to shame, so I don’t think a $10B effect from Chat-GPT seems that unreasonable. There is only so much variance to go around, but Chat-GPT is absolutely massive in its impact.
This still seems totally unreasonable to me:
How much total investment do you think there is in AI in 2023?
How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.)
How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs?
I think it’s unlikely that the reception of ChatGPT increased OpenAI’s valuation by $10B, much less investment in OpenAI, even before thinking about replaceability. I think that Codex, GPT-4, DALL-E, etc. are all very major parts of the valuation.
I also think replaceability is a huge correction term here. I think it would be more reasonable to talk about moving how many dollars of investment how far forward in time.
I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given John's safety motivation, and given my best guess that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, John would indeed not have worked on it.
I think John wants to make useful stuff, so I doubt this.
How much total investment do you think there is in AI in 2023?
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don’t know what fraction of Google’s revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.)
Variance between different years, depending on market conditions and how much products take off, seems like it's on the order of 50% to me. Like, different years have pretty hugely differing levels of investment.
My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing.
So this gives around $50B - $125B of variance to be explained by product-adjacent things like Chat-GPT.
How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs?
Existence of OpenAI is hard to disentangle from the rest. I would currently guess that in terms of total investment, GPT-2 → GPT-3 made a bigger difference than GPT-3.5 → Chat-GPT, but both made a much larger difference than GPT-3 → GPT-3.5.
I don’t think Jasper made a huge difference, since its userbase is much smaller than Chat-GPT, and also evidently the hype from it has been much lower.
Good GPUs feels kind of orthogonal. We can look at each product that makes up my 50% of the variance to be explained and see how useful/necessary good GPUs were for its development, and my sense is for Chat-GPT at least the effect of good GPUs was relatively minor since I don't think the training to move from GPT-3.5 to Chat-GPT was very compute intensive.
I would feel fine saying expected improvements in GPUs are responsible for 25% of the 50% variance (i.e. 12.5%) if you chase things back all the way, though that again feels like it isn't trying to add up to 100% with the impact from "Chat-GPT". I do think it's trying to add up to 100% with the impact from "RLHF's effect on Chat-GPT", which I claimed was at least 50% of the impact of Chat-GPT in-particular.
In any case, in order to make my case for $10B using these numbers I would have to argue that between 20% and 8% of the product-dependent variance in annual investment into AI is downstream of Chat-GPT, and indeed that still seems approximately right to me after crunching the numbers. It’s by far the biggest AI product of the last few years, it is directly credited with sparking an arms race between Google and Microsoft, and indeed even something as large as 40% wouldn’t seem totally crazy to me, since these kinds of things tend to be heavy-tailed, so if you select on the single biggest thing, there is a decent chance you underestimate its effect.
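To put the chain of guesses from the preceding comments in one place, here is a minimal sketch that just restates the arithmetic above; all figures are the guesses quoted there, not data:

```python
# Restating the Fermi arithmetic from the comments above, using the figures quoted
# there (all amounts in $B; the ranges are the commenter's stated guesses, not data).
total_investment_2023 = (200, 500)     # guessed total AI investment in 2023
year_to_year_variance = 0.5            # ~50% of the level varies between years
product_dependent_share = 0.5          # ~50% of that variance tied to products taking off

product_variance_pool = tuple(t * year_to_year_variance * product_dependent_share
                              for t in total_investment_2023)
print(product_variance_pool)           # (50.0, 125.0) -> the $50B-$125B pool

chatgpt_effect = 10                    # the claimed $10B attributable to Chat-GPT
shares = [chatgpt_effect / v for v in product_variance_pool]
print(shares)                          # [0.2, 0.08] -> Chat-GPT must explain 8%-20% of the pool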
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week.
I haven’t thought about it much, but my all things considered estimate for the expected timelines slowdown if you just hadn’t done the ChatGPT release is probably between 1-4 weeks.
Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?
One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for ±$10B effect sizes after the fact).
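Spelled out as a one-line check (a sketch assuming the 100 contributions X_i are independent, which is the idealization behind "variance adds up to 100%"):

$$\operatorname{Var}\Big(\sum_{i=1}^{100} X_i\Big) = \sum_{i=1}^{100} \operatorname{Var}(X_i) = 100 \times (\$10\text{B})^2 = (\$100\text{B})^2 \;\Longrightarrow\; \operatorname{SD} = \$100\text{B}.$$

So a hundred independent factors can each be responsible for a ±$10B swing after the fact while the total standard deviation is still only $100B, rather than 100 × $10B = $1,000B of summed effect sizes.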
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
Makes sense, sorry for the miscommunication. I really didn’t feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you.
I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don’t know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I’ve historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number).
That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in-expectation by something closer to 8-16 weeks, which isn’t enormously far away from yours, though still a good bit higher.
And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs which have so far had a lot of independence (which to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze like institutions with tons of momentum). In-general the dynamics of Google and Microsoft racing towards AGI sure is among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously.
One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for ±$10B effect sizes after the fact).
Oh, yeah, good point. I was indeed thinking of the math a bit wrong here. I will think a bit about how this adjusts my estimates, though I think I was intuitively taking this into account.
And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off.
Maybe—but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.
A sudden wave of destabilizing AI breakthroughs—with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and Chat-GPT who-knows-how-many-things—can also make people on the street concerned and both more supportive of AI regulation in general, as well as more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone speculating that this might cause a wide variety of actors—M & G included—with a desire to slow down AI progress to join forces to push for widespread regulation.
It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.
Interesting. Where did something like this happen?
I asked Chat-GPT and one of the clearest examples it came up with is patent trolling by large pharmaceutical companies. Their lobbying tends to be far more focused on securing monopoly rights to their products for as long as possible than anything related to innovation.
Other examples:
Automakers lobbying for restrictive standards for potential market disruptors like electric or self-driving vehicles
Telecoms lobbying against Net Neutrality
Taxi companies lobbying against ridesharing startups
Tech companies lobbying for intellectual property and data privacy regulations that they have better legal/compliance resources to handle
IMO it’s much easier to support high investment numbers in “AI” if you consider lots of semiconductor / AI hardware startup stuff as “AI investments”. My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing.
I’d be interested to know how you estimate the numbers here, they seem quite inflated to me.
If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 capital-to-salary ratio, that investment would mean hiring about $50B/$900k ≈ 55,000 people per company to work on this stuff. For reference, the total headcount at these orgs is roughly 100-200K.
$50B/yr is also around 25-50% of these companies' total income, and greater than profits for most of them, which again seems high.
Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023.
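A quick check of that arithmetic with the same assumed figures (the $300k salary, 2:1 capital-to-salary ratio, and $50B per company are the assumptions stated above, not data):

```python
# Sanity check of the headcount arithmetic above, using the assumptions stated there.
salary = 300_000                          # assumed average salary
cost_per_person = salary * (1 + 2)        # salary plus an assumed 2:1 capital-to-salary spend
people_per_company = 50e9 / cost_per_person
print(round(people_per_company))          # ~55,556 people per company, vs. ~100-200k total headcount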
Agree with Paul's comment above that timeline shifts are the most important variable.
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks).
As far as I can tell this resulted in a system with much worse economic viability than Chat-GPT. I would overall describe Sydney as "economically unviable", such that if Gwern's story here is correct, the difference between using straightforward supervised training on chat transcripts and OpenAI's RLHF pipeline is indeed the difference between an economically viable and unviable product.
There is a chance that Microsoft fixes this with more supervised training, but my current prediction is that they will have to fix this with RLHF, because the other technological alternatives are indeed not adequate substitutes from an economic viability perspective, which suggests that the development of RLHF really did matter a lot for this.
Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it… It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul's comment saying "I've seen only modest qualitative differences" in order to disagree and say "I think we've now seen substantial qualitative differences".
We have had 4chan play around with Chat-GPT for a while, with much less disastrous results than what happened when they got access to Sydney.
It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
I wish it were true that this is not news to anyone here, but that does not currently seem true to me. But it doesn't seem worth going into.
I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul’s comparison: retrieval especially was an interesting dynamic.
For what it’s worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don’t see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.
Yep, I think it’s pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?
I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT.
Supervised data seems way more fine-grained in what you are getting the AI to do. It’s just that supervised fine-tuning is worse.
My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don’t want it to do.
And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don’t want it to do.
It depends a lot on the use case.
When it comes to what I’m doing with ChatGPT, I care more about the quality of the best answer when I generate five answers to a prompt than I care about the quality of the worst answer. I can choose the best answer myself and ignore the others.
Many use cases have ways to filter for valuable results either automatically or by letting a human filter.
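As a minimal sketch of that usage pattern (the generate and score callables are hypothetical placeholders for a model call and whatever automatic or human filter a given use case has, not any specific API):

```python
import random

def best_of_n(prompt, generate, score, n=5):
    """Sample n candidate answers and keep only the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: the filter only has to find the best candidate, so the quality of
# the worst sample matters much less than the quality of the best one.
answers = ["great answer", "ok answer", "bad answer"]
quality = {"great answer": 2, "ok answer": 1, "bad answer": 0}
pick = best_of_n("some prompt",
                 generate=lambda p: random.choice(answers),
                 score=lambda a: quality[a])
print(pick)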
I think it’s unlikely that the reception of ChatGPT increased OpenAI’s valuation by $10B, much less investment in OpenAI, even before thinking about replaceability.
Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increased investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.
Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.
The report, citing data from analytics firm Similarweb, said an average of about 13 million unique visitors had used ChatGPT per day in January, more than double the levels of December.
“In 20 years following the internet space, we cannot recall a faster ramp in a consumer internet app,” UBS analysts wrote in the note.
I had some decent probability on this outcome but I have increased my previous estimate of the impact of Chat-GPT by 50%, since I didn’t expect something this radical (“the single fastest growing consumer product in history”).
I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake.
That’s not always the wrong thing to do—the sum of counterfactual impacts of the actions of many actors often sums up to greater than their total combined impact. A simple example would be if two co-founders of an impactful company wouldn’t have been a founder without the other. Then the sum of their counterfactual impacts is equivalent to 2 times the total impact of the company.
While I don’t have an opinion on this particular case, you could imagine that additional AI investment may not have happened if either of the following were true:
1. The original RLHF proof of concept from OpenAI didn’t happen—because Google’s leadership wouldn’t have the incentive for further investment.
2. If Google’s leadership were different—because they may not have thought to invest more money in AI.
my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3
I don’t think this is right—the main hype effect of chatGPT over previous models feels like it’s just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they’d seem similarly cool to a random journalist / VC, and generate similar excitement.
I don’t think this is right—the main hype effect of chatGPT over previous models feels like it’s just because it was in a convenient chat interface that was easy to use and free.
I don’t have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I’m quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/ChatGPT.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF’s capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.
Our models generally used the best available datasets at the time of training, and so different engines using the same training methodology might be trained on different data.
So I guess 003 could also have different base pretraining data?
[edit: this says the same thing as Quintin’s sibling comment]
Important context for those who don’t know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.
(Aside: the terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.)
The terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there’s no credit assignment over time, and so it doesn’t train the model to actually aim towards high-reward states.
To be clear, I’m not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It’s specifically SFT on highly-rated model outputs—i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating—which I’m calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
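For concreteness, here is a self-contained toy of the loop described above (canned responses and made-up ratings stand in for real model rollouts and human labels; this illustrates the general recipe, not OpenAI's actual FeedME setup):

```python
import random

RESPONSES = ["helpful answer", "rambling answer", "rude answer"]
RATING = {"helpful answer": 1.0, "rambling answer": 0.3, "rude answer": 0.0}  # stand-in for human ratings

def sample(policy):
    return random.choices(RESPONSES, weights=[policy[r] for r in RESPONSES])[0]

def filtered_sft_round(policy, n_samples=200, keep_fraction=0.25, lr=0.5):
    # 1. Have the current policy produce a bunch of rollouts and rate them.
    rollouts = sorted((sample(policy) for _ in range(n_samples)),
                      key=lambda r: RATING[r], reverse=True)
    # 2. Keep only the top-rated rollouts.
    kept = rollouts[: int(n_samples * keep_fraction)]
    # 3. "Supervised fine-tuning": move the policy toward the distribution of kept rollouts.
    target = {r: kept.count(r) / len(kept) for r in RESPONSES}
    return {r: (1 - lr) * policy[r] + lr * target[r] for r in RESPONSES}

policy = {r: 1 / 3 for r in RESPONSES}
for _ in range(5):   # repeating the round is what supplies the optimization pressure
    policy = filtered_sft_round(policy)
print(policy)        # probability mass concentrates on the highest-rated response
```

Each round is ordinary supervised fine-tuning on selected outputs, but repeating the selection step is what steers the policy toward higher-rated behavior, which is the sense of "aiming at high reward" described above.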
So I still feel that the way I used the term “RL” was in line with normal usage. But if people still disagree now that I’ve explained myself in more detail, I’d be interested in hearing why.
Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state makes the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there's efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually lead to mode collapse to the highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, in a similar way as online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don’t call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it’s only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
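To make the contrast concrete, here is a minimal tabular sketch of this example (the specific reward values 10/5/1, the deterministic "go to X" actions, and the return-weighted imitation rule standing in for REINFORCE/filtered SFT are my own assumptions for illustration):

```python
from collections import defaultdict

# Observed trajectories as (state, action, reward, next_state, terminal) transitions,
# with assumed rewards: s3 = 10 (high), s4 = 5 (medium), s5 = 1 (low).
trajectories = [
    [("s2", "->s3", 10.0, "s3", True)],                                   # s2 -> s3
    [("s1", "->s4", 5.0, "s4", True)],                                    # s1 -> s4
    [("s1", "->s2", 0.0, "s2", False), ("s2", "->s5", 1.0, "s5", True)],  # s1 -> s2 -> s5
]
actions = defaultdict(set)
for traj in trajectories:
    for s, a, *_ in traj:
        actions[s].add(a)

# Offline tabular Q-learning: bootstrapped targets stitch s1 -> s2 and s2 -> s3 together.
Q = defaultdict(float)
gamma, alpha = 0.9, 0.5
for _ in range(200):
    for traj in trajectories:
        for s, a, r, s_next, done in traj:
            boot = 0.0 if done else max((Q[(s_next, a2)] for a2 in actions[s_next]), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])

# Return-weighted imitation: each observed (state, action) is credited only with the
# return of the trajectory it actually appeared in (no bootstrapping across trajectories).
W = defaultdict(float)
for traj in trajectories:
    ret = sum(r for _, _, r, _, _ in traj)
    for s, a, *_ in traj:
        W[(s, a)] += ret

def greedy(table, s):
    return max(actions[s], key=lambda a: table[(s, a)])

print("Q-learning picks at s1:", greedy(Q, "s1"))  # '->s2', i.e. it infers s1 -> s2 -> s3
print("Imitation picks at s1:", greedy(W, "s1"))   # '->s4', since s1 -> s2 never paid off in the data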
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
To throw in another perspective, I’ve been working with the OpenAI API models most days of the week for the past year or so. For my uses, the step-change in quality came from moving from base davinci to text-davinci-002, whereas the improvements moving from that to text-davinci-003 were decidedly less clear.
I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it’s very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.
Seems like you’re implying that davinci is the base model for 002 and 003. That’s not the case; davinci has one base model (GPT-3) and then 002 and 003 share a different base model (GPT-3.5).
Fair. I think the crucial question to Ajeya & Matthew's discussion of "Why the hype now?" is exactly how much worse the non-RLHF models that had been available since at least last March (davinci, code-davinci-002, text-davinci-002) actually were than the RLHF models made available just recently (text-davinci-003 and ChatGPT's underlying model). I stand by the opinion that, besides the new chat stuff, most of the improvement happened within the old cohort, rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.
People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it’s also possible that CharacterAI’s base models are RLHF’d to be consistent roleplayers.
I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT-3 in the month prior to ChatGPT, I felt like the two interfaces were approximately as convenient).
I don’t know what mechanism was used to generate the longer coherence though.
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAIs valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.
And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3. We also should not think this was overdetermined since 1.5 years passed since the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don’t think makes sense to use as a datapoint here, since it’s in substantial parts the same employees).
I have been trying to engage with the actual details here, and indeed have had a bunch of arguments with people over the last 2 years where I have been explicitly saying that RLHF is pushing on commercialization bottlenecks based on those details, and people believing this was not the case was the primary crux on whether RLHF was good or bad in those conversations.
The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it’s pretty plausible or likely that the work would otherwise not happen. I’ve had a few of these conversations with you as well (though in aggregate not a lot) and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big of an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument that if that was the case, more capabilities-oriented people would work on it, and since they weren’t it likely isn’t a commercialization bottleneck), and so I do feel a bit like I want to call you out on that, though I might also be misremembering the details (some of this was online, so might be worth going back through our comment histories).
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be used by developers.
It was deployed in a way that made it much easier for more people to interact with it.
People hadn’t appreciated progress since GPT-3, or even how good GPT-3 was, and this went viral (due to a combination of 1+2).
If there are large capability differences I expect they are mostly orthogonal improvements.
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs.
ChatGPT was impactful because of a big mismatch between people’s perceptions of LM abilities and reality. That gap was going to get closed sooner or later (if not now then probably at the GPT-4 release). I think it’s reasonable to think that this was a really destructive decision by OpenAI, but I don’t think it’s reasonable to treat it as a counterfactual $10B of investment.
I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake. How impactful was the existence of OpenAI? Leadership decisions at Google? Microsoft’s willingness to invest in OpenAI? The surprising effectiveness of transformers? Google originally deciding not to scale up LMs aggressively? The training of PaLM? The original GPT-3 release decisions? The fact that LM startups are raising at billion dollar valuations? The fact that LM applications are making hundreds of millions of dollars? These sources of variance all add up to 100% of the variance in AI investment, not 100000% of the variance.
I think it’s a persistent difference between us that I tend to think fundamentals matter more and you tend to think things are more contingent and random. I tend to find your causal attribution implausible in other technologies as well as AI.
There were significant capability increases between GPT-3 an GPT-3.5 (not to mention the introduction of the earlier InstructGPT training).
My position was and is:
RLHF was definitely going to be done sooner or later. (I’ve definitely never thought that RLHF would never happen.)
It’s valuable to do it earlier to get started on the next thing. It’s also good to push people to something cleaner and more flexible rather than something more hacky or with no knob to change the reward function.
We were doing it before it was a big deal commercially; it would have got done later when it mattered.
To be clear, sample efficiency might be high enough later that you just use the AI’s zero-shot predictions of humans instead of collecting any new specialized data, which we also discussed specifically at the time.
I’m pretty skeptical that no one else would do RLHF. For ChatGPT in particular, I think it was built by John Schulman’s team, and John is: (i) focused on RL, (ii) pivoted to LMs after the success of GPT-3 relative to non-LM models and would have done so without RLHF, (iii) has a similar aesthetic and would pretty obviously do this or something else equally good.
I think the most likely world where people don’t adopt RLHF is one where other hackier alternatives work just as well. And it won’t be from no one trying.
I think the big argument against impact I find most compelling is: most follow-up work to RLHF didn’t work that well for GPT-3 and seem to have started working after that, so you could have just waited until people would do it anyway and in the interim focused on approaches that work better at smaller scale. I think the big miscalculation here was that I expected debate/decomposition stuff would start working interestingly with curie-sized models but was off by about 2 orders of magnitude.
I think the big argument for negative impact comes from safety-motivated folk being involved in training language models, not the RLHF stuff. I also disagree with the rationalists about their evaluations of pretty much everything, but that one feels like a more interesting disagreement.
I don’t currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.
I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given the safety motivation by John, and my best guess being that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, that John would have indeed not worked on it.
Seeing RLHF teams in other organizations not directly downstream of your organizational involvement, or not quite directly entangled with your opinion, would make a bigger difference here.
I don’t think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn’t that crazy. The product has a user base not that different from other $10B products, and a growth rate to put basically all of them to shame, so I don’t think a $10B effect from Chat-GPT seems that unreasonable. There is only so much variance to go around, but Chat-GPT is absolutely massive in its impact.
I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It’s just that supervised fine-tuning is worse.
I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5.
This still seems totally unreasonable to me:
How much total investment do you think there is in AI in 2023?
How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.)
How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs?
I think it’s unlikely that the reception of ChatGPT increased OpenAI’s valuation by $10B, much less investment in OpenAI, even before thinking about replaceability. I think that Codex, GPT-4, DALL-E, etc. are all very major parts of the valuation.
I also think replaceability is a huge correction term here. I think it would be more reasonable to talk about moving how many dollars of investment how far forward in time.
I think John wants to make useful stuff, so I doubt this.
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don’t know what fraction of Google’s revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
Variance between different years depending on market condition and how much products take off seems like on the order of 50% to me. Like, different years have pretty hugely differing levels of investment.
My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing.
So this gives around $50B - $125B of variance to be explained by product-adjacent things like Chat-GPT.
Existence of OpenAI is hard to disentangle from the rest. I would currently guess that in terms of total investment, GPT-2 → GPT-3 made a bigger difference than GPT-3.5 → Chat-GPT, but both made a much larger difference than GPT-3 → GPT-3.5.
I don’t think Jasper made a huge difference, since its userbase is much smaller than Chat-GPT, and also evidently the hype from it has been much lower.
Good GPUs feels kind of orthogonal. We can look at each product that makes up my 50% of the variance to be explained and see how useful/necessary good GPUs were for its development, and my sense is for Chat-GPT at least the effect of good GPUs were relatively minor since I don’t think the training to move from GPT-3.5 to Chat-GPT was very compute intensive.
I would feel fine saying expected improvements in GPUs are responsible for 25% of the 50% variance (i.e. 17.5%) if you chase things back all the way, though that again feels like it isn’t trying to add up to 100% with the impact from “Chat-GPT”. I do think it’s trying to add up to 100% with the impact from “RLHF’s effect on Chat-GPT”, which I claimed was at least 50% of the impact of Chat-GPT in-particular.
In any case, in order to make my case for $10B using these numbers I would have to argue that between 20% and 8% of the product-dependent variance in annual investment into AI is downstream of Chat-GPT, and indeed that still seems approximately right to me after crunching the numbers. It’s by far the biggest AI product of the last few years, it is directly credited with sparking an arms race between Google and Microsoft, and indeed even something as large as 40% wouldn’t seem totally crazy to me, since these kinds of things tend to be heavy-tailed, so if you select on the single biggest thing, there is a decent chance you underestimate its effect.
I didn’t realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I’m probably happy to agree (and I also think it had other accelerating effects beyond that).
I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc.. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week.
I haven’t thought about it much, but my all things considered estimate for the expected timelines slowdown if you just hadn’t done the ChatGPT release is probably between 1-4 weeks.
Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?
One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for +-$10B effect sizes after the fact).
Makes sense, sorry for the miscommunication. I really didn’t feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you.
I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don’t know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I’ve historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number).
That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in expectation by something closer to 8-16 weeks, which isn’t enormously far away from yours, though still a good bit higher.
And yeah, I do think the thing I am most worried about with Chat-GPT, in addition to just shortening timelines, is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs, which have so far had a lot of independence (which, to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze-like institutions with tons of momentum). In general, the dynamics of Google and Microsoft racing towards AGI sure are among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously.
Oh, yeah, good point. I was indeed thinking of the math a bit wrong here. I will think a bit about how this adjusts my estimates, though I think I was intuitively taking this into account.
Maybe—but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.
A sudden wave of destabilizing AI breakthroughs—with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and Chat-GPT who-knows-how-many-things—can also make people on the street concerned: both more supportive of AI regulation in general and more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone speculating that this might cause a wide variety of actors—M & G included—with a desire to slow down AI progress to join forces to push for widespread regulation.
Interesting. Where did something like this happen?
I asked Chat-GPT and one of the clearest examples it came up with is patent trolling by large pharmaceutical companies. Their lobbying tends to be far more focused on securing monopoly rights to their products for as long as possible than anything related to innovation.
Other examples:
Automakers lobbying for restrictive standards for potential market disruptors like electric or self-driving vehicles
Telecoms lobbying against Net Neutrality
Taxi companies lobbying against ridesharing startups
Tech companies lobbying for intellectual property and data privacy regulations that they have better legal/compliance resources to handle
IMO it’s much easier to support high investment numbers in “AI” if you consider lots of semiconductor / AI hardware startup stuff as “AI investments”. My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing.
I’d be interested to know how you estimate the numbers here, they seem quite inflated to me.
If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 ratio of capital to salary, that investment would amount to hiring about $50B/$900k ≈ 55,000 people to work on this stuff. For reference, the total headcount at these orgs is roughly 100-200K.
$50B/yr is also around 25-50% of total income for these companies, and greater than annual profits for most of them, which again seems high.
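Spelling out that arithmetic (the $300k salary and the 2:1 capital-to-salary ratio are the assumptions from the estimate above, not reported figures):

```python
# Back-of-envelope: headcount implied by $50B/yr of AI investment per company.
# Assumptions taken from the comment above, not from reported figures.
investment_per_company = 50e9    # $50B per year
avg_salary = 300e3               # $300k average salary
capital_to_salary = 2            # $2 of capital spend per $1 of salary

cost_per_employee = avg_salary * (1 + capital_to_salary)        # $900k all-in
implied_headcount = investment_per_company / cost_per_employee  # ~55,600

print(f"{implied_headcount:,.0f} people per company")
# vs. total headcounts of roughly 100-200K at these orgs
```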
Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023.
Agree with Paul’s comment above that timeline shifts are the most important variable.
Ok, I think we might now have some additional data on this debate. It does indeed look to me like Sydney was trained with the next best available technology after RLHF, for a few months, at least based on Gwern’s guesses here: https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned?commentId=AAC8jKeDp6xqsZK2K
As far as I can tell this resulted in a system with much worse economic viability than Chat-GPT. I would overall describe Sydney as “economically unviable”, such that if Gwern’s story here is correct, the difference between using straightforward supervised training on chat transcripts and OpenAI’s RLHF pipeline is indeed the difference between an economically viable and an unviable product.
There is a chance that Microsoft fixes this with more supervised training, but my current prediction is that they will have to fix this with RLHF, because the other technological alternatives are indeed not adequate substitutes from an economic-viability perspective, which suggests that the development of RLHF really did matter a lot for this.
Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it… It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.

Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul’s comment saying “I’ve seen only modest qualitative differences” in order to disagree and say “I think we’ve now seen substantial qualitative differences”.
We have had 4chan play around with Chat-GPT for a while, with much less disastrous results than what happened when they got access to Sydney.
I wish it were true that this is not news to anyone here, but that does not currently seem to be the case to me. But it doesn’t seem worth going into.
I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul’s comparison: retrieval especially was an interesting dynamic.
For what it’s worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don’t see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.

Yep, I think it’s pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?
I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT.
My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don’t want it to do.
It depends a lot on the use case.
When it comes to what I’m doing with ChatGPT, I care more about the quality of the best answer when I generate five answers to a prompt than I care about the quality of the worst answer. I can choose the best answer myself and ignore the others.
Many use cases have ways to filter for valuable results either automatically or by letting a human filter.
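As a concrete illustration of that kind of filtering, here is a minimal best-of-n sketch; `generate` and `score` are hypothetical stand-ins for whatever sampling and quality-filtering mechanism is available (a human picking by hand, a classifier, a test suite, etc.):

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str], float],
              n: int = 5) -> str:
    """Sample n candidate answers and keep only the highest-scoring one.

    In this workflow the worst candidates are simply thrown away, so the
    quality of the best sample matters far more than the quality of the
    worst sample.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```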
Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don’t have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.
Relevant piece of data: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/?fbclid=IwAR3KTBnxC_y7n0TkrCdcd63oBuwnu6wyXcDtb2lijk3G-p9wdgD9el8KzQ4
I had some decent probability on this outcome but I have increased my previous estimate of the impact of Chat-GPT by 50%, since I didn’t expect something this radical (“the single fastest growing consumer product in history”).
That’s not always the wrong thing to do—the counterfactual impacts of many actors’ actions often sum to more than their total combined impact. A simple example: if neither of two co-founders of an impactful company would have founded it without the other, then the sum of their counterfactual impacts is twice the total impact of the company.
While I don’t have an opinion on this particular case, you could imagine that additional AI investment may not have happened if either of the following were true:
1. The original RLHF proof of concept from OpenAI didn’t happen—because Google’s leadership wouldn’t have the incentive for further investment.
2. If Google’s leadership were different—because they may not have thought to invest more money in AI.
I don’t think this is right—the main hype effect of chatGPT over previous models feels like it’s just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they’d seem similarly cool to a random journalist / VC, and generate similar excitement.
I don’t have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I’m quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
I’ve felt that ChatGPT was roughly on par with text-davinci-003, though much more annoying and with a worse interface.
That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/ChatGPT.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF’s capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.
This page also says that:
So I guess 003 could also have different base pretraining data?
[edit: this says the same thing as Quintin’s sibling comment]
Important context for those who don’t know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)
In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.
(Aside: the terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.)
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there’s no credit assignment over time, and so it doesn’t train the model to actually aim towards high-reward states.
To be clear, I’m not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It’s specifically SFT on highly-rated model outputs—i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating—which I’m calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.
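A minimal sketch of the loop I’m describing (all names here—model.sample, model.finetune, reward_fn—are hypothetical placeholders, not any particular library’s API):

```python
# Sketch of iterated SFT on highly-rated model outputs. Everything here
# (model.sample, model.finetune, reward_fn) is a hypothetical placeholder,
# not a real training API.

def iterated_high_rated_sft(model, prompts, reward_fn,
                            n_rounds=3, samples_per_prompt=8, top_frac=0.1):
    for _ in range(n_rounds):
        # 1. Have the current model produce a bunch of rollouts.
        rollouts = [(p, model.sample(p))
                    for p in prompts
                    for _ in range(samples_per_prompt)]
        # 2. Label each rollout with a reward.
        scored = [(p, y, reward_fn(p, y)) for p, y in rollouts]
        # 3. Keep only the top-rewarded rollouts...
        scored.sort(key=lambda t: t[2], reverse=True)
        top = scored[: max(1, int(len(scored) * top_frac))]
        # 4. ...and train the model to imitate them with an ordinary
        #    supervised (cross-entropy) loss, then repeat.
        model.finetune([(p, y) for p, y, _ in top])
    return model
```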
So I still feel that the way I used the term “RL” was in line with normal usage. But if people still disagree now that I’ve explained myself in more detail, I’d be interested in hearing why.
Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state makes the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there’s efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually lead to mode collapse to the highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, in a similar way to online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger), which people don’t call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it’s only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then Q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
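Here’s a minimal tabular sketch of that dynamic (the reward values and the state/action encoding are made up purely for illustration):

```python
# Tabular Q-learning on the toy example above: bootstrapping stitches together
# s1 -> s2 and s2 -> s3, even though the full trajectory s1 -> s2 -> s3 was
# never observed. Reward numbers (10 / 5 / 1) are illustrative.
from collections import defaultdict

gamma, lr = 0.9, 1.0                       # discount factor; lr = 1 for clarity
R = {"s3": 10.0, "s4": 5.0, "s5": 1.0}     # high / medium / low reward states
actions = {"s1": ["->s2", "->s4"], "s2": ["->s3", "->s5"]}
Q = defaultdict(float)

# The three observed trajectories, as (state, action, next_state) transitions.
observed = [
    [("s2", "->s3", "s3")],
    [("s1", "->s4", "s4")],
    [("s1", "->s2", "s2"), ("s2", "->s5", "s5")],
]

for _ in range(5):                         # replay the same experience a few times
    for trajectory in observed:
        for s, a, s_next in trajectory:
            target = R.get(s_next, 0.0) + gamma * max(
                (Q[(s_next, a2)] for a2 in actions.get(s_next, [])), default=0.0)
            Q[(s, a)] += lr * (target - Q[(s, a)])

print(Q[("s1", "->s2")], Q[("s1", "->s4")])  # 9.0 vs 5.0: s1 -> s2 is preferred
```

A pure REINFORCE or high-rated-SFT learner given the same three trajectories would only reinforce s1 → s4 (the best return it actually experienced from s1), until further exploration happens to produce s1 → s2 → s3.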
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because Q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
Flagging that I would find that use of the term super confusing.
To throw in another perspective, I’ve been working with the OpenAI API models most days of the week for the past year or so. For my uses, the step-change in quality came from moving from base davinci to text-davinci-002, whereas the improvements moving from that to text-davinci-003 were decidedly less clear.

I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it’s very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.
Seems like you’re implying that davinci is the base model for 002 and 003. That’s not the case; davinci has one base model (GPT-3) and then 002 and 003 share a different base model (GPT-3.5).
Fair. I think the crucial question for Ajeya & Matthew’s discussion of “Why the hype now?” is exactly how much worse the non-RLHF models that had been available since at least last March (davinci, code-davinci-002, text-davinci-002) actually were than the RLHF models made available just recently (text-davinci-003 and ChatGPT’s underlying model). I stand by the opinion that, besides the new chat stuff, most of the improvement happened within the old cohort rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.

People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and from having been finetuned on roleplay dialog. However, it’s also possible that CharacterAI’s base models are RLHF’d to be consistent roleplayers.
Would love to learn more about the model(s) behind CharacterAI. Anyone know if there’s publicly available information on them?
I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT-3 in the month prior to ChatGPT, I felt like the two interfaces were approximately as convenient).
I don’t know what mechanism was used to generate the longer coherence though.
I don’t think this is related to RLHF.
At least ChatGPT seems to have a longer context window, with this experiment suggesting 8192 tokens.