If this is about work related to Q-learning for small models, I would not expect OpenAI to achieve better results than DeepMind, since the latter has much deeper expertise in reinforcement learning, and has been committed to this direction since the start.
OpenAI was all-in on RL in 2015–18, until transformers were discovered.
It was, but think about how much turnover there has been and how long ago that was. The majority of OAers have been there for maybe a year or two*, never mind 5 years (at the tail end!) or counting defections to Anthropic etc. (And then there is the tech stack: all of that DRL work was on GCP, not Azure. It was in TensorFlow, not PyTorch. It used RNNs, not Transformers. Essentially, none of that code is usable or even runnable by now without extensive maintenance or rewrites.) Meanwhile, DM has been continuously doing DRL of some sort the entire time, with major projects like AlphaStar or their multi-agent work.
* Possibly actually less than a year, since there are headcount figures like '200–300' for 2022/2023, while there are 700+ signatures on the letter. Considering that the OA valuation tripled or more in that hiring interval, people there must feel like they won the lottery...
John Schulman is at OpenAI.
OpenAI sometimes gets there faster. But I think other players will catch up soon, if it's a simple application of RL to LLMs.
simple != catch-up-soon
‘Simply apply this RL idea to LLMs’ is much more useless than it seems. People have been struggling to apply RL methods to LLMs, or reinventing them the hard way, for years now; it’s super obvious that just prompting a LLM and then greedily sampling is a hilariously stupid-bad way to sample a LLM, and using better sampling methods has been a major topic of discussion among everyone using LLMs since at least GPT-2. Yet that naive prompt-and-greedily-sample approach just somehow manages to work better than almost anything else. Sutskever has been screwing around with self-play and math for years since GPT-2; see GPT-f etc. But all the publicly-known results have been an incremental grind… until now?
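(To make the sampling point concrete, here is a toy contrast between greedy decoding and sampling several candidates and then reranking them with an external scorer. This is only an illustrative sketch: the `TOY_LM` dict and `verifier_score` function are made-up stand-ins for a real LLM and a real verifier/reward model, not anyone's actual setup.)

```python
# Toy contrast, not anyone's actual method: greedy decoding vs. sampling
# several candidates and reranking them with an external scorer. TOY_LM and
# verifier_score are made-up stand-ins for a real LLM and a real verifier.
import random

# Fake "LLM": maps a context string to a distribution over next tokens.
TOY_LM = {
    "":        {"2": 0.6, "the": 0.4},
    "2":       {"+": 1.0},
    "2 +":     {"2": 1.0},
    "2 + 2":   {"=": 1.0},
    "2 + 2 =": {"4": 0.3, "5": 0.7},  # the wrong answer is the likelier token
}

def step(ctx):
    return TOY_LM.get(ctx, {"<eos>": 1.0})

def decode(pick, max_len=8):
    ctx = ""
    for _ in range(max_len):
        dist = step(ctx)
        tok = pick(dist)
        if tok == "<eos>":
            break
        ctx = (ctx + " " + tok).strip()
    return ctx

def greedy(dist):
    return max(dist, key=dist.get)                 # always take the argmax token

def sample(dist):
    toks, probs = zip(*dist.items())
    return random.choices(toks, weights=probs)[0]  # draw from the distribution

def verifier_score(text):
    # Hypothetical external checker: rewards arithmetically correct completions.
    return 1.0 if text.endswith("2 + 2 = 4") else 0.0

def best_of_n(n=16):
    return max((decode(sample) for _ in range(n)), key=verifier_score)

print("greedy:   ", decode(greedy))  # "2 + 2 = 5", confidently wrong
print("best-of-n:", best_of_n())     # almost always "2 + 2 = 4"
```

Greedy decoding commits to the single likeliest token at every step, so it can never recover when the model puts most of its mass on the wrong continuation; even a crude outer loop with a checker beats it on this toy, which is the part that has been obvious since GPT-2. The hard part is a method that does substantially better than this sort of brute-force rerank.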
So the application may well be simple, perhaps a tiny tweak of an equation somewhere and people will rush to pull up all the obscure work which preceded it and how ‘we knew it all along’*, but that’s an entirely different thing from it being easy to reinvent.
* Cowen’s second law—happened with AlphaZero, incidentally, with ‘expert iteration’. Once they had invented expert iteration from scratch, suddenly, everyone could come up with a dozen papers that it ‘drew on’ and showed that it was ‘obvious’. (Like a nouveau riche buying an aristocratic pedigree.)
I agree, which is why I don't expect OpenAI to do better. If both teams are tweaking equations here and there, then based on their prior work I expect DeepMind to do it more efficiently. OpenAI has historically been luckier, but luck is not a quality I would extrapolate over time.
You may not expect OA to do better (neither did I, even if I expected someone somewhere to crack the problem within a few years of GPT-3), but that’s not the relevant fact here, now that you have observed that apparently there’s this “Q*” & “Zero” thing that OAers are super-excited about. It was hard work and required luck, but apparently they got lucky. It is what it is. (‘Update your priors using the evidence to obtain a new posterior’, as we like to say around here.)
How much does that help someone else get lucky? Well, it depends on how much they leak or publish. If it’s like the GPT-3 paper, then yeah, people can replicate it quickly and are sufficiently motivated these days that they probably will. If it’s like the GPT-4 “paper”, well… Knowing someone else has won the lottery of tweaking equations at random here & there doesn’t help you win the lottery yourself.
(The fact that self-play or LLM search of some sort works is not that useful; we all knew it has to work somehow! It’s the vital details, the secret sauce, that probably matter here. How exactly does their particular variant thread the needle’s eye to avoid diverging or plateauing, etc.? Remember Karpathy’s law: “neural nets want to work”. So even if your approach is badly broken, it can mislead you for a long time by working better than it has any right to.)
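To illustrate why the skeleton is not the secret: the generic self-play / verify / retrain loop that everyone "knew" had to exist is trivial to write down, as in the toy sketch below. `generate`, `verify`, and `finetune` are hypothetical stand-ins, and nothing here is the actual method; whatever value there is lives in the details the inline comments only gesture at.

```python
# Toy sketch of the generic "self-play / verify / retrain" skeleton for an LLM.
# Nothing here is OpenAI's (or anyone's) actual method: generate, verify, and
# finetune are hypothetical stand-ins. The loop is the easy, obvious part.
import random

def generate(policy, problem, n=8):
    # Stand-in for sampling n candidate solutions from the current model.
    return [f"{problem}/attempt-{random.randint(0, policy['skill'])}" for _ in range(n)]

def verify(candidate):
    # Stand-in for an external checker (proof checker, unit tests, calculator...).
    return not candidate.endswith("-0")

def finetune(policy, good_examples):
    # Stand-in for updating the model on its own verified outputs.
    policy["skill"] += len(good_examples)
    return policy

policy = {"skill": 1}
problems = ["p1", "p2", "p3"]

for it in range(5):
    kept = []
    for problem in problems:
        kept += [c for c in generate(policy, problem) if verify(c)]
    # The secret sauce would live here: how to filter and weight the data so the
    # model doesn't collapse onto its own mistakes, how to keep it from
    # plateauing once the easy problems are exhausted, how to schedule
    # difficulty, when to stop trusting the verifier, and so on.
    policy = finetune(policy, kept)
    print(f"iteration {it}: kept {len(kept)} verified samples, skill = {policy['skill']}")
```

Written like this it looks like a weekend project, and Karpathy's law is exactly why such a loop can appear to be working while silently doing nothing useful.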