OpenAI sometimes gets there faster. But I think other players will catch up soon, if it’s a simple application of RL to LLMs.
simple != catch-up-soon
‘Simply apply this RL idea to LLMs’ is much more useless than it seems. People have been struggling to apply RL methods to LLMs, or reinventing them the hard way, for years now; it’s super obvious that just prompting a LLM and then greedily sampling is a hilariously stupid-bad way to sample from a LLM, and better sampling methods have been a major topic of discussion for everyone using LLMs since at least GPT-2. Yet naive sampling somehow manages to work better than almost anything else people have tried. Sutskever has been screwing around with self-play and math for years, since GPT-2; see GPT-f etc. But all the publicly-known results have been an incremental grind… until now?
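To make the sampling point concrete, here is a minimal sketch contrasting greedy decoding with temperature sampling plus best-of-n reranking against an outcome check. Everything in it (`next_token_logits`, `score`, the 4-token vocabulary) is a hypothetical toy stand-in for a real LLM and verifier, not anyone’s actual pipeline:

```python
import math
import random

def next_token_logits(prefix: list[int]) -> list[float]:
    """Pretend LM: deterministic logits over a 4-token vocabulary."""
    rng = random.Random(sum(prefix) * 31 + len(prefix))
    return [rng.gauss(0, 1) for _ in range(4)]

def score(seq: list[int]) -> float:
    """Pretend outcome verifier (stands in for 'did the proof check?')."""
    return -abs(sum(seq) - 12)  # arbitrary toy objective: sums near 12 are 'good'

def greedy_decode(steps: int = 8) -> list[int]:
    """Take the argmax token at every step: cheap, deterministic, often bad."""
    seq: list[int] = []
    for _ in range(steps):
        logits = next_token_logits(seq)
        seq.append(max(range(len(logits)), key=lambda i: logits[i]))
    return seq

def sample_decode(steps: int = 8, temperature: float = 1.0) -> list[int]:
    """Sample from the softmax distribution at every step."""
    seq: list[int] = []
    for _ in range(steps):
        logits = next_token_logits(seq)
        weights = [math.exp(l / temperature) for l in logits]
        seq.append(random.choices(range(len(weights)), weights=weights)[0])
    return seq

def best_of_n(n: int = 16) -> list[int]:
    """Draw n samples and keep the one the verifier likes best."""
    return max((sample_decode() for _ in range(n)), key=score)

print("greedy score:    ", score(greedy_decode()))
print("best-of-16 score:", score(best_of_n()))
```

Greedy decoding commits irrevocably to a single trajectory; even this crude search over samples lets an outcome check do the steering. The hard part is going beyond simple reranking without the training loop eating itself.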
So the application may well be simple, perhaps a tiny tweak of an equation somewhere, and people will rush to pull up all the obscure work which preceded it to show how ‘we knew it all along’*, but that’s an entirely different thing from it being easy to reinvent.
* Cowen’s second law—this happened with AlphaZero, incidentally, with ‘expert iteration’. Once DeepMind had invented expert iteration from scratch, suddenly everyone could come up with a dozen papers that it ‘drew on’ and which showed that it was ‘obvious’. (Like a nouveau riche buying an aristocratic pedigree.)
I agree, which is why I don’t expect OpenAI to do better. If both teams are tweaking equations here and there, then based on their prior work I’d expect DeepMind to do it more efficiently. OpenAI has been historically luckier, but luck is not a quality I would extrapolate over time.
You may not expect OA to do better (neither did I, even though I expected someone somewhere to crack the problem within a few years of GPT-3), but that’s not the relevant fact here, now that you have observed that there is apparently this “Q*” & “Zero” thing that OAers are super-excited about. It was hard work and required luck, but apparently they got lucky. It is what it is. (‘Update your priors using the evidence to obtain a new posterior’, as we like to say around here.)
How much does that help someone else get lucky? Well, it depends on how much they leak or publish. If it’s like the GPT-3 paper, then yeah, people can replicate it quickly and are sufficiently motivated these days that they probably will. If it’s like the GPT-4 “paper”, well… Knowing someone else has won the lottery of tweaking equations at random here & there doesn’t help you win the lottery yourself.
(The fact that self-play or LLM search of some sort works is not that useful—we all knew it had to work somehow! It’s the vital details that are the secret sauce here. How exactly does their particular variant thread the needle’s eye to avoid diverging or plateauing? Remember Karpathy’s law: “neural nets want to work”. So even if your approach is badly broken, it can mislead you for a long time by working better than it has any right to.)
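To make “the vital details” concrete, here is a toy expert-iteration loop of the generic AlphaZero-ish shape: search generates candidates, a verifier filters them, and the policy is trained on the survivors so the next round of search starts from a stronger prior. Every name in it (`verify`, `search`, `distill`) is a hypothetical stand-in; this is a sketch of the shape of such methods, not of whatever “Q*” actually is:

```python
import random

def verify(problem: int, answer: int) -> bool:
    """Toy ground-truth checker (stands in for a proof checker or unit test)."""
    return answer == (problem * 7) % 100

def search(policy: dict[int, int], problem: int, n: int = 32) -> list[int]:
    """Sampling biased toward what the policy already 'knows'."""
    known = policy.get(problem)
    return [known if known is not None and random.random() < 0.8
            else random.randint(0, 99)
            for _ in range(n)]

def distill(policy: dict[int, int], problem: int, verified: list[int]) -> None:
    """'Training' step: a real system would fine-tune a LLM on the kept traces."""
    if verified:
        policy[problem] = verified[0]

policy: dict[int, int] = {}
problems = range(20)
for it in range(5):
    solved = 0
    for p in problems:
        # The needle's eye: which attempts you keep (thresholds, dedup,
        # data mix) is exactly the kind of detail that decides whether the
        # loop improves, plateaus, or quietly diverges.
        verified = [a for a in search(policy, p) if verify(p, a)]
        distill(policy, p, verified)
        solved += bool(verified)
    print(f"iteration {it}: solved {solved}/20")
```

Even in this toy, the interesting behavior lives in the filtering line: loosen the verifier or skew the kept data and the loop plateaus or quietly collapses instead of improving, which is exactly the kind of detail a vague leak doesn’t give you.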