If OpenAI changed direction tomorrow, how long would that slow the progress to larger models? I can’t see it lasting; the field of AI is already incessantly moving towards scale, and big models are better. Even in a counterfactual where OpenAI never started scaling models, is this really something that no other company can gradient descent on? Models were getting bigger without OpenAI, and the hardware to do it at scale is getting cheaper.
Well, if we take this comment by gwern at face value, it clearly seems that no one with the actual resources has any interest in doing it for now. Based on these premises, scaling towards dramatically larger models would probably not have happened for years.
So I do think that if you believe this is wrong, you should be able to show where gwern’s comment is wrong.
Gwern’s claim is that these other institutions won’t scale up as a consequence of believing the scaling hypothesis; that is, they won’t bet on it as a path to AGI, and thus won’t spend this money on abstract or philosophical grounds.
My point is that this only matters on short-term scales. None of these companies are blind to the obvious conclusion that bigger models are better. The difference between a hundred-trillion dollar payout and a hundred-million dollar payout is philosophical when you’re talking about justifying <$5m investments. NVIDIA trained an 8.3B-parameter model as practically an afterthought. I get the impression Microsoft’s 17B-parameter Turing-NLG was basically trained to test DeepSpeed. As markets open up to exploit the power of these larger models, the money spent on model scaling is going to continue to rise.
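For a sense of the scale of these investments, here is a back-of-the-envelope sketch (my own, not figures from any of the papers) using the common rule of thumb that training a dense transformer costs roughly 6 × parameters × tokens FLOPs. The token counts, sustained GPU throughput, and price per GPU-hour below are illustrative assumptions only:

```python
# Back-of-the-envelope training-cost sketch. All numbers below (token counts,
# sustained throughput, price per GPU-hour) are illustrative assumptions, not
# figures reported by the papers.

def rough_training_cost_usd(params, tokens,
                            sustained_flops_per_gpu=3e13,  # ~30 TFLOP/s sustained, assumed
                            usd_per_gpu_hour=2.0):         # assumed cloud price
    """Estimate training cost via the common ~6 * params * tokens FLOPs rule."""
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / sustained_flops_per_gpu
    return gpu_seconds / 3600 * usd_per_gpu_hour

# Hypothetical token counts, chosen only to put the models on a common footing.
for name, params, tokens in [
    ("Megatron-LM, 8.3B", 8.3e9, 150e9),
    ("Turing-NLG, 17B",   17e9,  150e9),
    ("GPT-3, 175B",       175e9, 300e9),
]:
    print(f"{name}: ~${rough_training_cost_usd(params, tokens):,.0f}")
```

Even with generous error bars on these assumptions, runs at this scale are rounding errors next to the potential payoff.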
These companies aren’t competing with OpenAI. They’ve built these incredibly powerful systems incidentally, because it’s the obvious way to do better than everyone else. It’s a tool they use for market competitiveness, not as a fundamental insight into the nature of intelligence. OpenAI’s key differentiator is only that they view scale as integral and explanatory, rather than an incidental nuisance.
With this insight, OpenAI can make moonshots that the others can’t: build a huge model, scale it up, and throw money at it. Without this understanding, others will only get there piecewise, scaling up one paper at a time. The delta between the two is at most a handful of years.
The scaling hypothesis implies that it’ll happen eventually, yes: but the details matter a lot. One way to think of it is Eliezer’s quip: the IQ necessary to destroy the world drops by 1 point per year. Similarly, to do scaling or bitter-lesson-style research, you need resources * fanaticism < a constant. This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet. Countless entities and companies could have used this ‘obvious way to do better than everyone else, for market competitiveness’ for years, or even decades, beforehand. But they didn’t.
For the question of who gets there first, ‘a handful of years’ is decisive. So this is pretty important if you want to think about the current plausible AGI trajectories, which for many people (even excluding individuals like Moravec, or Shane Legg, who has projected out to ~2028 for a long time now) have shrunk rapidly to timescales on which ‘a handful of years’ represents a large fraction of the remaining timeline!
Incidentally, it has now been 86 days since the GPT-3 paper was uploaded, or a quarter of a year. Excluding GShard (which as a sparse model is not at all comparable parameter-wise), as far as I know no one has announced any new (dense) models which are even as large as Turing-NLG—much less larger than GPT-3.
Similarly, to do scaling or bitter-lesson-style research, you need resources * fanaticism < a constant. This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet.
A fairly minor point, but I don’t quite follow the formula / analogy. Don’t resources and fanaticism help you do the scaling research? So shouldn’t it be a > sign rather than <, and shouldn’t we say that the constant is large rather than small?
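For what it’s worth, one way I can imagine reconciling the sign, purely as an interpretive sketch and not necessarily the intended reading, is that the direction depends on what ‘resources’ refers to:

```latex
% Interpretive sketch only; not necessarily the reading intended above.
% (a) If "resources" means the cost an experiment demands, the condition for
%     someone, somewhere, to actually run it is a ceiling on the product:
\[ \mbox{cost of the experiment} \times \mbox{conviction it demands} < k \]
% (b) If "resources" means what a lab has on hand, the natural statement is a
%     floor instead:
\[ \mbox{resources available} \times \mbox{fanaticism available} > k' \]
% Reading (a) fits the AlexNet example: only once compute fell to ~$1k did the
% left-hand side dip below the threshold for anyone at all.
```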
I agree this makes a large fractional change to some AI timelines, and has significant impacts on questions like ownership. But even on very short timescales, while I can see that OpenAI halting their work would change ownership, presumably to some worse steward, I don’t see the gap being large enough to materially affect alignment research. That is, it’s better that OpenAI gets it in 2024 than that someone else gets it in 2026, but the two extra years themselves don’t change the alignment picture much.
This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet.
It’s hard to be fanatical when you don’t have results. Nowadays AI is so successful it’s hard to imagine this being a significant impediment.
Excluding GShard (which as a sparse model is not at all comparable parameter-wise)
I wouldn’t dismiss GShard altogether. The parameter counts aren’t equal, but MoE(2048E, 60L) is still a beast, and it opens up room for more scaling than a standard model.
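To make that concrete, here is a minimal sketch of why sparse parameter counts aren’t directly comparable to dense ones. The layer dimensions below are hypothetical, chosen for illustration rather than taken from GShard; the only thing carried over is the 2048-expert width, and top-2 routing is assumed:

```python
# Minimal sketch of why sparse (MoE) parameter counts aren't comparable to
# dense ones. Dimensions are hypothetical, not GShard's actual configuration.

def moe_param_counts(d_model, d_ff, n_experts, experts_per_token=2):
    """Per-layer FFN parameters: total stored vs. actually used per token."""
    expert_params = 2 * d_model * d_ff          # one expert's two weight matrices
    total = n_experts * expert_params           # parameters stored in the layer
    active = experts_per_token * expert_params  # parameters a single token touches
    return total, active

# Hypothetical transformer-ish dimensions with 2048 experts per MoE layer.
total, active = moe_param_counts(d_model=1024, d_ff=8192, n_experts=2048)
print(f"stored: {total/1e9:.1f}B params per MoE layer")
print(f"active: {active/1e6:.1f}M params per token per layer")
# Per-token compute scales with the *active* count, so total parameters can grow
# with the number of experts while per-token cost stays roughly flat.
```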