The scaling hypothesis implies that it’ll happen eventually, yes: but the details matter a lot. One way to think of it is Eliezer’s quip: the IQ necessary to destroy the world drops by 1 point per year. Similarly, to do scaling or bitter-lesson-style research, you need resources * fanaticism < a constant. This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet. Countless entities and companies could have used this ‘obvious way to do better than everyone else, for market competitiveness’ for years, or decades, beforehand. But they didn’t.
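To make the dynamic concrete, here is a minimal sketch of one way to read the inequality (my gloss, not necessarily the intended one): treat ‘resources’ as the dollar cost of the compute an attempt needs, assume that cost falls exponentially, and say the attempt happens once the product with the required fanaticism dips below the constant. Every number below is made up purely for illustration.

```python
# Toy model of "resources * fanaticism < a constant" -- all values are
# illustrative assumptions, not figures from the discussion above.

HALVING_YEARS = 2.0   # assumed halving time of compute cost
COST_2000 = 50_000.0  # assumed dollar cost of the experiment in the year 2000
FANATICISM = 1.0      # assumed fanaticism required of the researcher
THRESHOLD = 1_000.0   # the (small) constant; roughly the ~$1k AlexNet point

def compute_cost(year: float) -> float:
    """Dollar cost of the scaling experiment, falling exponentially over time."""
    return COST_2000 * 0.5 ** ((year - 2000) / HALVING_YEARS)

# Find the first year the product drops below the constant, i.e. the year
# someone sufficiently fanatical finally bothers to try.
for year in range(2000, 2030):
    if compute_cost(year) * FANATICISM < THRESHOLD:
        print(f"First attempt becomes 'worth it' around {year}")
        break
```

With these made-up numbers the crossing lands around 2012, but the only point is the shape of the argument: a steadily falling cost curve crossing a fixed, small threshold, with ‘who gets there first’ decided by whoever is fanatical enough to act at the moment of crossing.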
For the question of who gets there first, ‘a handful of years’ is decisive. So this is pretty important if you want to think about the current plausible AGI trajectories, which for many people (even excluding individuals like Moravec, or Shane Legg, who has projected out to ~2028 for a long time now) have shrunk rapidly to timescales on which ‘a handful of years’ represents a large fraction of the remaining timeline!
Incidentally, it has now been 86 days since the GPT-3 paper was uploaded, or a quarter of a year. Excluding GShard (which as a sparse model is not at all comparable parameter-wise), as far as I know no one has announced any new (dense) models which are even as large as Turing-NLG—much less larger than GPT-3.
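As a quick sanity check on the day count (assuming ‘uploaded’ means the arXiv v1 submission of 2005.14165 on 2020-05-28; the comment’s own date is inferred from that, not stated anywhere):

```python
from datetime import date, timedelta

gpt3_upload = date(2020, 5, 28)          # arXiv:2005.14165 v1 submission date
print(gpt3_upload + timedelta(days=86))  # 2020-08-22, roughly a quarter of a year later
```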
Similarly, to do scaling or bitter-lesson-style research, you need resources * fanaticism < a constant. This constant seems to be very small, which is why compute had to drop
A fairly minor point, but I don’t quite follow the formula / analogy. Don’t resources and fanaticism help you do the scaling research? So shouldn’t it be a > sign rather than <, and shouldn’t we say that the constant is large rather than small?
I agree this makes a large fractional change to some AI timelines, and has significant impacts on questions like ownership. But when considering very short timescales, while I can see that OpenAI halting its work would change ownership, presumably to some worse steward, I don’t see the gap being large enough to materially affect alignment research. That is, it’s better that OpenAI gets it in 2024 than that someone else gets it in 2026.
This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet.
It’s hard to be fanatical when you don’t have results. Nowadays AI is so successful it’s hard to imagine this being a significant impediment.
Excluding GShard (which as a sparse model is not at all comparable parameter-wise)
I wouldn’t dismiss GShard altogether. The parameter counts aren’t equal, but MoE(2048E, 60L) is still a beast, and it opens up room for more scaling than a standard model.
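As a hedged back-of-the-envelope on why sparse parameter counts aren’t directly comparable to dense ones (the widths and layer split below are placeholder numbers, not GShard’s published configuration): in a mixture-of-experts layer every expert’s weights count toward the headline parameter total, but each token is routed to only a couple of experts, so the parameters actually touched per token look like a much smaller dense model.

```python
# Illustrative-only numbers -- not GShard's actual configuration.
d_model    = 1024    # hypothetical model width
d_ff       = 8192    # hypothetical expert feed-forward width
experts    = 2048    # experts per MoE layer (the "2048E")
moe_layers = 30      # hypothetical: assume every other layer of 60 is MoE
top_k      = 2       # experts consulted per token

per_expert = 2 * d_model * d_ff                 # two FFN weight matrices per expert
total_moe  = per_expert * experts * moe_layers  # parameters that merely exist
active_moe = per_expert * top_k * moe_layers    # parameters one token actually uses

print(f"total MoE parameters:      {total_moe / 1e9:.0f}B")
print(f"active per token (approx): {active_moe / 1e9:.2f}B")
```

That gap between total and active parameters is both why the raw count isn’t comparable to a dense model’s and why the MoE route leaves so much headroom for further scaling.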