Yeah, I do remember NVIDIA claiming they could do 100T param models by 2023. Not a quadrillion though IIRC.
However, (a) this may be just classic overoptimistic marketing bullshit, and thus we should expect it to be off by a couple of years, and (b) they may have been including Mixture-of-Experts models, in which case 100T parameters is much less of a big deal. To my knowledge, a 100T-parameter MoE model would be a lot cheaper (in compute and thus money) to train than a 100T-parameter dense model like GPT, but its performance would also be significantly worse. If I’m wrong about this I’d love to hear why!
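For concreteness, here is the back-of-envelope arithmetic behind the “MoE is much cheaper per parameter” claim, using the standard ~6·N·D approximation for training FLOPs; the token budget and the active-parameter fraction below are my own assumptions, purely for illustration:

```python
# Training compute ~ 6 * (active params) * (tokens). A dense model
# activates every parameter per token, while a sparse MoE activates
# only the routed experts' parameters, so compute tracks *active*
# rather than *total* parameter count.

def train_flops(active_params: float, tokens: float) -> float:
    """Standard 6*N*D approximation for transformer training FLOPs."""
    return 6 * active_params * tokens

TOKENS = 300e9  # GPT-3-scale token budget (assumption)

dense_flops = train_flops(100e12, TOKENS)         # 100T dense: all params active
moe_flops   = train_flops(100e12 / 1000, TOKENS)  # 100T MoE, ~0.1% active (assumed routing)

print(f"dense: {dense_flops:.1e} FLOPs, MoE: {moe_flops:.1e} FLOPs, "
      f"~{dense_flops / moe_flops:.0f}x cheaper")
```

On these made-up numbers the MoE is ~1,000× cheaper to train at the same headline parameter count, which is why a “100T MoE” is a far weaker claim than a 100T dense model would be.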
Given the timing of Jensen’s remarks about expecting trillion-parameter+ models, and the subsequent MoEs of Switch & Wudao (1.2t) and embedding-heavy models like DLRM (12t) while dense models stayed stuck at GPT-3 scale, I’m now sure that he was referring to MoEs/embeddings, so a 100t MoE/embedding is plausible but not terribly interesting. (I’m sure Facebook would love to scale up DLRM another 10x and have embeddings for every SKU and Internet user and URL and video and book and song in the world, that sort of thing, but it will mean relatively little for AI capabilities or risk.) After all, he never said they were dense models, and the source in question is marketing, which can be assumed to accentuate the positive.
More generally, it is well past time to drop discussion of parameters and switch to compute-only, as we can create models with more parameters than we can train (you can fit a 100t-param model with ZeRO into your cluster? great! how are you gonna train it? just leave it running for the next decade or two?), and we have no shortage of Internet data either: compute, compute, compute! It’ll only get worse if some new architecture with fast weights comes into play and we have to start counting runtime-generated parameters as ‘parameters’ too. (eg Schmidhuber back in like the ’00s showed off archs which used… Fourier transforms? to have thousands of weights generate hundreds of thousands of weights or something. Think stuff like hypernetworks. ‘Parameter’ will mean even less than it does now.)
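To put rough numbers on the “decade or two” quip (the cluster size and per-GPU throughput here are assumptions for the sketch, not real figures):

```python
# Why fitting parameters != training them: wall-clock time to train a
# 100T-parameter dense model, under assumed hardware numbers.
SECONDS_PER_YEAR = 3.15e7

flops_needed = 6 * 100e12 * 300e9        # 6*N*D with only GPT-3's ~300b tokens
cluster_flops_per_sec = 10_000 * 100e12  # 10k GPUs at ~100 TFLOP/s effective (assumed)

years = flops_needed / cluster_flops_per_sec / SECONDS_PER_YEAR
print(f"~{years:.1f} years of continuous training")
```

And that is while holding the token budget fixed at GPT-3’s, rather than scaling data up with model size as the scaling laws suggest you should; do that and you really are into decades.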
Wouldn’t that imply that the trajectory of AI is heavily dependent on how long Moore’s Law lasts, and how well quantum computers do?
Is your model that the jump to GPT-3 scale consumed the hardware overhang, and that we cannot expect meaningful progress on the same time scale in the near future?
Moore: yes.
QC: AFAIK it’s irrelevant?
GPT-3: it used up a particular kind of overhang you might call the “small-scale industrial CS R&D budget hardware overhang”. (It would certainly be possible to make much-greater-than-GPT-3-level progress, but you’d need vastly larger budgets: say, 10% of a failed erectile-dysfunction drug candidate, or 0.1% of the money it takes to run a failed European fusion reactor or particle collider.) So I continue to stand by my scaling-hypothesis essay’s paradigm: as expected, we saw some imitation and catch-up, but no one created a model much bigger than GPT-3, never mind one >100x bigger the way GPT-3 was compared to GPT-2-1.5b, because no one at the relevant corporations truly believes in scaling, wishes to commit the necessary resources, or feels that it’s near a crunchtime where there might be a rush to train a model at the edge of the possible; and OA itself has been resting on its laurels as it turns into a SaaS startup. (We’ll see what the Anthropic refugees choose to do with their $124m seed capital, but so far they appear to be making a relaxed start of it as well.)
The overhang GPT-3 used up should not be confused with other overhangs. There are many other hardware overhangs of interest: the hardware overhang of the experience curve where the cost halves every year or two; the hardware overhang of a distilled/compressed/sparsified model; the hardware overhang of the global compute infrastructure available to a rogue agent. The small-scale industrial R&D overhang is the relevant and binding one… for now. But the others become relevant later on, under different circumstances, and many of them keep getting bigger.
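The experience-curve overhang in particular compounds in a simple way; a sketch, with the halving period an assumed rate rather than a measured one:

```python
# If cost per FLOP halves every ~2 years (assumed rate), a fixed dollar
# budget buys exponentially more compute over time -- the overhang keeps
# growing even if nobody spends a cent more.
def budget_multiplier(years: float, halving_period: float = 2.0) -> float:
    """How much more compute the same budget buys after `years`."""
    return 2.0 ** (years / halving_period)

print(budget_multiplier(10))  # a decade later, the same budget buys 32x the compute
```

So a training run that is a heroic, budget-straining effort today becomes a routine expense a decade later, which is what makes the other overhangs “keep getting bigger”.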
Why would QC be irrelevant? Quantum systems don’t perform well on all tasks, but they generally work well for parallel tasks, right? And neural nets are largely parallel. QC isn’t to the point of being able to help yet, but especially if conventional computing becomes a serious bottleneck, it might become important over the next decade.
I think the only known quantum speedup for relatively generic tasks comes from Grover’s algorithm, which gives only a quadratic speedup. That might be significant some day, or not, depending on the cost of quantum hardware. When it comes to superpolynomial speedups, which tasks admit them is very much an active field of study, and as far as we know it’s only some very specialized tasks like integer factoring. A bunch of people are trying to apply QC to ML, but AFAIK it’s still anyone’s guess whether that will end up being significant.
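To see what “only quadratic” means concretely: unstructured search over N items takes ~N/2 oracle queries in expectation classically versus ~(π/4)·√N for Grover, so the gap is real but grows slowly with problem size:

```python
import math

# Unstructured search over N items: expected classical query count vs
# Grover's ~(pi/4)*sqrt(N) -- a quadratic, not exponential, gap.
def classical_queries(n: int) -> float:
    return n / 2

def grover_queries(n: int) -> float:
    return (math.pi / 4) * math.sqrt(n)

for n in (10**6, 10**12):
    print(f"N=1e{round(math.log10(n))}: classical ~{classical_queries(n):.1e}, "
          f"Grover ~{grover_queries(n):.1e}")
```

Whether that quadratic advantage ever overcomes the constant-factor overheads of quantum hardware (error correction, slow gate times) is exactly the cost question that remains open.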
And some past QC claims for ML have not panned out: I think there was a quantum Monte Carlo method claimed to be potentially useful for ML, which could be done on cheaper QC archs, but it then turned out to be doable classically...? In any case, I have been reading about QCs all my life, and they have yet to become relevant to anything I care about; I assume Scott Aaronson will alert us should they suddenly become relevant to AI/ML/DL, so the rest of us can go about our lives until that day.