Human-scale neural nets would be 3 OOMs bigger than GPT-3; a quadrillion parameters would be 1 OOM bigger still. According to the scaling laws and empirical compute-optimal scaling trends, it seems that anyone training a net 3 OOMs bigger than GPT-3 would also train it for, like, 2 OOMs longer, for a total of +5 OOMs of compute. For a quadrillion-parameter model, we’re looking at +6 OOMs or so.
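To make the OOM arithmetic concrete, here is a minimal back-of-envelope sketch. It assumes the standard dense-transformer approximation C ≈ 6 · N · D (N = parameters, D = training tokens) and ballpark GPT-3 figures (~175B parameters, ~300B tokens); the token counts for the larger runs are illustrative extrapolations, not numbers from this thread:

```python
import math

def training_flops(params, tokens):
    """Rough dense-transformer training cost: C ~ 6 * N * D FLOPs."""
    return 6 * params * tokens

# Baseline: a GPT-3-ish run (~175B params, ~300B tokens).
gpt3 = training_flops(175e9, 300e9)

# "Human-scale" net: ~3 OOMs more parameters, trained ~2 OOMs longer.
human_scale = training_flops(175e12, 30e12)

# Quadrillion-parameter net: ~0.75 OOMs more params again, with more tokens too
# (token counts here are illustrative, not exact scaling-law outputs).
quadrillion = training_flops(1e15, 70e12)

for name, c in [("GPT-3", gpt3), ("human-scale", human_scale), ("1e15 params", quadrillion)]:
    print(f"{name}: {c:.2e} FLOPs, +{math.log10(c / gpt3):.1f} OOMs vs GPT-3")
```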
There’s just no way that’s possible by 2023. GPT-3 apparently cost millions of dollars of compute to train; +6 OOMs would be trillions. Presumably algorithmic breakthroughs will lower the training cost a bit, and hardware improvements will lower the compute cost, but I highly doubt we’d get 3 OOMs of lower cost by 2023. So we’re looking at a 10-billion-dollar price tag, give or take. I highly doubt anyone will be spending that much in 2023, and even if someone did, I am skeptical that the computing infrastructure for such a thing will have been built in time. I don’t think there are compute clusters a thousand times bigger than the one GPT-3 was trained on (though I might be wrong), and even if there were, to achieve your prediction we’d need one tens or hundreds of thousands of times bigger.
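The dollar arithmetic works the same way; a minimal sketch, assuming a ~$5M GPT-3 training run (a commonly cited ballpark, not a figure from this thread) and granting the hoped-for 3 OOMs of combined algorithmic and hardware cost reduction:

```python
# Hypothetical baseline cost for the GPT-3 run; the exact figure is an assumption.
gpt3_cost_usd = 5e6

# Naive cost of +6 OOMs of compute at today's prices: trillions of dollars.
naive_cost = gpt3_cost_usd * 10**6        # ~$5 trillion

# Even granting 3 OOMs of cheaper training by 2023 (algorithms + hardware),
# the bill is still on the order of $10 billion, give or take.
optimistic_cost = naive_cost / 10**3      # ~$5 billion

print(f"naive: ${naive_cost:.1e}, with 3 OOMs of cost reduction: ${optimistic_cost:.1e}")
```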
On alignment optimism: As I see it, three things need to happen for alignment to succeed.
1. A company that is sympathetic to alignment concerns has to have a significant lead-time over everyone else (before someone replicates or steals code etc.), so that they can do the necessary extra work and spend the extra time and money needed to implement an alignment solution.
2. A solution needs to be found that can be implemented in that amount of lead-time.
3. This solution needs to be actually chosen and implemented by the company, rather than some other, more appealing but incorrect solution. (There will be dozens of self-proclaimed experts pitching dozens of proposed solutions to the problem, each of which will be incorrect by default. The correct one needs to actually rise to the top in the eyes of the company leaders, which is hard since the company leaders don’t know much alignment literature and may not be able to judge good from bad solutions.)
On 1: In my opinion there are only 3 major AI projects sympathetic to alignment concerns, and the pace of progress is such (and the state of security is such) that they’ll probably have less than six months of lead time.
On 2: In my opinion we are not at all close to finding a solution that works even in principle; finding one that works in six months is even harder.
On 3: In my opinion there is only 1 major AI project that has a good chance of distinguishing viable solutions from fake solutions and actually implementing one, rather than dragging its feet or convincing itself that the danger is still in the future rather than now (e.g. “takeoff is supposed to be slow, we haven’t seen any warning shots yet, this system can’t be that dangerous yet”).
Currently, the probability of all three things happening seems to me to be <1%. Happily there’s model uncertainty, unknown unknowns, etc., which is why I’m not quite that pessimistic. But still, it’s pretty scary.
Trillions of dollars for +6 OOMs is not something people are likely to be willing to spend by 2023. On the other hand, part of the reason that neural net sizes have consistently increased by one to two OOMs per year lately is advances in training and running them cheaply. Programs like Microsoft’s ZeRO system aim explicitly at enabling nets on the hundred-trillion-parameter scale at an acceptable price. Certainly there’s uncertainty around how well it will work, and whether it will be extended to a quadrillion parameters even if it does, but parts of the industry appear to believe it’s practical.
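For context on what ZeRO actually buys: its core trick is partitioning the model states (fp16 parameters, fp16 gradients, fp32 optimizer states) across all data-parallel GPUs instead of replicating them. A minimal sketch of the per-GPU memory arithmetic, using the ~16-bytes-per-parameter accounting from the ZeRO paper for mixed-precision Adam; the cluster sizes are illustrative assumptions:

```python
def zero3_memory_per_gpu_gb(n_params, n_gpus, bytes_per_param=16):
    """Rough ZeRO stage-3 model-state memory per GPU.

    Assumes the ZeRO paper's accounting for mixed-precision Adam:
    ~2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes (fp32 optimizer
    states) = ~16 bytes per parameter, partitioned across all data-parallel GPUs.
    Ignores activations, buffers, and communication overhead.
    """
    return n_params * bytes_per_param / n_gpus / 2**30

# Illustrative cluster sizes (assumptions, not claims about any real system):
for n_gpus in (1_000, 10_000, 100_000):
    print(f"{n_gpus:>7} GPUs -> {zero3_memory_per_gpu_gb(100e12, n_gpus):,.0f} GB/GPU for 100T params")
```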
Yeah, I do remember NVIDIA claiming they could do 100T param models by 2023. Not a quadrillion though IIRC.
However, (a) this may be just classic overoptimistic bullshit marketing, and thus we should expect it to be off by a couple years, and (b) they may have been including Mixture-of-Experts (MoE) models, in which case 100T parameters is much less of a big deal. To my knowledge a 100T-parameter MoE model would be a lot cheaper (in terms of compute and thus money) to train than a 100T-parameter dense model like GPT, but also the performance would be significantly worse. If I’m wrong about this I’d love to hear why!
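For intuition about why an MoE parameter count is much less of a big deal: with top-1 routing (Switch-Transformer-style), each token only touches the shared layers plus a single expert, so training FLOPs track the active parameters, not the total. A minimal sketch under the same C ≈ 6 · N · D approximation; the expert count and shared-parameter fraction are illustrative assumptions:

```python
def dense_flops(params, tokens):
    # C ~ 6 * N * D: every parameter is touched by every token.
    return 6 * params * tokens

def moe_flops(total_params, tokens, n_experts=128, shared_frac=0.01):
    """Top-1-routed MoE: each token uses the shared layers plus one expert.

    n_experts and shared_frac are illustrative assumptions, not measurements.
    """
    expert_params = total_params * (1 - shared_frac)
    active_params = total_params * shared_frac + expert_params / n_experts
    return 6 * active_params * tokens

tokens = 1e12  # illustrative token budget
dense = dense_flops(100e12, tokens)
moe = moe_flops(100e12, tokens)
print(f"dense 100T: {dense:.1e} FLOPs")
print(f"MoE   100T: {moe:.1e} FLOPs (~{dense / moe:.0f}x cheaper per token)")
```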
Given the timing of Jensen’s remarks about expecting trillion+ models and the subsequent MoEs of Switch & Wudao (1.2t) and embedding-heavy models like DLRM (12t), with dense models still stuck at GPT-3 scale, I’m now sure that he was referring to MoEs/embeddings, so a 100t MoE/embedding is both plausible and also not terribly interesting. (I’m sure Facebook would love to scale up DLRM another 10x and have embeddings for every SKU and Internet user and URL and video and book and song in the world, that sort of thing, but it will mean relatively little for AI capabilities or risk.) After all, he never said they were dense models, and the source in question is marketing, which can be assumed to accentuate the positive.
More generally, it is well past time to drop discussion of parameters and switch to compute-only, as we can create models with more parameters than we can train (you can fit a 100t-param model with ZeRO into your cluster? great! how you gonna train it? Just leave it running for the next decade or two?) and we have no shortage of Internet data either: compute, compute, compute! It’ll only get worse if some new architecture with fast weights comes into play, and we have to start counting runtime-generated parameters as ‘parameters’ too. (e.g. Schmidhuber back in like the ’00s showed off archs which used… Fourier transforms? to have thousands of weights generate hundreds of thousands of weights or something. Think stuff like hypernetworks. ‘Parameter’ will mean even less than it does now.)
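The “you can fit it, but how do you train it?” point is easy to check with the same C ≈ 6 · N · D approximation. A minimal sketch; the cluster size, per-GPU throughput, and utilization are illustrative assumptions, not figures from this thread:

```python
def training_days(params, tokens, n_gpus, flops_per_gpu=3e14, utilization=0.3):
    """Wall-clock estimate for a dense model: (6 * N * D) / effective cluster FLOP/s.

    flops_per_gpu (~300 TFLOP/s mixed precision, A100-class) and utilization
    are assumptions.
    """
    total_flops = 6 * params * tokens
    effective_flops_per_sec = n_gpus * flops_per_gpu * utilization
    return total_flops / effective_flops_per_sec / 86_400

# A 100T-param dense model on a (hypothetical) 10,000-GPU cluster,
# trained on a modest 1T tokens:
print(f"{training_days(100e12, 1e12, 10_000):,.0f} days")  # ~7,700 days, i.e. a couple of decades
```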
Wouldn’t that imply that the trajectory of AI is heavily dependent on how long Moore’s Law lasts, and how well quantum computers do?
Is your model that the jump to GPT-3 scale consumed the hardware overhang, and that we cannot expect meaningful progress on the same time scale in the near future?
Moore: yes.
QC: AFAIK it’s irrelevant?
GPT-3: it used up a particular kind of overhang you might call the “small-scale industrial CS R&D budget hardware overhang”. (It would certainly be possible to make much-greater-than-GPT-3-level progress, but you’d need vastly larger budgets: say, 10% of a failed erectile-dysfunction drug candidate, or 0.1% of the money it takes to run a failed European fusion reactor or particle collider.) So, I continue to stand by my scaling-hypothesis essay’s paradigm: as expected, we saw some imitation and catch-up, but no one created a model much bigger than GPT-3, never mind one >100x bigger the way GPT-3 was relative to GPT-2-1.5b, because no one at the relevant corporations truly believes in scaling, or wishes to commit the necessary resources, or feels that we’re near a crunchtime where there might be a rush to train a model at the edge of the possible; and OA itself has been resting on its laurels as it turns into a SaaS startup. (We’ll see what the Anthropic refugees choose to do with their $124m seed capital, but so far they appear to be making a relaxed start of it as well.)
The overhang GPT-3 used up should not be confused with other overhangs. There are many other hardware overhangs of interest: the hardware overhang of the experience curve where the cost halves every year or two; the hardware overhang of a distilled/compressed/sparsified model; the hardware overhang of the global compute infrastructure available to a rogue agent. The small-scale industrial R&D overhang is the relevant and binding one… for now. But the others become relevant later on, under different circumstances, and many of them keep getting bigger.
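To put a rough number on the experience-curve overhang specifically, a minimal sketch assuming (purely for illustration) that compute per dollar doubles every 2 years:

```python
# Illustrative only: how much more compute a fixed budget buys if cost
# halves every `halving_years` years (an assumed experience-curve rate).
def compute_multiplier(years, halving_years=2.0):
    return 2 ** (years / halving_years)

for years in (2, 5, 10):
    print(f"after {years:>2} years: {compute_multiplier(years):,.0f}x more compute per dollar")
```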
Why would QC be irrelevant? Quantum systems don’t perform well on all tasks, but they generally work well for parallel tasks, right? And neural nets are largely parallel. QC isn’t to the point of being able to help yet, but especially if conventional computing becomes a serious bottleneck, it might become important over the next decade.
I think that the only known quantum speedup for relatively generic tasks is from Grover’s algorithm, which gives only a quadratic speedup. That might be significant some day, or not, depending on the cost of quantum hardware. When it comes to superpolynomial speed-ups, which tasks admit them is very much an active field of study, and as far as we know it’s only some very specialized tasks like integer factoring. A bunch of people are trying to apply QC to ML, but AFAIK it’s still anyone’s guess whether that will end up being significant.
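To see what “only a quadratic speedup” buys in practice: Grover search over N items needs on the order of √N oracle queries versus ~N classically. A minimal sketch with illustrative problem sizes:

```python
import math

# Grover's algorithm: ~sqrt(N) oracle queries for unstructured search over N items,
# versus ~N queries classically. Problem sizes here are illustrative.
for bits in (20, 40, 64):
    n = 2 ** bits
    classical = n
    grover = math.isqrt(n)
    print(f"N = 2^{bits}: classical ~{classical:.1e} queries, Grover ~{grover:.1e} queries")
```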
And some of the past QC claims for ML have not panned out. Like, I think there was a Quantum Monte Carlo claimed to be potentially useful for ML which could be done on cheaper QC archs, but then it turned out to be doable classically...? In any case, I have been reading about QCs all my life, and they have yet to become relevant to anything I care about; and I assume Scott Aaronson will alert us should they suddenly become relevant to AI/ML/DL, so the rest of us should go about our lives until that day.