One thing we have to account for is advances architecture even in a world where Moore’s law is dead, to what extent memory bandwidth is a constraint on model size, etc. You could rephrase this as how much of an “architecture overhang” exists. One frame to view this through is in era the of Moore’s law we sort of banked a lot of parallel architectural advances as we lacked a good use case for such things. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 years.
I don’t know how seriously to take the AI ASIC people, but they are claiming very large increases in capability, on the order of 100-1000x in the next 10 years, if this is a true this is a multiplier on top of increased investment. See this response from a panel including big-wigs at NVIDIA, Google, and Cerebras about projected capabilities: https://youtu.be/E__85F_vnmU?t=4016. On top of this, one has to account, too, for algorithmic advancement: https://openai.com/blog/ai-and-efficiency/
Another thing to note is though by parameter count, the largest modern models are 10000x smaller than the human brain, if one buys that parameter >= synapse idea (which most don’t but is not entirely off the table), the temporal resolution is far higher. So once we get human-sized models, they may be trained almost comically faster than human minds are. So on top an architecture overhang we may have this “temporal resolution overhang”, too, where once models are as powerful as the human brain they will almost certainly be trained much faster. And on top of this there is an “inference overhang” where because inference is much, much cheaper than training, once you are done training an economically useful model, you will almost tautologically have a lot of compute to exploit it with.
Hopefully I am just being paranoid (I am definitely more of a squib than a wizard in these domains), but I am seeing overhangs everywhere!
As an aside, though it’s not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I’m misunderstanding something.
The GPT architecture isn’t even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT’s context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model) at the few-shot meta-learning task he tried.
Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.
(The problem there is the tradeoff between compute efficiency of training and better architectures. It’s not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drastically parameter-inefficient—but efficient to train on GPUs! - mixture-of-expert small Transformers approach.)
Moore’s Law might die in the short future, but I’ve yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero’s NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it’s probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it’s much harder to say they will all fail.
Your estimates of hardware advancement seem higher than most people’s. I’ve enjoyed your comments on such things and think there should be a high-level, full length post on them, especially with widely respected posts claiming much longer times until human-level hardware.Would be willing to subsidize such a thing if you are interested. Would pay 500 USD to yourself or a charity of your choice for a post on the potential of ASICS, Moore’s law, how quickly we can overcome the memory bandwidth bottlenecks and such things. Would also subsidize a post estimating an answer this question, too: https://www.lesswrong.com/posts/7htxRA4TkHERiuPYK/parameter-vs-synapse
Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
One thing we have to account for is advances architecture even in a world where Moore’s law is dead, to what extent memory bandwidth is a constraint on model size, etc. You could rephrase this as how much of an “architecture overhang” exists. One frame to view this through is in era the of Moore’s law we sort of banked a lot of parallel architectural advances as we lacked a good use case for such things. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 years.
I don’t know how seriously to take the AI ASIC people, but they are claiming very large increases in capability, on the order of 100-1000x in the next 10 years, if this is a true this is a multiplier on top of increased investment. See this response from a panel including big-wigs at NVIDIA, Google, and Cerebras about projected capabilities: https://youtu.be/E__85F_vnmU?t=4016. On top of this, one has to account, too, for algorithmic advancement: https://openai.com/blog/ai-and-efficiency/
Another thing to note is though by parameter count, the largest modern models are 10000x smaller than the human brain, if one buys that parameter >= synapse idea (which most don’t but is not entirely off the table), the temporal resolution is far higher. So once we get human-sized models, they may be trained almost comically faster than human minds are. So on top an architecture overhang we may have this “temporal resolution overhang”, too, where once models are as powerful as the human brain they will almost certainly be trained much faster. And on top of this there is an “inference overhang” where because inference is much, much cheaper than training, once you are done training an economically useful model, you will almost tautologically have a lot of compute to exploit it with.
Hopefully I am just being paranoid (I am definitely more of a squib than a wizard in these domains), but I am seeing overhangs everywhere!
The GPT architecture isn’t even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT’s context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model) at the few-shot meta-learning task he tried.
Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.
(The problem there is the tradeoff between compute efficiency of training and better architectures. It’s not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drastically parameter-inefficient—but efficient to train on GPUs! - mixture-of-expert small Transformers approach.)
Moore’s Law is not dead. I could rant about the market dynamics that made people think otherwise, but it’s easier just to point to the data.
https://docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrhppW7PT6GCfnrVGhxhLA5PVw
Moore’s Law might die in the short future, but I’ve yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero’s NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it’s probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it’s much harder to say they will all fail.
Your estimates of hardware advancement seem higher than most people’s. I’ve enjoyed your comments on such things and think there should be a high-level, full length post on them, especially with widely respected posts claiming much longer times until human-level hardware.Would be willing to subsidize such a thing if you are interested. Would pay 500 USD to yourself or a charity of your choice for a post on the potential of ASICS, Moore’s law, how quickly we can overcome the memory bandwidth bottlenecks and such things. Would also subsidize a post estimating an answer this question, too: https://www.lesswrong.com/posts/7htxRA4TkHERiuPYK/parameter-vs-synapse
There’s a lot worth saying on these topics, I’ll give it a go.
Just posting in case you did not get my PM. It has my email in it.
Thanks, I did get the PM.
Was this ever posted?
Now posted: https://www.lesswrong.com/posts/aNAFrGbzXddQBMDqh/moore-s-law-ai-and-the-pace-of-progress
No, sorry.
Might be worth getting around to it:
https://semianalysis.com/tesla-dojo-ai-super-computer-unique-packaging-and-chip-design-allow-an-order-magnitude-advantage-over-competing-ai-hardware/
https://spectrum.ieee.org/cerebras-ai-computers https://www.servethehome.com/cerebras-wafer-scale-engine-2-wse-2-at-hot-chips-33/
‘“From talking to OpenAI, GPT-4 will be about 100 trillion parameters,” Feldman says. “That won’t be ready for several years.”’
Now posted: https://www.lesswrong.com/posts/aNAFrGbzXddQBMDqh/moore-s-law-ai-and-the-pace-of-progress
Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.