Moore’s Law might die in the short future, but I’ve yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero’s NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it’s probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it’s much harder to say they will all fail.
Your estimates of hardware advancement seem higher than most people’s. I’ve enjoyed your comments on such things and think there should be a high-level, full length post on them, especially with widely respected posts claiming much longer times until human-level hardware.Would be willing to subsidize such a thing if you are interested. Would pay 500 USD to yourself or a charity of your choice for a post on the potential of ASICS, Moore’s law, how quickly we can overcome the memory bandwidth bottlenecks and such things. Would also subsidize a post estimating an answer this question, too: https://www.lesswrong.com/posts/7htxRA4TkHERiuPYK/parameter-vs-synapse
Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
Moore’s Law is not dead. I could rant about the market dynamics that made people think otherwise, but it’s easier just to point to the data.
https://docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrhppW7PT6GCfnrVGhxhLA5PVw
Moore’s Law might die in the short future, but I’ve yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero’s NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it’s probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it’s much harder to say they will all fail.
Your estimates of hardware advancement seem higher than most people’s. I’ve enjoyed your comments on such things and think there should be a high-level, full length post on them, especially with widely respected posts claiming much longer times until human-level hardware.Would be willing to subsidize such a thing if you are interested. Would pay 500 USD to yourself or a charity of your choice for a post on the potential of ASICS, Moore’s law, how quickly we can overcome the memory bandwidth bottlenecks and such things. Would also subsidize a post estimating an answer this question, too: https://www.lesswrong.com/posts/7htxRA4TkHERiuPYK/parameter-vs-synapse
There’s a lot worth saying on these topics, I’ll give it a go.
Just posting in case you did not get my PM. It has my email in it.
Thanks, I did get the PM.
Was this ever posted?
Now posted: https://www.lesswrong.com/posts/aNAFrGbzXddQBMDqh/moore-s-law-ai-and-the-pace-of-progress
No, sorry.
Might be worth getting around to it:
https://semianalysis.com/tesla-dojo-ai-super-computer-unique-packaging-and-chip-design-allow-an-order-magnitude-advantage-over-competing-ai-hardware/
https://spectrum.ieee.org/cerebras-ai-computers https://www.servethehome.com/cerebras-wafer-scale-engine-2-wse-2-at-hot-chips-33/
‘“From talking to OpenAI, GPT-4 will be about 100 trillion parameters,” Feldman says. “That won’t be ready for several years.”’
Now posted: https://www.lesswrong.com/posts/aNAFrGbzXddQBMDqh/moore-s-law-ai-and-the-pace-of-progress
Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.