I didn’t read Eliezer as suggesting a single GPU burn and then the nanobots all, I dunno, fry themselves and never exist again. More as a persistent thing. And burning all GPUs persistently does seem quite pivotal: maybe if the AGI confined itself to solely that and never did anything again, eventually someone would accumulate enough CPUs and spend so much money as to create a new AGI using only hardware which doesn’t violate the first AGI’s definition of ‘GPU’ (presumably they know about the loophole, otherwise who would ever even try?), but that will take a long time and is approaching angels-on-pinheads sorts of specificity. (If a ‘pivotal act’ needs to guarantee safety until the sun goes red giant in a billion years, this may be too stringent a definition to be of any use. We don’t demand that sort of solution for anything else.)
While CPUs are clearly much worse for AI than GPUs, they, and AI algorithms, should keep improving over time.
CPUs are improving slowly, and are fundamentally unsuited to DL right now, so I’m doubtful that waiting a decade is going to give us amazing CPUs which can do DL at the level of, say, an Nvidia H100 (itself potentially still very underpowered compared to the GPUs you’d need for AGI).
By AI algorithm progress, I assume you mean something like the Hernandez progress law?
It’s worth pointing out that the Hernandez experience curve is still pretty slow compared to the CPU vs GPU gap. A GPU is like 20x better, and Hernandez is a halving of cost every 16 months due to hardware+software improvement; even at face value, you’d need at least 5 halvings to catch up, taking at least half a decade. Worse, ‘hardware’ here means ‘GPU’, of course, so Hernandez is an overestimate of a hypothetical ‘CPU’ curve, so you’re talking more like decades. Actually, it’s worse than that, because ‘software’ here means ‘all of the accelerated R&D enabled by GPUs being so awesome and letting us try out lots of things by trial-and-error’; experience curves are actually caused by the number of cumulative ‘units’, and not by mere passage of time (progress doesn’t just drop out of the sky, people have to do stuff), so if you slow down the number of NNs which can be trained (because you can only use 20x worse CPUs), it takes far longer to train twice as many NNs as trained cumulatively to date. (And if the CPUs are being improved to train NNs, then you might have a synergistic slowdown on top of that, because you don’t know what to optimize your new CPUs for when the old CPUs are still sluggishly cranking along running your experimental NNs.) So, even with zero interference or regulation other than not being able to use GPUs, progress will slam abruptly to a crawl compared to what you’re used to now. (One reason I think Chinese DL may be badly handicapped as time passes: they can window-shop Western DL on arXiv, certainly, which can be useful, but not gain the necessary tacit practical knowledge to exploit it fully or do anything novel & important.)
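To put rough numbers on it (a back-of-envelope sketch; the 20x gap and 16-month halving are the same illustrative figures as above, nothing measured):

```python
# Back-of-envelope only: how long a Hernandez-style halving would take to close
# a hypothetical 20x CPU-vs-GPU efficiency gap, assuming (generously) that the
# ~16-month halving rate still held without GPUs.
import math

gap = 20.0            # assumed CPU vs GPU efficiency gap
halving_months = 16   # Hernandez et al.: training cost halves roughly every 16 months

halvings_needed = math.log2(gap)            # ~4.3, so call it 5 halvings
months = halvings_needed * halving_months   # ~69 months
print(f"halvings needed: {halvings_needed:.1f}")
print(f"time to close gap: {months:.0f} months (~{months / 12:.1f} years)")
# ~4.3 halvings, ~69 months: the better part of 6 years even at face value,
# before discounting the halving rate for losing all the GPU-driven R&D.
```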
Finally, it may be halving up to now, but there is presumably (just like DL scaling laws) some ‘irreducible loss’ or asymptote. After all, no matter how many pickup trucks Ford manufactures, you don’t expect the cost of a truck to hit $1; no matter how clever people are, presumably there’s always going to be some minimum number of computations it takes to train a good ImageNet classifier. It may be that while progress never technically stops, it simply asymptotes at a cost so high that no one will ever try to pay it. Who’s going to pay for the chip fabs, which double in cost every generation? Who’s going to risk paying for the chip fabs, for that matter? It’s a discrete thing, and the ratchet may just stop turning. (This is also a problem for the experience curve itself: you might just hit a point where no one makes another unit, because they don’t want to, or they are copying previously-trained models. No additional units, no progress along the experience curve. And then you have ‘bitrot’… Technologies can be uninvented if no one knows how to make them anymore.)
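As a toy illustration of that floor (all numbers invented for the example, not fit to anything):

```python
# Toy experience curve with an irreducible floor.
# Wright's-law form: cost falls as a power of cumulative units produced,
# but with a floor the cost asymptotes no matter how many units get made.
def unit_cost(cumulative_units, first_unit_cost=100.0, floor=5.0, exponent=0.32):
    # exponent ~0.32 corresponds to roughly a 20% cost reduction per doubling
    return floor + (first_unit_cost - floor) * cumulative_units ** (-exponent)

for n in [1, 10, 100, 1_000, 10_000, 1_000_000]:
    print(f"{n:>9,} units -> cost {unit_cost(n):6.2f}")
# The cost creeps toward the floor of 5 and then effectively stops moving; and
# since progress is driven by cumulative units, if nobody makes another unit,
# the curve doesn't move at all.
```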
I’m in the process of reading this “Sparsity in Deep Learning” paper, and it does seem to me that you can train neural networks sparsely. You’d do that by starting small, then increasing the network size during training by some methodology, followed by sparsification again (over and over).
I don’t think that works. (Not immediately finding anything in that PDF about training small models up to large in a purely sparse/CPU-friendly manner.) And every time you increase, you’re back in the dense regime where GPUs win. (Note that even MoEs are basically just ways to orchestrate a bunch of dense models, ideally one per node.) What you need is some really fine-grained sparsity with complex control flow and many, many zeros, where CPUs can compete with GPUs. I don’t deny that there is probably some way to train models this way, but past efforts have not been successful, and it’s not looking good for the foreseeable future either. Dense models, like vanilla Transformers, turn out to be really good at making GPUs go brrrr, and that turns out to usually be the most important property of an architecture.
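To gesture at why dense wins there, a crude arithmetic-intensity sketch (illustrative numbers, no particular chip measured):

```python
# Dense matmuls reuse every weight across the whole batch, so they do lots of
# FLOPs per byte moved and keep GPU tensor cores busy; unstructured fine-grained
# sparsity degenerates into irregular gathers with almost no reuse.

def dense_flops_per_byte(batch, d_in, d_out, bytes_per_elem=2):
    flops = 2 * batch * d_in * d_out  # one multiply-add per (batch, in, out) triple
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved

def sparse_flops_per_byte(nnz, bytes_per_elem=2, index_bytes=4):
    # ~2 FLOPs per stored nonzero, but each nonzero drags along its value,
    # an index, and a gathered activation.
    flops = 2 * nnz
    bytes_moved = nnz * (bytes_per_elem + index_bytes + bytes_per_elem)
    return flops / bytes_moved

print(f"dense 4096x4096 layer, batch 2048: {dense_flops_per_byte(2048, 4096, 4096):.0f} FLOPs/byte")
print(f"unstructured sparse (per nonzero): {sparse_flops_per_byte(10**6):.2f} FLOPs/byte")
# Dense comes out around ~1000 FLOPs/byte (compute-bound, what GPUs are built for);
# the unstructured-sparse pattern sits near 0.25 FLOPs/byte (memory/latency-bound),
# i.e. the regime where CPUs stop being hopeless.
```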