My main point is twofold (I’ll just write GPU when I mean GPU / AI accelerator):
1. Destroying all GPUs is a stalling tactic, not a winning strategy. While CPUs are clearly much worse for AI than GPUs, they, and AI algorithms, should keep improving over time. State-of-the-art models from less than ten years ago can be run on CPUs today, with little loss in accuracy. If this trend continues, GPUs vs CPUs seems to be of only short-term importance. Regarding your point about having to train a dense net on GPUs before sparsification, I’m not sure that that’s the case. I’m in the process of reading this “Sparsity in Deep Learning”-paper, and it does seem to me that you can train neural networks sparsely. You’d do that by starting small, then during training increasing the network size by some methodology, followed by sparsification again (over and over). (I sketch the kind of grow-then-prune loop I have in mind right after this list.) I don’t have super high confidence about this (and have Covid, so am too tired to look it up), but I believe that AGI-armageddon by CPU is at least in the realm of possibility (assuming no GPUs - it’s the “cancer gets you if a heart attack doesn’t kill you first” of AGI Doom).
2. It doesn’t matter anyway, because destroying all GPUs is not really that pivotal an act (in the long-term, AI-safety sense). Either you keep an AI around that enforces the “no GPU” rule, or you destroy the GPUs once and wait. The former either means the enforcing AI itself doesn’t need GPUs (in which case GPUs evidently don’t matter for AGI, so why bother), or it runs on GPUs (so GPUs still exist, which seems contradictory). The latter means that more GPUs will be built in time and you will find yourself in the same position as before, except that you are likely in prison or dead, and so not in a position to do anything about AGI this time. After all, destroying all GPUs in the world would not be something that most people would look upon kindly. And since intelligent beings will keep building GPUs, a super-intelligent GPU-minimizer would realize that its goal would best be served by wiping out all intelligent life on Earth (or all life, or maybe all intelligent life in the Universe...).
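Here is the rough sketch I promised above. It is my own toy construction, not something taken from the “Sparsity in Deep Learning” paper; the layer sizes, growth schedule, and prune fraction are made-up illustrative numbers, and the point is only to show the shape of a grow-then-prune loop:

```python
# Toy grow-then-prune loop (hypothetical illustration, not from the paper):
# train a small net, periodically widen a hidden layer, then mask out the
# smallest-magnitude weights so most of the network stays sparse.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def widen_linear(layer: nn.Linear, extra: int) -> nn.Linear:
    """Return a wider copy of `layer` with `extra` more output units, keeping old weights."""
    new = nn.Linear(layer.in_features, layer.out_features + extra)
    with torch.no_grad():
        new.weight[: layer.out_features] = layer.weight
        new.bias[: layer.out_features] = layer.bias
    return new

torch.manual_seed(0)
hidden = nn.Linear(16, 8)                          # start small
head = nn.Linear(8, 1)
x, y = torch.randn(256, 16), torch.randn(256, 1)   # toy regression data

for cycle in range(3):
    model = nn.Sequential(hidden, nn.ReLU(), head)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(200):                           # short training phase
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    # grow: widen the hidden layer; the head has to grow with it
    hidden = widen_linear(hidden, extra=8)
    old_head = head
    head = nn.Linear(hidden.out_features, 1)
    with torch.no_grad():
        head.weight[:, : old_head.in_features] = old_head.weight
        head.bias.copy_(old_head.bias)
    # sparsify: mask out the smallest-magnitude weights again
    prune.l1_unstructured(hidden, name="weight", amount=0.8)
    print(f"cycle {cycle}: hidden width {hidden.out_features}, loss {loss.item():.3f}")
```

Whether anything like this actually stays CPU-friendly in practice is exactly the part I’m unsure about.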
In some sense, the comment was a way for me to internally make plausible the claim that destroying all GPUs in the world is not an alignable act.
I didn’t read Eliezer as suggesting a single GPU burn and then the nanobots all, I dunno, fry themselves and never exist again. More as a persistent thing. And burning all GPUs persistently does seem quite pivotal: maybe if the AGI confined itself solely to that and never did anything again, eventually someone would accumulate enough CPUs and spend so much money as to create a new AGI using only hardware which doesn’t violate the first AGI’s definition of ‘GPU’ (presumably they know about the loophole, otherwise who would ever even try?), but that will take a long time and is approaching angels-on-pinheads sorts of specificity. (If a ‘pivotal act’ needs to guarantee safety until the sun goes red giant in a billion years, this may be too stringent a definition to be of any use. We don’t demand that sort of solution for anything else.)
While CPUs are clearly much worse for AI than GPUs, they, and AI algorithms, should keep improving over time.
CPUs are improving slowly, and are fundamentally unsuited to DL right now, so I’m doubtful that waiting a decade is going to give us amazing CPUs which can do DL at the level of, say, an Nvidia H100 (itself potentially still very underpowered compared to the GPUs you’d need for AGI).
By AI algorithm progress, I assume you mean something like the Hernandez progress law?
It’s worth pointing out that the Hernandez experience curve is still pretty slow compared to the GPU vs CPU gap. A GPU is like 20x better, and Hernandez is a halving of cost every 16 months due to hardware+software improvement; even at face value, you’d need at least 5 halvings to catch up, taking at least half a decade. Worse, ‘hardware’ here means ‘GPU’, of course, so Hernandez is an overestimate of a hypothetical ‘CPU’ curve, and you’re talking more like decades. Actually, it’s worse than that, because ‘software’ here means ‘all of the accelerated R&D enabled by GPUs being so awesome and letting us try out lots of things by trial-and-error’; experience curves are actually caused by the number of cumulative ‘units’, and not by mere passage of time (progress doesn’t just drop out of the sky, people have to do stuff), so if you slow down the number of NNs which can be trained (because you can only use 20x-worse CPUs), it takes far longer to train twice as many NNs as trained cumulatively to date. (And if the CPUs are being improved to train NNs, then you might have a synergistic slowdown on top of that, because you don’t know what to optimize your new CPUs for when the old CPUs are still sluggishly cranking along running your experimental NNs.) So, even with zero interference or regulation other than not being able to use GPUs, progress will slam abruptly to a crawl compared to what you’re used to now. (One reason I think Chinese DL may be badly handicapped as time passes: they can window-shop Western DL on arXiv, certainly, which can be useful, but not gain the necessary tacit practical knowledge to exploit it fully or do anything novel & important.)
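To spell out that face-value arithmetic (taking the rough 20x gap and 16-month halving figures above as given):

```python
# Back-of-the-envelope: how many 16-month cost-halvings to close an assumed
# 20x GPU-over-CPU gap, taking both rough figures at face value?
import math

gap = 20.0            # assumed GPU advantage over CPU
halving_months = 16   # Hernandez et al.'s hardware+software halving time
halvings = math.log2(gap)            # ~4.3, i.e. 5 full halvings
months = halvings * halving_months   # ~69 months
print(f"{halvings:.1f} halvings ~ {months / 12:.1f} years")
```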
Finally, it may be halving up to now, but there is presumably (just like DL scaling laws) some ‘irreducible loss’ or asymptote. After all, no matter how many pickup trucks Ford manufactures, you don’t expect the cost of a truck to hit $1; no matter how clever people are, presumably there’s always going to be some minimum number of computations it takes to train a good ImageNet classifier. It may be that while progress never technically stops, it simply asymptotes at a cost so high that no one will ever try to pay it. Who’s going to pay for the chip fabs, which double in cost every generation? Who’s going to risk paying for the chip fabs, for that matter? It’s a discrete thing, and the ratchet may just stop turning. (This is also a problem for the experience curve itself: you might just hit a point where no one makes another unit, because they don’t want to, or because they are copying previously-trained models. No additional units, no progress along the experience curve. And then you have ‘bitrot’… Technologies can be uninvented once no one remembers how to make them.)
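As a toy illustration of what a floor does to an experience curve (a Wright’s-law-style form with an added asymptote; every number here is invented purely for illustration, not fitted to anything):

```python
# Toy experience curve with an irreducible floor: unit cost falls with
# *cumulative units produced* but asymptotes at c_floor, not zero.
def unit_cost(cumulative_units: float, c0: float = 100.0,
              c_floor: float = 5.0, alpha: float = 0.32) -> float:
    # alpha ~0.32 corresponds to roughly a 20% cost drop per doubling of units
    return c_floor + (c0 - c_floor) * cumulative_units ** (-alpha)

for n in (1, 10, 100, 1_000, 1_000_000):
    print(f"{n:>9,} units: ${unit_cost(n):6.2f}")
```

The cost keeps falling forever in this toy model, but after a while each extra order of magnitude of cumulative units buys almost nothing, which is the “ratchet stops turning” worry in miniature.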
I’m in the process of reading this “Sparsity in Deep Learning”-paper, and it does seem to me that you can train neural networks sparsely. You’d do that by starting small, then during training increasing the network size by some methodology, followed by sparsification again (over and over).
I don’t think that works. (Not immediately finding anything in that PDF about training small models up to large in a purely sparse/CPU-friendly manner.) And every time you increase the size, you’re back in the dense regime where GPUs win. (Note that even MoEs are basically just ways to orchestrate a bunch of dense models, ideally one per node.) What you need is some really fine-grained sparsity with complex control flow and many, many zeros, where CPUs can compete with GPUs. I don’t deny that there is probably some way to train models this way, but past efforts have not been successful, and it’s not looking good for the foreseeable future either. Dense models, like vanilla Transformers, turn out to be really good at making GPUs go brrr, and that usually turns out to be the most important property of an arch.
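To give a concrete feel for the “many, many zeros” point, here is a toy CPU benchmark of my own devising (SciPy unstructured-sparse matmul vs. dense BLAS; the sizes, densities, and exact crossover are arbitrary and machine-dependent, so treat it as a sketch, not a measurement):

```python
# Toy CPU benchmark: dense matmul vs. unstructured-sparse matmul at various
# densities. The only point: sparse kernels need very high sparsity to win.
import time
import numpy as np
import scipy.sparse as sp

n = 2000
rng = np.random.default_rng(0)
dense = rng.standard_normal((n, n))
x = rng.standard_normal((n, 64))

def bench(fn, reps=5):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

print(f"dense        : {bench(lambda: dense @ x) * 1e3:6.1f} ms")
for density in (0.5, 0.1, 0.01):
    sparse = sp.csr_matrix(dense * (rng.random((n, n)) < density))
    print(f"{density:4.0%} nonzero : {bench(lambda: sparse @ x) * 1e3:6.1f} ms")
```

And even where unstructured sparsity does win on raw FLOPs, it still lacks the regular structure that training procedures and hardware pipelines currently exploit.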
Yeah, I was kind of rambling, sorry.