I think people underestimate the degree to which hardware improvements enable software improvements. If you look at AlphaGo, the DeepMind team tried something like 17 different configurations during training runs before finally getting something to work. If each one of those had been twice as expensive, they might not have even conducted the experiment.
I do think it’s true that if we wait long enough, hardware restrictions will not be enough.
Yes, this is how everyone misinterprets these sorts of time-travel results. They do not tell you the causal role of ‘compute got cheaper’ vs ‘we thought real gud’. In reality, if we held FLOPs/$ constant, no matter how clever you were, most of those clever algorithmic innovations would not have happened*: either because compute shortages herd people into hand-engineering constant-factor tweaks which don’t actually work / have no lasting value (Bitter Lesson effect), because people would not have done the trial-and-error which actually lies (if you will) behind the pretty stories they tell in the paper writeup, or because they would have proposed it with such unconvincing evaluations at such small scale that it simply becomes yet another unread arXiv paper (cue the Raiders of the Lost Ark final scene). When you use methods to try to back out the causal impact of compute increases on productivity, you get >50%, which must be the case: the time-travel method implicitly assumes that the only effect of hardware is to speed up the final version of the algorithm, which is an absurd assumption, since everyone will admit that faster computers must help algorithm research at least a little; so the time-travel 50%s are only a loose lower bound.
This is probably part of why people don’t think compute restrictions would work as well as they actually would: they misread these essentially irrelevant & correlational results as causal marginal effects and think “well, a 50% slowdown isn’t terribly useful, and anyway, wouldn’t people just do the research anyway with relatively small runs? so this shouldn’t be a big priority”. (This model is obviously wrong as, among other things, it predicts that the deep learning revolution would have started long before it did, instead of waiting for ultra-cheap GPUs within grad-student budgets: small-scale runs imply small-scale budgets, and are a sign that compute is exorbitantly expensive rather than so ample that only small amounts are needed.)
What they do tell you is a useful forecasting thing along the lines of ‘the necessary IQ to destroy the world drops by 1 point every 18 months’. It’s plenty useful to know that, given current trends, the hardware cost to run system X is going to predictably drop by a factor of Y every Z months. This is important for things like gauging hardware overhangs, or modeling possibilities for future models being deployed onto commodity hardware, etc. It’s just not that useful for controlling progress, as opposed to forecasting progress in the absence of any causal manipulation of the factors going into those trends.
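(As a rough illustration of that kind of trend extrapolation, here is a minimal sketch; the 2×-per-18-months decline and the $10M starting cost are made-up placeholder numbers, not figures from this thread.)

```python
def projected_cost(initial_cost, months_elapsed, factor=2.0, period_months=18):
    """Hardware cost to run a fixed system, assuming it falls by `factor`
    every `period_months` (hypothetical trend parameters)."""
    return initial_cost / factor ** (months_elapsed / period_months)

# A run costing $10M today, under a 2x-per-18-months cost decline:
for years in (1, 3, 5):
    print(f"after {years}y: ~${projected_cost(10e6, years * 12):,.0f}")
```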
* Like, imagine if ‘50% slowdown’ were accurate and computers had frozen ~2011, so you were still using the brand-new Nvidia GTX 590 with all of 3GB RAM, and we were only halfway through the DL revolution to date: ~6 years of progress instead of our timeline’s 12 years since 2011. You really think that all the progress we made with fleets of later GPUs like V100s or A100s or H100s would happen, just half as fast? That we would be getting AlphaGo Master crushing all the human pros undefeated right about now? That a decade later OA would unveil GPT-4 (somehow) trained on its fleet of 2011 GPUs? And so on and so forth for all future systems like GPT-5 and whatnot?
Note that this probably doesn’t change the story much for GPU restrictions, though. For purposes of software improvements, one needs compute for lots of relatively small runs rather than one relatively big run, and lots of relatively small runs are exactly what GPU restrictions (as typically envisioned) would not block.
Couldn’t GPU restrictions still make them more expensive? Like, let’s say that tomorrow we impose a tax on all new hardware that can be used to train neural networks, such that any improvements in performance are cancelled out by additional taxes. Wouldn’t that also slow down or even stop the growth of smaller training runs?
That would, and in general restrictions aimed at increasing price/reducing supply could work, though that doesn’t describe most GPU restriction proposals I’ve heard.
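(For concreteness, here is one way such a performance-indexed tax could be computed, as a minimal sketch: the names, numbers, and the simple proportional rule are all hypothetical, not anything proposed above. The idea is just that the post-tax price per unit of training performance never falls below a fixed baseline, so faster hardware stops making training runs cheaper.)

```python
def performance_indexed_tax(market_price, perf, baseline_price_per_perf):
    """Tax required so that (market_price + tax) / perf never drops below
    the baseline $/performance level, i.e. hardware improvements no longer
    reduce the cost of a training run. Purely illustrative."""
    target_price = baseline_price_per_perf * perf
    return max(0.0, target_price - market_price)

# Example: baseline of $10 per unit of training throughput. A new card with
# 3,000 units of throughput sold at $15,000 would owe a $15,000 tax,
# bringing its effective price to $30,000.
print(performance_indexed_tax(market_price=15_000, perf=3_000,
                              baseline_price_per_perf=10.0))
```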