This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms.
Wouldn’t companies port their partially-trained models to new hardware? I guess the assumption here is that when more compute is available, actors will want to train larger models. I don’t think this is obviously true because:

1. Data may be the bigger bottleneck. There was some discussion of this here. Making models larger doesn’t help very much after a certain point compared with training them with more data.
2. If training runs are happening over months, there will be strong incentives to make use of previously trained models—especially in a world where people are racing to build AGI. This could look like anything from slapping on more layers to developing algorithms that expand the model in all relevant dimensions as it is being trained. Here’s a paper about progressive learning for vision transformers. I didn’t find anything for NLP, but I also haven’t looked very hard.
Not necessarily larger, but different. Presumably new hardware will have different performance characteristics than the old hardware (otherwise what’s the point?); it seems unlikely that future GPUs will simply be exactly like the old GPU but using half the electricity, say. (Even in that scenario, since electricity is such a major cost, why wouldn’t you then add more GPUs to your cluster to use up the new headroom?)
When we look at past changes like V100 to A100, or A100 to H100, they typically change the performance profile quite a bit: VRAM doubles or more, high-precision ops increase much less than low-precision ops, new numerical formats get native speed support, new specialized hardware like ‘tensor cores’ gets added, encouraging sparsity or reduced precision, interconnects speed up (but never enough)… All of these are going to change your ideal width-vs-depth scaling ratios, your Transformer head sizes or MoE expert sizes (trying to keep them on-GPU) or the size of your model components in general, your other hyperparameters like total batch size, and so on.
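To make the “performance profile” point concrete, here is a back-of-the-envelope sketch; every number in it is made up for illustration (the ~16 bytes/parameter of model-plus-optimizer state is a common rough figure for Adam-style mixed-precision training, and the 70B model and 40GB/80GB cards are hypothetical), but it shows how a memory bump alone shifts how many GPUs you need per model replica and how much activation headroom (and hence micro-batch size) each GPU has left:

```python
import math

def plan_sharding(vram_gb: float, params_billion: float,
                  state_bytes_per_param: float = 16.0) -> dict:
    """Roughly: how many GPUs are needed just to hold model + optimizer state,
    and how much VRAM is left per GPU for activations (which bounds micro-batch size).
    ~16 bytes/param is a rough figure for Adam-style mixed-precision training."""
    total_state_gb = params_billion * state_bytes_per_param  # 1e9 params x bytes / 1e9 bytes-per-GB
    usable_gb = vram_gb * 0.75        # reserve ~25% for activations, workspace, fragmentation
    gpus = max(1, math.ceil(total_state_gb / usable_gb))
    headroom = vram_gb - total_state_gb / gpus
    return {"gpus_just_for_state": gpus,
            "activation_headroom_gb_per_gpu": round(headroom, 1)}

# Hypothetical 70B-parameter model on an old 40GB card vs. a new 80GB card:
print(plan_sharding(vram_gb=40, params_billion=70))  # -> {'gpus_just_for_state': 38, ...}
print(plan_sharding(vram_gb=80, params_billion=70))  # -> {'gpus_just_for_state': 19, ...}
```

Halving the GPU count per replica also halves the inter-node traffic per step, which is exactly the kind of shift that makes the old run’s carefully tuned parallelism layout the wrong one for the new cluster.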
Changes like precision can require architecture-level adjustments such as more aggressive normalization or regularization (maybe your model will Just Work when you switch to mixed precision for the performance boost—or maybe it will keep exploding until you throw in more layer normalization to keep all the numbers small), or may just not work at all at present.
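For concreteness, the usual “keep mixed precision from exploding” toolkit looks roughly like this PyTorch sketch: dynamic loss scaling, some extra normalization, and gradient clipping. The model here is a stand-in, and none of this is a guarantee that any particular model will converge after the switch:

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Stand-in model; the extra LayerNorm is the kind of "keep the numbers small"
# change that sometimes has to accompany a precision switch.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(),
    nn.LayerNorm(4096),
    nn.Linear(4096, 1024),
).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()  # dynamic loss scaling: keeps small fp16 gradients from underflowing

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    opt.zero_grad(set_to_none=True)
    with autocast():                      # run the forward pass in reduced precision where safe
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()         # scale the loss up before backprop
    scaler.unscale_(opt)                  # unscale so clipping sees true gradient magnitudes
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)                      # silently skips the update if gradients overflowed
    scaler.update()
    return loss.item()
```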
You may be able to checkpoint your model and restart if a node crashes or if a minibatch diverges, but that’s no panacea: DL is non-convex, different runs will end up in different places, and the seeds of decay & self-sabotage may be planted too deep in a model to be fixed. In the BigGAN paper, mooch tried extensively rolling back BigGANs that diverged, but even resetting back thousands of iterations didn’t halt eventual divergence (we verified this the hard way, as hope is a cruel mistress); in the PaLM work, they found some minibatches just spike the loss, and it’s not due to the individual datapoints (they had bit-for-bit reproducibility—very impressive!—and could swap out the data to check), it just sorta happens. (Why? Dunno.) There is also the learning that happens during the course of training: most recently, people were very amused/depressed to read through the Facebook OPT training logs about all the bugs; similar stories are told by anyone working on GPT-J or GPT-NeoX-20B or HyperCLOVA, and by Anthropic and OA likewise. One particularly dramatic example I like is OA’s “Rerun” DotA 2 OA5 agent: they were editing the arch & hyperparameters (in addition to keeping up with game patches) the entire time, and so at the end, they ‘reran’ the training process from scratch rather than upgrading the same agent progressively:

Rerun required only 20% of the training for a 98% win-rate against the final version of OpenAI Five...The ideal option would be to run Rerun-like training from the very start, but this is impossible—the OpenAI Five curve represents lessons learned that led to the final codebase, environment, etc., without which it would not be possible to train Rerun.
Quite a difference.
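(To spell out what “checkpoint and roll back” means mechanically: it is simple, which is part of why its failure to rescue the diverging BigGAN runs is so striking. A minimal sketch follows; the spike threshold, the skip-ahead distance, and the inline training step are all my own arbitrary choices for illustration, not anyone’s published recipe.)

```python
import copy
import torch

def step(model, opt, batch):
    """One ordinary update; stands in for whatever your real training step does."""
    x, y = batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
    return loss.item()

def train_with_rollback(model, opt, batches, ckpt_every=1000, spike_factor=3.0):
    """Periodically snapshot the run; on a loss spike, restore the snapshot and skip
    ahead past the offending minibatches (roughly the PaLM-style mitigation)."""
    recent, snapshot, skip_until = [], None, -1
    for i, batch in enumerate(batches):
        if i < skip_until:
            continue                      # hop over the data near the spike and hope
        if i % ckpt_every == 0:
            snapshot = {"model": copy.deepcopy(model.state_dict()),
                        "opt": copy.deepcopy(opt.state_dict())}
        loss = step(model, opt, batch)
        recent = (recent + [loss])[-100:]
        baseline = sum(recent) / len(recent)
        if loss > spike_factor * baseline and snapshot is not None:
            model.load_state_dict(snapshot["model"])
            opt.load_state_dict(snapshot["opt"])
            skip_until = i + 200
```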
So, yes, you can ‘just copy over’ your big old model onto your shiny new cluster, but you are going to pay a price. The size of that price compared to starting from scratch will depend on how extensive the hardware+software changes are, how much coevolution is going on, how path-dependent large models turn out to be (very unclear, because few people train more than one), etc., but the price will be nonzero. Perhaps it is a small price worth paying: your utilization is lower because you now underuse each node’s VRAM, or you could have packed larger model-shards into each node, or you need to economize more on inter-node bandwidth. Or perhaps your mid-run upgrade of the optimizer permanently hobbled convergence in a way you haven’t noticed, such that no amount of further training will match the from-scratch version, and every FLOP is wasted because you will never achieve the target goals.
How bad will the price be? I guess that will depend on how much software and hardware innovation, and what sort, you expect. If you are looking forward to things like binary-weight nets (which are ultra-fast because they reduce to bit operations like XOR or popcount), you should expect to have to throw away all your prior models: they really do not like that sort of major change, whatever approach makes them work is not going to play nice with Ye Olde FP32 GPT-3 models, and even when you can convert successfully, you probably can’t train them much more. Trying to save compute by transferring old models is then just throwing good compute after bad. Whereas if you expect innovation to focus on datasets, and expect GPUs to remain pretty much as they are now with lots of FP16 multiplication (nothing crazy like ternary weights or pervasive sparsity or wacky approaches like HyperNEAT-style evolved topologies or Cerebras chips or spiking neural-network hardware), then probably you can plan to just continually train and upgrade a single Chinchilla model indefinitely, and the savings from better hyperparameter tuning etc. will be unimportant constant factors (a third, let’s say), which is not enough to justify retraining from scratch until you have some better reason to do so, like a new arch.
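To illustrate why binary-weight nets live on such different hardware affordances: a dot product between ±1 vectors packed into machine words collapses into an XNOR plus a popcount, with no multiplies anywhere, which is why a model co-designed around FP16 multiply-accumulate has little to carry over. A toy sketch:

```python
# Toy illustration: dot product of two +/-1 vectors via bit operations.
# Pack +1 as bit 1 and -1 as bit 0; then agreements minus disagreements is the dot product.

def pack_bits(signs):                       # signs: list of +1/-1
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    agreements = bin(~(a_word ^ b_word) & ((1 << n) - 1)).count("1")  # XNOR + popcount
    return 2 * agreements - n               # +1 per agreement, -1 per disagreement

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))
```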
Since the critical decision is whether to throw out the old model/arch/run, a big enough change on either the hardware or the software side can trigger a new-run decision, in which case you then pick up the gains from the other one as well. (That is, if some new hardware comes out and your old model is not well-suited to it, then when you start a fresh model, you’ll probably also roll in all the software improvements which have happened since eons ago, a year or two or so.) So there’s something of a double overhang: regular progress on both streams will lead to smoother capability gains, as people regularly start new models and eat up the gains on both; but if one stream stagnates, that will tend to lock in that generation of models, and one will want to delay a new model as long as possible, until the marginal return from software+hardware upgrades is so large it can pay for the fully-loaded training cost in one fell swoop. The average trend might be identical, as everyone continues to optimize on the margin, but the latter scenario seems like it would be much more jagged.
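One way to see the “pay for the fully-loaded training cost in one fell swoop” logic is as a crude break-even calculation. The function, its parameters, and every number below are invented for illustration (in particular the transfer_penalty fudge factor standing in for lower utilization on ported models):

```python
# Toy break-even: keep training the old model vs. retrain from scratch on the
# new hardware/software stack. All numbers are invented for illustration.

def should_retrain(old_cost_to_target: float,      # $ to hit the capability target by continuing the old run (old stack prices)
                   scratch_cost_old_stack: float,  # $ a from-scratch run would have cost on the old stack
                   hw_speedup: float,              # throughput multiplier from the new hardware
                   sw_efficiency_gain: float,      # compute multiplier saved by new archs/hyperparameters/data
                   transfer_penalty: float = 1.2   # overhead of porting the old model (low utilization, etc.)
                   ) -> bool:
    continue_cost = old_cost_to_target * transfer_penalty / hw_speedup
    retrain_cost = scratch_cost_old_stack / (hw_speedup * sw_efficiency_gain)
    return retrain_cost < continue_cost

# Stagnant hardware, modest software gains: keep nursing the old model along.
print(should_retrain(40e6, 100e6, hw_speedup=1.0, sw_efficiency_gain=1.5))  # False
# A big jump on both streams at once: the fresh run pays for itself.
print(should_retrain(40e6, 100e6, hw_speedup=3.0, sw_efficiency_gain=2.5))  # True
```

The jaggedness in the stagnation scenario falls out of this directly: the retrain side of the inequality only flips when the accumulated multipliers get large, so the flip, when it comes, is big.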
(A concrete example of this might be that Stable Diffusion is having such a moment right now in part because it benefits from high-end consumer GPUs, and those GPUs very abruptly became available recently at much closer to MSRP than they have been in years, so people who have been running old image-generation models on old GPUs like 1080 Tis are suddenly running Stable Diffusion on 3090s. I’m sure the FID/IS improvement curves aggregated across research papers are exactly as smooth as AI Impacts or Paul Christiano would assert they are, but from the perspective of, say, artists suddenly being smacked across the face with SD images everywhere almost literally overnight when the SD model leaked a week or two ago, it sure doesn’t feel smooth.)