Claude 3 Opus, Llama 3 405B, and Claude 3.5 Sonnet are clearly somewhat better than the original GPT-4, with maybe only 10x in FLOPs scaling at most since then. And there is at least 100x more to come within a few years. Planned datacenter buildup is 3 OOMs above the largest training runs for currently deployed models: Llama 3 405B was trained on 16K H100s, while Nvidia shipped 3.7M GPUs in 2023. When a major part of progress is waiting for the training clusters to get built, it’s too early to call the outcome while those clusters haven’t been built yet.
The key point is that a training run is not fundamentally constrained to a single datacenter or a single campus. Spanning multiple sites is more complicated, likely less efficient, and at the scale of currently deployed models unnecessary. Another popular concern is the data wall. But it seems that there is still enough data even for future training runs that span multiple datacenters; it just won’t allow making inference-efficient overtrained models using all that compute. Both points are based on conservative estimates that don’t assume algorithmic breakthroughs. Also, the current models are still trained quickly, while at the boundaries of technical feasibility it would make more sense to perform very long training runs.
For training across multiple datacenters, one way is continuing to push data parallelism with minibatching. Many instances of the model separately process their own samples; once they are done, the gradient updates are collected from all instances, the optimizer makes the next step, and the updated state is communicated back to all instances, starting the process again (see the sketch after the quote below). In Llama 3 405B, this seems to take about 6 seconds per minibatch, and there are about 1M minibatches overall. The Gemini 1.0 report states that
Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. … we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
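To make the synchronous step concrete, here’s a minimal NumPy sketch with the clusters simulated in-process; the cluster count, batch shapes, and loss are made up for illustration, and a real system would do the averaging with an all-reduce over the inter-datacenter network rather than in a loop.

```python
# Minimal sketch of one synchronous data-parallel step, with each "cluster"
# simulated in-process. Cluster count, shapes, and loss are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 4          # hypothetical number of participating datacenters
dim = 1024              # toy parameter count
lr = 1e-2

weights = rng.normal(size=dim).astype(np.float32)   # replicated on every cluster

def local_gradient(w, batch):
    # Stand-in for a forward/backward pass on one cluster's minibatch shard
    # (here: gradient of a least-squares loss against random targets).
    x, y = batch
    return x.T @ (x @ w - y) / len(y)

for step in range(3):
    # 1. Each cluster processes its own minibatch shard independently.
    grads = []
    for c in range(n_clusters):
        x = rng.normal(size=(256, dim)).astype(np.float32)
        y = rng.normal(size=256).astype(np.float32)
        grads.append(local_gradient(weights, (x, y)))
    # 2. Gradients are collected and averaged (the all-reduce over the WAN).
    avg_grad = np.mean(grads, axis=0)
    # 3. The optimizer steps once and the updated state is broadcast back,
    #    which is the inter-datacenter traffic discussed below.
    weights -= lr * avg_grad
```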
An update would need to communicate the weights, gradients, and some optimizer state, collected from all clusters involved in training. This could be 3-4 times more data than just the weights. But a modern fiber-optic cable can carry 30 Tbps per fiber pair (about 100 wavelength channels within a single fiber, each at about 400 Gbps), and a cable has many fiber pairs. The total capacity of the undersea cables one often hears about is on the order of 100 Tbps, but they typically carry only a few fiber pairs, while with 48 fiber pairs we can do 1.3 Pbps. Overland inter-datacenter cables can have even more fibers. So an inter-datacenter network with 100 Tbps dedicated to model training seems feasible to set up with some work. For a 10T parameter model in FP8, communicating 40TB of relevant data would then take only about 3 seconds, or 6 for a round trip, comparable to the time for processing a minibatch.
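A quick check of that transfer-time arithmetic, using only the figures above:

```python
# Back-of-the-envelope check of the inter-datacenter transfer time.
params = 10e12                 # 10T parameters
bytes_per_param = 1            # FP8
overhead = 4                   # weights + gradients + optimizer state, ~3-4x
payload_bits = params * bytes_per_param * overhead * 8   # ~320 Tb (~40 TB)

link_bps = 100e12              # 100 Tbps dedicated inter-datacenter bandwidth
one_way_s = payload_bits / link_bps
print(f"{payload_bits / 8 / 1e12:.0f} TB, one way: {one_way_s:.1f} s, "
      f"round trip: {2 * one_way_s:.1f} s")
# -> 40 TB, one way: 3.2 s, round trip: 6.4 s
```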
And clusters don’t necessarily have to sit doing nothing between minibatches. They could be multiplexing training of more than one model at a time (if all fit in memory at the same time), or using some asynchronous training black magic that makes downtime unnecessary. So it’s unclear if there are speed or efficiency losses at all, but even conservatively it seems that training would be at most 2 times slower, and correspondingly more expensive in the cost of time.
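As a toy illustration of both the conservative 2x figure and why multiplexing two runs can hide the gap (the 6-second compute and communication times are the rough numbers from above):

```python
# Toy schedule for the conservative estimate and the multiplexing trick.
compute_s = 6.0     # minibatch compute time per model (rough figure from above)
comm_s = 6.0        # round-trip synchronization time per model

# One model, fully synchronous and non-overlapped: the cluster idles during comm.
single_step = compute_s + comm_s
single_util = compute_s / single_step            # 0.5 -> "2 times slower"

# Two models interleaved: while model A synchronizes, model B computes.
# With comm_s <= compute_s the accelerators never wait.
interleaved_util = min(1.0, compute_s / max(compute_s, comm_s))

print(f"one model: {single_util:.0%} utilization, "
      f"two multiplexed models: {interleaved_util:.0%}")
# -> one model: 50% utilization, two multiplexed models: 100%
```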
For the data wall, the key points are the ability to repeat data and the optimal number of tokens per parameter (Chinchilla scaling laws). Recent measurements of tokens per parameter give about 40 (Llama 3, CARBS), up from 20 for Chinchilla, and with repeated data it goes to 50-60. The extrapolated value of tokens per parameter increases with FLOPs and might go up 50% over 3 OOMs, so I guess it could go as far as 80 tokens per parameter at 15 repetitions of data around 1e28 FLOPs. With a 50T token dataset, that’s enough to train with 9T active parameters, or use 4e28 FLOPs in training (Llama 3 405B uses 4e25 FLOPs). With 20% utilization of FP8 on an Nvidia Blackwell, that’s 2M GPUs running for 200 days, a $25 billion training run.
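Sanity-checking that arithmetic with the figures from this paragraph, plus an assumed ~4.5e15 FLOP/s of dense FP8 per Blackwell GPU (my assumption, not a figure quoted above), gives numbers in the same ballpark; the exact day count mostly depends on that per-GPU assumption:

```python
# Rough check of the data-wall arithmetic. The per-GPU FP8 throughput is an
# assumption (dense FP8 on a B200-class part), not a figure from the text.
tokens_unique = 50e12          # 50T token dataset
repetitions = 15
tokens_per_param = 80

tokens_total = tokens_unique * repetitions            # 7.5e14 tokens seen
params = tokens_total / tokens_per_param              # ~9.4e12 ("9T active parameters")
train_flops = 6 * params * tokens_total               # ~4e28 FLOPs (6ND rule of thumb)

gpu_fp8_flops = 4.5e15         # assumed dense FP8 throughput per GPU
utilization = 0.20
n_gpus = 2e6
days = train_flops / (gpu_fp8_flops * utilization * n_gpus) / 86400

print(f"params ~{params:.1e}, FLOPs ~{train_flops:.1e}, days ~{days:.0f}")
# -> params ~9.4e+12, FLOPs ~4.2e+28, days ~271
```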
What do you mean by clearly somewhat better? I found Claude 3 Opus clearly worse for my coding tasks. GPT-4 went down for a while, I was forced to swap, and I found it really disappointing. Maximum datacenter size is more like 300K GPUs because of power, bandwidth constraints, etc. These people are optimistic, but I don’t believe we will meaningfully get above 300K: https://www.nextbigfuture.com/2024/07/100-petaflop-ai-chip-and-100-zettaflop-ai-training-data-centers-in-2027.html
xAI and Tesla Autopilot are already running the equivalent of more than 15K GPUs, I expect, so I don’t expect 3 OOMs more to happen.