Training as it’s currently done needs to happen within a single cluster
I think that’s probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in its technical report:
TPUv4 accelerators are deployed in “SuperPods” of 4096 chips... TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet).
This might require bandwidth of about 300 Tbps for a 500K B200 system (connecting its geographically distributed parts), based on the estimate below. It gets worse with scale.
The “cluster” label applied in this context might be a bit of a stretch. For example, the Llama 3 cluster of 24K H100s is organized in pods of 3072 GPUs; the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1 of the Llama 3 paper).
Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn’t matter, only bandwidth. I’m not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that’s 1.6 TB of data to be sent each way in much less than 6 seconds, say in 1 second. That’s a bandwidth of about 13 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.
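A minimal back-of-the-envelope sketch of that arithmetic in Python (the 4 bytes per weight and the 1 second transmission window are the assumptions from above, not measured values):

```python
# Bandwidth needed to exchange averaged gradients between the
# geographically separated parts of the training system.
params = 405e9                              # Llama 3 405B weights
bytes_per_grad = 4                          # assumed precision for averaging
grad_bytes = params * bytes_per_grad        # ~1.6e12 bytes = 1.6 TB each way
window_s = 1.0                              # budget, well under the ~6 s step
bandwidth_bps = grad_bytes * 8 / window_s
print(f"{bandwidth_bps / 1e12:.0f} Tbps")   # ~13 Tbps
```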
Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within an NVLink scaleup domain that enables tensor parallelism. If each token sequence is processed on 8 GPUs (at a given pipeline parallelism stage), then the 16K GPUs of the Llama 3 405B training run must process 2K sequences at once, and with 8K tokens per sequence that’s the 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), fewer sequences would be processed simultaneously, for smaller minibatches, and the time between optimizer steps would decrease from Llama 3 405B’s 6 seconds to less than that, making the necessary gradient communication bandwidth higher.
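The same layout arithmetic, spelled out (a sketch under the parallelism assumptions above):

```python
# Minibatch size implied by the parallelism layout of the Llama 3 405B run.
gpus = 16_384                 # H100s used for training
tensor_parallel = 8           # GPUs per sequence (one NVLink scaleup domain)
seq_len = 8192                # tokens per sequence
sequences = gpus // tensor_parallel            # 2K sequences at once
tokens_per_minibatch = sequences * seq_len     # ~16M tokens
steps = 16e12 / tokens_per_minibatch           # ~16T tokens total -> ~1M steps
print(sequences, tokens_per_minibatch, round(steps))
```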
Some B200s come as NVL72 machines with 72 GPUs per scaleup domain, and with more weights there will be more gradient data for those models. Llama 3 405B has 16Kx53K matrices and 8K token sequences, so at 3 TB/s of HBM bandwidth and 1e15 FLOP/s (for an H100), you need tiles of size at least about 1000x1000 to get sufficient arithmetic intensity. The scaleup network is a bit over 3 times slower than HBM, which is almost sufficient to move the results along (and starts to fit if we increase the inner dimension, with the tiles no longer square). So as far as I understand (could be very wrong, without experience to anchor the numbers), in principle there is enough work there for a bit less than 8x16x53 GPUs (tiling the multiplication of a 16Kx53K matrix by a 53Kx8K matrix in 1Kx1K squares). More than 1000 such GPUs could participate in tensor parallelism for Llama 3 405B if the network could handle it, so in particular the 72 GPUs of an NVL72 are few enough to run such multiplications with tensor parallelism.
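A sketch of the arithmetic intensity check behind the 1000x1000 tile figure and the tile count, under the assumed H100 numbers (1e15 FLOP/s, 3 TB/s HBM, BF16 inputs; the exact accounting of reads is my guess at the reasoning, not something from the Llama 3 paper):

```python
# How big a tile must be for compute, rather than HBM bandwidth, to be the bottleneck.
flops_per_s = 1e15                       # assumed H100 dense BF16 rate
hbm_bytes_per_s = 3e12                   # assumed H100 HBM bandwidth
required_intensity = flops_per_s / hbm_bytes_per_s   # ~333 FLOP per byte read

# An n x n output tile accumulated over an n-long slice of the inner dimension:
# 2*n^3 FLOPs against ~2*(2*n^2) bytes of BF16 inputs read, so intensity ~ n/2.
n = 1000
intensity = 2 * n**3 / (2 * 2 * n**2)    # 500 FLOP/byte at n = 1000
print(intensity >= required_intensity)   # True

# Number of such 1Kx1Kx1K tile tasks in a 16Kx53K by 53Kx8K multiplication:
print((16_384 // n) * (53_248 // n) * (8_192 // n))   # 16 * 53 * 8 = 6784
```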
With 72 B200s per NVLink domain in a 500K B200 system, that’s 7K sequences per minibatch, 3x more than for Llama 3 405B[3]. The compute per second, and so per training run, is larger than with 16K H100s by a factor of 80, so by the Chinchilla scaling law a dense model would be about 9 times larger, 3.5T parameters. So the model is 9x larger, processed over 9x more GPUs (per NVLink domain) that are 2.5 times faster, which means an optimizer step is 2.5 times shorter. This assumes the sequence length stays 8K (if it’s higher then so is the time between optimizer steps, reducing the necessary bandwidth). Transmitting gradients for 9x more weights in that shorter time requires bandwidth that’s about 20 times higher, about 300 Tbps.
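Putting those scaling steps together in one sketch (Chinchilla square-root scaling of parameters with compute, an assumed 2.5x per-GPU speedup of B200 over H100, and the same 4 bytes per weight as before):

```python
# Scaling the gradient bandwidth estimate from Llama 3 405B to a 500K B200 system.
compute_ratio = (500e3 * 2.5) / 16e3     # ~80x the FLOP/s of 16K H100s
model_scale = compute_ratio ** 0.5       # Chinchilla: params ~ sqrt(compute), ~9x
params = 405e9 * model_scale             # ~3.6T parameters

# The optimizer step shrinks ~2.5x (9x larger model on 9x more GPUs per NVLink
# domain, each ~2.5x faster), so the transmission window shrinks with it.
window_s = 1.0 / 2.5
bandwidth_tbps = params * 4 * 8 / window_s / 1e12
print(f"{model_scale:.0f}x params, {bandwidth_tbps:.0f} Tbps")   # ~9x, just under 300 Tbps
```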
That’s still within the realm of possibility: some ocean floor cables feature bandwidth on the same order of magnitude, and overland cables should enable more. But it’s no longer likely to be trivial; it could require actually laying cables between the datacenter campus sites, which could take a long time to get all the permissions and do the construction.
[1] 16K GPUs at 40% utilization (40% of 1e15 FLOP/s for each GPU) for about 4e25 dense BF16 FLOPs in total, and 16M tokens per minibatch (Table 4) out of about 16T tokens in total.
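Spelled out, that arithmetic gives the ~6 seconds per step and ~1M steps used above:

```python
total_flops = 4e25                   # dense BF16 FLOPs for the whole run
rate = 16_384 * 0.4 * 1e15           # 16K H100s at 40% of 1e15 FLOP/s each
run_seconds = total_flops / rate     # ~6e6 seconds of training
steps = 16e12 / 16e6                 # 16T tokens / 16M tokens per minibatch
print(run_seconds / steps)           # ~6 seconds per optimizer step
```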
[2] This gives another way of getting the estimate of 6 seconds per step, one that doesn’t depend on the size of the cluster at all. The compute for one sequence is 6 times 405B parameters times 8K tokens, processed by 8 GPUs (at some pipeline parallelism stage), each at a rate of 1e15 FLOP/s with 40% utilization on average, so it takes them about 6 seconds to process a sequence.
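And the per-sequence version of the same estimate:

```python
flops_per_sequence = 6 * 405e9 * 8192   # ~2e16 FLOPs: 6 * params * tokens
rate = 8 * 0.4 * 1e15                   # 8 GPUs at 40% of 1e15 FLOP/s each
print(flops_per_sequence / rate)        # ~6 seconds per sequence
```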
[3] So making NVLink domains 9x larger only kept the problem of large minibatches from getting more than about 3 times worse. This is still much better than the 150K sequences per minibatch we’d get if the same compute were assembled in the form of 1200K H100s with 8 GPUs per NVLink domain.
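For comparison, the minibatch sizes under the two layouts:

```python
# Sequences processed at once = total GPUs / GPUs per scaleup domain.
print(500_000 // 72)        # ~7K sequences with 72-GPU NVLink domains
print(1_200_000 // 8)       # 150K sequences with 8-GPU NVLink domains
```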