[minor technical disputes below; ignore if uninterested]
This might be an issue when training on H100 at this scale[1], and might explain some scaling difficulties for labs other than Google (or, later in 2025, other than Anthropic, once the Trn2 cluster becomes useful).
Llama 3 405B was trained in minibatches of 2K sequences of 8K tokens, the smallest that the 8-GPU scale-up domains of a 16K H100 cluster allow. If it were clearly optimal for minibatches to be larger, it would be trivial to make them so, so they are probably already too large.
I’m a bit confused by this part. I believe the l3 paper indicates the training seqlen was increased mid-training.
In general, I don’t understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.
For H100, that’s only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that’s a whole 256-chip pod, and this wasn’t a constraint in older TPUs either. For Trn2, that’s either 16 or 64 GPUs, in the standard and Ultra variants respectively.
I think it’s plausible that the combination of torus topology + poor PCIe 5.0 bandwidth/latency will make a full TP=64 Trn2 config underperform your expectations, but we may have to wait for SemiAnalysis to provide good numbers on this.
In general, I don’t understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.
Pipeline parallelism doesn’t reduce batch size: it only moves the processing of a given sequence around the cluster in stages, while the number of sequences the cluster is processing at any given time stays the same (the time needed to process a layer for a sequence doesn’t change, so the time between optimizer steps doesn’t change, apart from bubbles). Tensor parallelism spreads the processing of a single sequence across multiple GPUs, so fewer sequences are processed at once within the cluster, which can be used to reduce the batch size (the time needed to process a layer for a sequence is divided by the tensor parallel degree, so the time between optimizer steps shrinks, and with it the total compute expended on a batch, in proportion to the number of sequences in it). But tensor parallelism only works within a scale-up world without murdering compute utilization, which puts a bound on how much you can reduce the batch size.
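A minimal sketch of that counting, in case it helps (my framing, not from the original exchange; it assumes the smallest useful batch has one sequence per data-parallel replica per optimizer step):

```python
# Minimal sketch of the minimum-minibatch counting described above.
# Assumption: the smallest useful batch has one sequence per model replica
# per optimizer step; gradient accumulation only makes batches larger.

def min_minibatch_tokens(num_chips: int, tp_degree: int, seq_len: int) -> int:
    """Smallest minibatch (in tokens) that keeps every chip busy.

    Tensor parallelism of degree `tp_degree` makes that many chips share one
    sequence, so the number of concurrent sequence streams is num_chips / tp_degree.
    Pipeline parallelism only moves a sequence between stages and doesn't shrink
    the number of sequences in flight, so it doesn't enter the formula.
    """
    replicas = num_chips // tp_degree   # independent sequence streams
    return replicas * seq_len           # tokens per optimizer step

# 16K H100 cluster, TP capped at the 8-GPU NVLink domain, 8K-token sequences:
print(min_minibatch_tokens(16_384, 8, 8_192))  # 16,777,216 ≈ 16M tokens, i.e. 2K sequences
```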
I believe the l3 paper indicates the training seqlen was increased mid-training.
Section 3.4 says they start with sequences of length 4K, move to sequences of length 8K after 250M tokens, then to 16M tokens per batch after 2.9T tokens, and finally to long context training in the last 800B tokens (out of about 15T tokens in total). So 11T out of 15T tokens were learned in batches of 2K sequences of length 8K.
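For the bookkeeping, a small sketch of that schedule (token counts are the rounded figures above; the sequences-per-batch numbers are my reading of the paper's 4M/8M/16M-token batch sizes, not something stated in this thread):

```python
# Llama 3 405B pre-training schedule as summarized above (rounded token counts).
TOTAL = 15e12
LONG_CONTEXT = 0.8e12  # final long-context phase, set aside below

phases = [
    # (label, tokens in phase, sequence length, sequences per batch)
    ("stage 1", 0.25e12,                       4_096, 1_024),  # ~4M-token batches
    ("stage 2", 2.9e12 - 0.25e12,              8_192, 1_024),  # ~8M-token batches
    ("stage 3", TOTAL - LONG_CONTEXT - 2.9e12, 8_192, 2_048),  # ~16M-token batches
]

for label, tokens, seq_len, seqs in phases:
    batch_tokens = seqs * seq_len
    print(f"{label}: {tokens / 1e12:.2f}T tokens at {seqs} x {seq_len} "
          f"= {batch_tokens / 2**20:.0f}M tokens per batch")
# stage 3 comes out to ~11.3T tokens in batches of 2,048 sequences of 8K tokens,
# the "11T out of 15T" figure above.
```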
I think it’s plausible that the combination of torus topology + poor PCIe 5.0 bandwidth/latency will make a full TP=64 Trn2 config underperform your expectations
Good catch: TP=32 on 400K Trn2 gives the same batch size as TP=8 on 100K H100, so there is only an advantage with TP=64, which is not a priori a sure thing to work well. And a hypothetical non-Ultra 400K Trn2 cluster, with its 16-GPU scale-up worlds, is worse, even though there’s more compute in 16 Trn2 than in 8 H100. Though it would be surprising if the Rainier cluster doesn’t have the Ultra config; what else would it be for?
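To make the comparison concrete, the same counting as in the sketch above, applied to the cluster sizes in this thread (round 100K/400K chip counts; 8K sequence length assumed):

```python
# Minimum sequences per optimizer step = chips / TP degree, as in the sketch above.
# Cluster sizes are the round figures from this thread; 8K sequences assumed.
configs = [
    ("100K H100, TP=8 ", 100_000,  8),
    ("400K Trn2, TP=16", 400_000, 16),  # hypothetical non-Ultra scale-up world
    ("400K Trn2, TP=32", 400_000, 32),
    ("400K Trn2, TP=64", 400_000, 64),  # Ultra scale-up world
]
for label, chips, tp in configs:
    seqs = chips // tp
    print(f"{label}: {seqs:>6,} sequences (~{seqs * 8_192 / 1e6:.0f}M tokens) per step")
# TP=32 on 400K Trn2 matches TP=8 on 100K H100 (12,500 sequences); only TP=64
# halves the minimum, and the non-Ultra TP=16 doubles it.
```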