In general, I don’t understand why scaling difficulties are being linked to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers for the other forms of parallelism.
Pipeline parallelism doesn’t reduce batch size: it just moves the processing of a given sequence around the cluster in stages, while the number of sequences the cluster is processing at any given time stays the same (the time needed to process a layer for a sequence doesn’t change, so the time between optimizer steps doesn’t change, other than through bubbles). Tensor parallelism spreads the processing of a single sequence across multiple GPUs, so fewer sequences are processed at once within the cluster, which can be used to reduce the batch size (the time needed to process a layer for a sequence is divided by the degree of tensor parallelism, so the time between optimizer steps shrinks, and so does the total compute expended on a batch, in proportion to the number of sequences in it). But you can only do tensor parallelism within a scale-up world without murdering compute utilization, which puts a bound on how much you can reduce the batch size.
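A minimal sketch of the arithmetic, assuming each tensor-parallel group processes one sequence at a time, all remaining parallelism is data parallelism, and there is no gradient accumulation (the function name and numbers are illustrative, not from the comment):

```python
# Rough lower bound on the per-step batch size implied by a given
# tensor-parallel degree: each TP group is one independent sequence stream.
def min_batch_tokens(num_chips: int, tp_degree: int, seq_len: int) -> int:
    tp_groups = num_chips // tp_degree   # independent sequence streams
    return tp_groups * seq_len           # tokens per optimizer step

# 100K H100s with TP=8 (one NVLink scale-up world) and 8K sequences:
print(min_batch_tokens(100_000, 8, 8_192))   # ~102M tokens per step
# Raising TP to 64 would cut this bound by 8x, but only if TP can stay
# inside a scale-up world without killing utilization.
print(min_batch_tokens(100_000, 64, 8_192))  # ~12.8M tokens per step
```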
I believe the Llama 3 paper indicates the training sequence length was increased mid-training.
Section 3.4 says they start with sequences of length 4K, move to sequences of length 8K after 250M tokens, increase the batch size to 16M tokens after 2.9T tokens, and finally switch to long-context training for the last 800B tokens (out of about 15T tokens in total). So roughly 11T of the 15T tokens were learned in batches of 2K sequences of length 8K.
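Spelling out that arithmetic (token counts rounded, as in the comment above):

```python
# ~16M tokens per batch at 8K tokens per sequence gives the sequence count.
seq_len = 8_192
batch_tokens = 16_000_000
print(batch_tokens // seq_len)          # 1953, i.e. ~2K sequences per batch

# Tokens trained at that batch size: from the 2.9T switch point up to the
# start of the final ~800B-token long-context phase, out of ~15T total.
total = 15e12
long_context = 0.8e12
switch_point = 2.9e12
print((total - long_context - switch_point) / 1e12)   # ~11.3T tokens
```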
I think it’s plausible that the combination of torus topology + poor PCIe 5.0 bandwidth/latency will make a full TP=64 Trn2 config underperform your expectations.
Good catch: TP=32 on 400K Trn2 gives the same batch size as TP=8 on 100K H100, so there is only an advantage at TP=64, which is not a priori a sure thing to work well. And a hypothetical non-Ultra 400K Trn2 cluster, with its 16-chip scale-up worlds, is worse, even though there’s more compute in 16 Trn2 than in 8 H100. Though it would be surprising if the Rainier cluster doesn’t have the Ultra config, since what else is it supposed to be for?
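To make that comparison concrete, here is the same simplified lower-bound model as in the earlier sketch (my own illustrative assumptions: 8K sequences, TP limited to one scale-up world, no gradient accumulation):

```python
# Rough lower bound: tokens per optimizer step = (chips // TP degree) * seq_len.
def min_batch_tokens(num_chips: int, tp_degree: int, seq_len: int = 8_192) -> int:
    return (num_chips // tp_degree) * seq_len

print(min_batch_tokens(100_000, 8))    # H100, TP=8 within NVLink:   ~102M tokens
print(min_batch_tokens(400_000, 32))   # Trn2, TP=32:                ~102M tokens (same bound)
print(min_batch_tokens(400_000, 64))   # Trn2 Ultra, TP=64:           ~51M tokens (2x smaller)
print(min_batch_tokens(400_000, 16))   # non-Ultra Trn2, TP=16:      ~205M tokens (worse)
```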