Is X.AI currently performing the largest training run?
This source claims it is.
PCMag is not so sure (here, here).
If so, it seems to be getting a lot less attention than its compute capability would warrant.
I'm not sure I've stated this clearly before, but I believe scaling laws will not hold for LLM/Transformer-type tech, and at least one major architectural advance is still missing before AGI. That is, continued scaling of compute and data will plateau performance soon, and before AGI. Therefore I expect to see evidence of this not long after the end of this year, when large training runs yield models that are a lot more expensive to train, slower at inference, and only a little better in performance. X.AI could be one of the first to let this be known publicly (OpenAI etc. could very well already be aware of it but not be making it public).
Completion of the 100K H100s cluster seems to mean Grok-3 won't be trained on only a smaller part of it, so it must be targeting all of it. But Musk has also said Grok-3 is planned for the end of 2024. So it won't get more than about 2.7e26 FLOPs, about 14x GPT-4 (the training that started at the end of July could have just used a larger mini-batch size that anticipates the data-parallelism needs of the larger cluster, so the same run could continue all the way from July to November). With 6 months of training on the whole cluster, it could instead get up to 5e26 FLOPs (25x GPT-4), but that would need to wait for another run.
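For concreteness, here is a rough sketch of the arithmetic behind those figures. The 1e15 BF16 FLOP/s per H100 matches the number quoted later in the thread; the ~30% utilization is my assumption, and the ~2e25 FLOPs GPT-4 baseline is implied by the "10x GPT-4 = 2e26 FLOPs" figure in the next paragraph.

```python
# Rough sketch of the compute estimates above. Per-GPU BF16 rate is the
# figure quoted later in this thread; the ~30% utilization and the ~2e25
# FLOPs GPT-4 baseline are assumptions chosen to line up with the 14x/25x
# figures in the comment.

H100_BF16_FLOPS = 1e15      # dense BF16 throughput per GPU, FLOP/s
UTILIZATION = 0.30          # assumed model FLOPs utilization
GPT4_FLOPS = 2e25           # assumed GPT-4 training compute

def training_flops(num_gpus: int, days: float) -> float:
    """Total useful training FLOPs for a cluster running for `days`."""
    return num_gpus * H100_BF16_FLOPS * UTILIZATION * days * 86400

# End of July to end of November on 100K H100s (~105 days of pretraining):
run_jul_nov = training_flops(100_000, 105)
print(f"{run_jul_nov:.1e} FLOPs, {run_jul_nov / GPT4_FLOPS:.0f}x GPT-4")
# -> ~2.7e26 FLOPs, ~14x GPT-4

# A full 6-month run on the whole cluster:
run_6mo = training_flops(100_000, 182)
print(f"{run_6mo:.1e} FLOPs, {run_6mo / GPT4_FLOPS:.0f}x GPT-4")
# -> ~4.7e26 FLOPs, ~24x GPT-4 (roughly the 5e26 / 25x figure above)
```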
OpenAI has plausibly been training on Microsoft's 100K H100s cluster since May, but there are also claims that the first run uses only 10x GPT-4 compute, which is 2e26 FLOPs, so it would take only 2-3 months and pretraining should have concluded by now. Additionally, OpenAI is probably using synthetic data at scale in pretraining, so if that has an effect, Grok-3's hypothetically similar compute won't be sufficient to match the result.
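Under the same assumptions as the sketch above (1e15 BF16 FLOP/s per H100, ~30% utilization assumed), the 2e26 FLOPs figure does work out to roughly 2-3 months on a 100K-H100 cluster:

```python
# How long a 100K-H100 cluster needs to reach the claimed 10x-GPT-4 run.
# Same assumed rates as the previous sketch.

TARGET_FLOPS = 2e26
CLUSTER_FLOPS_PER_S = 100_000 * 1e15 * 0.30   # useful FLOP/s of the cluster

days = TARGET_FLOPS / CLUSTER_FLOPS_PER_S / 86400
print(f"{days:.0f} days (~{days / 30:.1f} months)")
# -> ~77 days (~2.6 months), consistent with the "2-3 months" above
```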
On the other hand, with about 20K H100s, which is the scale that was offered at AWS in July 2023 and might have been available internally at Microsoft even earlier, it only takes 5 months to get 1e26 FLOPs. So GPT-4o might already be a 5x GPT-4 model. But it could also be an overtrained model (to get better inference efficiency), so it is not expected to be fundamentally much smarter.
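As a sanity check on the 20K-H100 figure: landing on 1e26 FLOPs in 5 months requires a somewhat higher utilization than the sketches above, around 40%, which is my assumption here.

```python
# 20K H100s for ~5 months, assuming a higher ~40% utilization than in the
# sketches above; with ~30% the same run lands closer to 8e25 FLOPs.

GPT4_FLOPS = 2e25                              # assumed GPT-4 training compute
run = 20_000 * 1e15 * 0.40 * 150 * 86400       # GPUs * FLOP/s * util * seconds
print(f"{run:.1e} FLOPs, {run / GPT4_FLOPS:.0f}x GPT-4")
# -> ~1.0e26 FLOPs, ~5x GPT-4
```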
On priors, I think that Google DeepMind is currently running the biggest training run.
Google has very large datacenters, if measured in megawatts, but they are filled with older TPUs. Maybe those are fine compared to H100s on a FLOP/joule basis, though? In BF16, going from A100 (0.3e15 FLOP/s, 400W) to H100 (1e15 FLOP/s, 700W) to B100 (1.8e15 FLOP/s, 700W) notably improves FLOP/joule, but for recent TPUs the TDP is not disclosed (and the corresponding fraction of the rest of the datacenter needs to be taken into account; for example, it turns the 700W of an H100 into about 1500W). In terms of FLOP/s per chip, only the latest TPU generation announced in May 2024 matches the H100, and it might take time to install enough of them.
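Plugging the chip figures above into the FLOP/joule comparison (the ~2.1x overhead factor is just the 700W-to-1500W ratio mentioned in the paragraph):

```python
# FLOP/joule comparison using the dense BF16 throughput and TDP figures
# quoted above; the datacenter overhead factor comes from the 700W -> ~1500W
# all-in figure for an H100.

chips = {
    "A100": (0.3e15, 400),   # (BF16 FLOP/s, TDP in watts)
    "H100": (1.0e15, 700),
    "B100": (1.8e15, 700),
}
OVERHEAD = 1500 / 700  # whole-datacenter watts per watt of GPU TDP

for name, (flops, watts) in chips.items():
    chip_eff = flops / watts                 # FLOP per joule at the chip
    dc_eff = flops / (watts * OVERHEAD)      # FLOP per joule, datacenter-wide
    print(f"{name}: {chip_eff:.1e} FLOP/J chip, {dc_eff:.1e} FLOP/J all-in")
# A100: ~7.5e11 chip, ~3.5e11 all-in
# H100: ~1.4e12 chip, ~6.7e11 all-in
# B100: ~2.6e12 chip, ~1.2e12 all-in
```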
They seem to have big plans for next year, but possibly they are not yet quite ready to be significantly ahead of 100K H100s clusters.
Thanks, that updates me. I’ve been enjoying your well-informed comments on big training runs, thank you!