Completion of the 100K H100s cluster seems to mean Grok-3 won't be trained on only a smaller part of it, so it must be targeting all of it. But Musk also said Grok-3 is planned for end of 2024. So it won't get more than about 2.7e26 FLOPs, about 14x GPT-4 (the training that started at the end of July could have used a larger mini-batch size that anticipates the data parallelism needs of the full cluster, so the same run could continue all the way from July to November). With 6 months of training on the whole cluster, it could instead get up to 5e26 FLOPs (25x GPT-4), but that would need to wait for another run.
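For concreteness, a minimal sketch of the arithmetic behind these figures; the per-GPU peak throughput (~1e15 BF16 FLOP/s for an H100), ~35% utilization, and the ~2e25 FLOPs estimate for GPT-4 are all assumptions rather than known values.

```python
# Rough compute arithmetic for the Grok-3 scenarios (all constants are assumptions).
H100_PEAK_FLOPS = 1e15   # assumed dense BF16 peak per H100, FLOP/s
MFU = 0.35               # assumed model FLOPs utilization
GPT4_FLOPS = 2e25        # assumed GPT-4 training compute

def training_flops(n_gpus: int, months: float) -> float:
    """Total training FLOPs for a cluster running continuously for the given months."""
    seconds = months * 30 * 86400
    return n_gpus * H100_PEAK_FLOPS * MFU * seconds

for months in (3, 6):
    flops = training_flops(100_000, months)
    print(f"100K H100s, {months} months: {flops:.1e} FLOPs (~{flops / GPT4_FLOPS:.0f}x GPT-4)")
# -> roughly 2.7e26 (~14x GPT-4) at ~3 months, ~5e26 (~25-27x) at 6 months
```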
OpenAI has plausibly been training on Microsoft's 100K H100s cluster since May, but there are also claims that the first run only uses 10x GPT-4 compute, which is about 2e26 FLOPs, so it'd take only 2-3 months and pretraining should've concluded by now. Additionally, the run is probably using synthetic data at scale in pretraining, so if that has an effect, Grok-3's hypothetically similar compute won't be sufficient to match the result.
On the other hand, with about 20K H100s, which is the scale that was offered at AWS in July 2023 and might've been available internally at Microsoft even earlier, it only takes about 5 months to get 1e26 FLOPs. So GPT-4o might already be a 5x GPT-4 model. But it could also be an overtrained model (to get better inference efficiency), so it's not expected to be fundamentally much smarter.
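The same assumed constants, inverted, give both the 2-3 month figure for the 2e26 run above and the ~5 month figure for the 20K-H100 case; again, the utilization and per-GPU throughput are guesses.

```python
# Time to reach a FLOP budget at a given cluster size (same assumed constants as above).
H100_PEAK_FLOPS = 1e15   # assumed dense BF16 peak per H100, FLOP/s
MFU = 0.35               # assumed model FLOPs utilization

def months_to_reach(target_flops: float, n_gpus: int) -> float:
    """Months of continuous training needed to accumulate target_flops."""
    seconds = target_flops / (n_gpus * H100_PEAK_FLOPS * MFU)
    return seconds / (30 * 86400)

print(f"2e26 FLOPs on 100K H100s: ~{months_to_reach(2e26, 100_000):.1f} months")  # ~2.2
print(f"1e26 FLOPs on 20K H100s:  ~{months_to_reach(1e26, 20_000):.1f} months")   # ~5.5
```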