I wonder if it’s all running on generators, and what this means for Grok-3. With 30K H100s, 1.5 months only gets about 4e25 FLOPs, roughly the Llama-3 compute. I’m guessing they’d want 1e26 FLOPs or so to get a meaningful improvement over Grok-2, which would take about 2 more months on the same 30K H100s. But in 2 months, 100K H100s give 1.6e26 FLOPs (I’m assuming slightly worse utilization at the larger scale).
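For anyone who wants to check the arithmetic, here is a rough sketch. The constants are my assumptions, not reported figures: about 989e12 dense BF16 FLOP/s peak per H100, 30-day months, and utilization of roughly 35% on 30K GPUs versus 31% on 100K GPUs, picked so the results match the estimates above.

```python
# Back-of-the-envelope training compute. Every constant here is an assumption.
H100_PEAK_FLOPS = 989e12             # assumed dense BF16 peak per H100, FLOP/s
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month


def training_flops(gpus, months, utilization):
    """Total FLOPs delivered by `gpus` H100s over `months` at a given utilization."""
    return gpus * H100_PEAK_FLOPS * utilization * months * SECONDS_PER_MONTH


def months_to_reach(target_flops, done_flops, gpus, utilization):
    """Extra months needed to go from `done_flops` to `target_flops`."""
    flops_per_month = training_flops(gpus, 1.0, utilization)
    return (target_flops - done_flops) / flops_per_month


# 30K H100s for 1.5 months at an assumed 35% utilization.
small_run = training_flops(30_000, 1.5, 0.35)

# Extra time on the same 30K H100s to reach 1e26 FLOPs.
extra_months = months_to_reach(1e26, small_run, 30_000, 0.35)

# 100K H100s for 2 months at an assumed, slightly worse 31% utilization.
large_run = training_flops(100_000, 2.0, 0.31)

print(f"30K H100s, 1.5 months: {small_run:.1e} FLOPs")
print(f"extra months to 1e26:  {extra_months:.1f}")
print(f"100K H100s, 2 months:  {large_run:.1e} FLOPs")
```

Under these assumptions the three numbers come out to roughly 4e25 FLOPs, a bit over 2 months, and 1.6e26 FLOPs. Changing the utilization guesses moves them proportionally, so treat this as an order-of-magnitude check only.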
Maybe figuring out how to efficiently fold more compute into a run that has already started is part of the plan, so that in a few more months the mentioned scale-up by a further 50K H100s and 50K H200s could happen mid-run for Grok-4? Sounds dubious.