Musk claims xAI built a cluster of 100,000 Nvidia H100 GPUs—one of the most advanced broadly available chips—in a facility in Memphis, Tenn.
In a post on Monday, Musk said the 100,000-chip cluster, known as Colossus, is already up and running and is the “most powerful AI training system in the world.” Two people with direct knowledge of xAI’s chip order and power capacity at the site said they believed that fewer than half of those chips are currently in operation, largely because of constraints involving power or networking gear.
Whether Musk’s claims are embellished or not, they have caused a stir among other top AI developers, which fear falling behind. OpenAI CEO Sam Altman, for instance, has told some Microsoft executives he is concerned that xAI could soon have more access to computing power than OpenAI does, according to someone who heard his comments.
Keep in mind Musk never said it was “fully online” or that “100,000 GPUs are running concurrently” or anything like that. He only said that the cluster was “online”, which could mean just about anything, and that it is “the most powerful AI training system”, which is unfalsifiable (who can know how powerful every AI training system in the world is, including all the secret proprietary ones at FAANG and the like?) and obviously pure puffery (“best pizza in the world!”). If you fell for it, well, then the tweet was for you.
I wonder if it’s all running on generators, and what this means about Grok-3. With 30K H100s, 1.5 months only gets you about 4e25 FLOPs, roughly the Llama-3 405B training compute. I’m guessing they’d want 1e26 FLOPs or so to see a meaningful improvement over Grok-2, which at that rate is 2 more months. But in 2 months, 100K H100s give 1.6e26 FLOPs (assuming slightly worse utilization at the larger scale).
Maybe figuring out how to efficiently fold more compute into a run that has already started is part of the plan, so that in a few more months the mentioned scale-up to a further 50K H100s and 50K H200s could happen mid-run for Grok-4? Sounds dubious.
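The arithmetic behind these estimates can be sketched as follows. The H100 peak throughput and the MFU (model FLOPs utilization) values are my assumptions, chosen so the sketch reproduces the numbers above; actual utilization for a run like this is not public:

```python
# Back-of-the-envelope training compute: FLOPs = GPUs * seconds * peak * MFU.
H100_PEAK = 989e12          # dense BF16 peak FLOP/s per H100 (no sparsity)
SECONDS_PER_MONTH = 30 * 24 * 3600

def training_flops(n_gpus: int, months: float, mfu: float) -> float:
    """Total training FLOPs at the given model FLOPs utilization (MFU)."""
    return n_gpus * months * SECONDS_PER_MONTH * H100_PEAK * mfu

# Assumed MFUs of 0.35 and 0.31 ("slightly worse utilization" at 100K scale).
print(f"{training_flops(30_000, 1.5, 0.35):.1e}")   # ~4e25, the Llama-3 scale
print(f"{training_flops(100_000, 2.0, 0.31):.1e}")  # ~1.6e26
```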
Seems to be fully online as of now (Sep. 2) based on this tweet?
I now think this is false. From The Information: