And xAI was working on a 100k H100 cluster that was on track to be finished in July.
According to DCD, that should be fall 2025. Planned power is 150 megawatts (or possibly 50+150 megawatts), which is good for 100K H100s but not more than that. The request for the 150 megawatts is still being discussed by the utilities, as of August 2024. Any future Blackwells will need to go elsewhere; the whole plan for this datacenter seems to be the 100K H100s. (A cluster of that size costs about $5bn, and xAI only closed its $6bn Series B in May 2024.)
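As a rough sanity check on that power figure (my own back-of-envelope, not from DCD): an H100 SXM draws about 700 W, and all-in draw per GPU is commonly estimated at roughly 1.4-1.5 kW once servers, networking, and cooling are included, so 150 megawatts is just about enough for 100K H100s and no more.

```python
# Back-of-envelope check that ~150 MW matches ~100K H100s.
# Assumptions (mine, not from the DCD article): ~700 W per H100 SXM,
# roughly doubled to ~1.4 kW all-in once servers, networking, and
# cooling overhead are included.
gpus = 100_000
all_in_kw_per_gpu = 1.4   # assumed all-in draw per H100, in kW
total_mw = gpus * all_in_kw_per_gpu / 1_000
print(f"{total_mw:.0f} MW")   # ~140 MW, so 150 MW leaves little headroom for Blackwells
```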
according to Elon … Grok 2 was trained on 24k H100s
This scale seems to be available from AWS, and about a month at this scale is enough to invest GPT-4 levels of compute. Grok-2 was probably rushed once it was ready to train, in order to finally get a 4-level model, so it didn’t train for very long. If 100K H100 clusters remain impossible to access, and the full Memphis datacenter (with significantly more H100s than 24K) won’t come online for at least a few more months, the reasonable thing right now is to simply train on 24K H100s for more months. That’s probably going to be Grok-3.
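To see why a month is roughly enough, here is a sketch under assumed numbers (~1e15 dense BF16 FLOP/s per H100, ~35% utilization, and ~2e25 FLOPs as the rumored GPT-4 training compute; these figures are my assumptions, not anything xAI has stated):

```python
# Rough estimate: compute from 24K H100s over one month.
# Assumptions: ~1e15 dense BF16 FLOP/s per H100, ~35% utilization,
# and ~2e25 FLOPs as the rumored GPT-4 training compute.
gpus = 24_000
flops_per_gpu = 1e15          # dense BF16, approximate
utilization = 0.35
seconds_per_month = 30 * 24 * 3600
total = gpus * flops_per_gpu * utilization * seconds_per_month
print(f"{total:.1e}")         # ~2.2e25 FLOPs, i.e. GPT-4-level compute in about a month
```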
Unless Elon is lying, it was operational as of July, though perhaps only with about 32k of the H100s rather than all of them. My understanding is that at least 64k are operational now.
The request for the 150 megawatts is still being discussed by the utilities, as of August 2024.
Yes, though mobile generators are in use which could power at least a large fraction of the H100s. See discussion here.
Seems to be fully online as of now (Sep. 2) based on this tweet?

I now think this is false. From The Information:

Musk claims xAI built a cluster of 100,000 Nvidia H100 GPUs—one of the most advanced broadly available chips—in a facility in Memphis, Tenn.
In a post on Monday, Musk said the 100,000-chip cluster, known as Colossus, is already up and running and is the “most powerful AI training system in the world.” Two people with direct knowledge of xAI’s chip order and power capacity at the site said they believed that fewer than half of those chips are currently in operation, largely because of constraints involving power or networking gear.
Whether Musk’s claims are embellished or not, they have caused a stir among other top AI developers, which fear falling behind. OpenAI CEO Sam Altman, for instance, has told some Microsoft executives he is concerned that xAI could soon have more access to computing power than OpenAI does, according to someone who heard his comments.
Keep in mind Musk never said it was “fully online” or “100,000 GPUs are running concurrently” or anything like that. He only said that the cluster was “online”, which could mean just about anything, and that it is “the most powerful AI training system”, which is unfalsifiable (who can know how powerful every AI training system is worldwide, including all of the secret proprietary ones by FANG etc?) and obvious pure puffery (“best pizza in the world!”). If you fell for it, well, then the tweet was for you.
I wonder if it’s all running on generators, and what this means about Grok-3. With 30K H100s, 1.5 months only gets you 4e25 FLOPs, the Llama-3 compute. I’m guessing they’d want 1e26 FLOPs or so to get a meaningful improvement over Grok-2, which is 2 more months at that scale. But in 2 months, 100K H100s give 1.6e26 FLOPs (I’m assuming slightly worse utilization).
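The arithmetic behind those two figures, using an assumed ~1e15 dense BF16 FLOP/s per H100 and the utilization levels (~33% and ~31%) implied by the numbers above:

```python
# Reproducing the two figures under stated assumptions:
# ~1e15 dense BF16 FLOP/s per H100, ~33% utilization for the 30K run,
# slightly worse (~31%) for the 100K run.
flops_per_gpu = 1e15
month = 30 * 24 * 3600

small = 30_000 * flops_per_gpu * 0.33 * 1.5 * month    # ~3.9e25, the "4e25" figure
large = 100_000 * flops_per_gpu * 0.31 * 2.0 * month   # ~1.6e26
print(f"{small:.1e}  {large:.1e}")
```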
Maybe figuring out how to efficiently fold more compute into a run that has already started is part of the plan, so that in a few more months the mentioned scaleup by a further 50K H100s and 50K H200s could happen mid-run for Grok-4? Sounds dubious.
The Memphis datacenter might be operational in some form, but the 100K H100s cluster is not operational, and I was responding to elifland’s specific claim about “a 100k H100 cluster that was on track to be finished in July”. The point is that the scale beyond what you can get from AWS is not going to be available for some time. This is a point journalists repeatedly got wrong: what is claimed is that something was operational in July and that the datacenter is planned to have 100K H100s, but it doesn’t follow that 100K H100s were operational in July.
By analogy with Llama-3-405b, Grok-2 started training no later than Mar-Apr 2024 (it needed to finish pre-training and then go through RLHF), so it wasn’t trained using the Memphis datacenter. And in its current state, the Memphis datacenter won’t significantly improve on that scale; the bulk of the improvement would need to come from training for more months. If, by the end of 2024, both the 100K H100s and the 150 megawatt substation are ready, then xAI will start to catch up with OpenAI, which might have been training at that scale since May.
mobile generators are in use which could power at least a large fraction of the H100s. See discussion here.
So Grok-3 is probably using these 30K H100s instead of rented compute like Grok-2 did. This seems to be a wash in terms of scale, more a way of keeping the 30K H100s in use and getting experience for the subsequent 100K run. Targeting end of 2024 for the Grok-3 release means it finishes pre-training in late 2024, maybe Oct-Nov 2024 (leaving some time for RLHF until the end of 2024), so this is some evidence that the 100K H100s only come online around the end of 2024; otherwise Grok-3 could be trained for longer. As it is, it’s going to get about 1e26 FLOPs. Since Grok-1 was MoE (unlike Llama-3-405b), this has a chance of being better than the current SOTA as of Aug 2024, but by the end of 2024 there might already be Claude 3.5 Opus or a new Gemini.
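Where the ~1e26 comes from, under assumed dates (30K H100s running from roughly July to an Oct-Nov pre-training cutoff, so ~4 months) and the same assumed per-GPU throughput and utilization as above:

```python
# Grok-3 compute estimate under assumed dates: ~30K H100s for ~4 months
# (roughly July to an Oct-Nov pre-training cutoff), ~1e15 dense BF16
# FLOP/s per GPU, ~33% utilization.
gpus = 30_000
flops_per_gpu = 1e15
utilization = 0.33
months = 4
total = gpus * flops_per_gpu * utilization * months * 30 * 24 * 3600
print(f"{total:.1e}")   # ~1.0e26 FLOPs
```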