Unless Elon is lying, it was operational as of July, though perhaps only with about 32k of the H100s rather than all of them. My understanding is that at least 64k are operational now.
The request for the 150 megawatts is still being discussed by the utilities, as of August 2024.
Yes, though mobile generators are in use which could power at least a large fraction of the H100s. See discussion here.
Musk claims xAI built a cluster of 100,000 Nvidia H100 GPUs—one of the most advanced broadly available chips—in a facility in Memphis, Tenn.
In a post on Monday, Musk said the 100,000-chip cluster, known as Colossus, is already up and running and is the “most powerful AI training system in the world.” Two people with direct knowledge of xAI’s chip order and power capacity at the site said they believed that fewer than half of those chips are currently in operation, largely because of constraints involving power or networking gear.
Whether Musk’s claims are embellished or not, they have caused a stir among other top AI developers, which fear falling behind. OpenAI CEO Sam Altman, for instance, has told some Microsoft executives he is concerned that xAI could soon have more access to computing power than OpenAI does, according to someone who heard his comments.
Keep in mind Musk never said it was “fully online” or “100,000 GPUs are running concurrently” or anything like that. He only said that the cluster was “online”, which could mean just about anything, and that it is “the most powerful AI training system”, which is unfalsifiable (who can know how powerful every AI training system is worldwide, including all of the secret proprietary ones by FANG etc?) and obvious pure puffery (“best pizza in the world!”). If you fell for it, well, then the tweet was for you.
I wonder if it’s all running on generators, and what this means about Grok-3. With 30K H100s, 1.5 months only gets about 4e25 FLOPs, the Llama-3 compute. I’m guessing they’d want 1e26 FLOPs or so to get a meaningful improvement over Grok-2, which takes about 2 more months on the same 30K H100s. But in 2 months, 100K H100s give 1.6e26 FLOPs (I’m assuming slightly worse utilization).
Maybe part of the plan is figuring out how to efficiently add more compute to a run that has already started, so that in a few more months the mentioned scaleup by a further 50K H100s and 50K H200s could happen mid-run for Grok-4? Sounds dubious.
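For anyone who wants to check the FLOPs arithmetic above, here is a rough sketch. The inputs are my assumptions, not numbers stated in the thread: about 989e12 dense BF16 FLOP/s peak per H100, a 30-day month, and utilization of about 35% for the 30K case and about 31% for the 100K case, picked so the results land near the quoted figures. The function names are made up for illustration.

```python
# Back-of-the-envelope training compute estimate.
# ASSUMPTIONS (not from the thread): H100 dense BF16 peak of ~989e12 FLOP/s,
# a 30-day month, and the utilization values passed in below.
H100_PEAK_FLOPS = 989e12
SECONDS_PER_MONTH = 30 * 24 * 3600


def training_flops(n_gpus, months, utilization, peak=H100_PEAK_FLOPS):
    """Total FLOPs delivered by n_gpus running for `months` at the given utilization."""
    return n_gpus * peak * utilization * months * SECONDS_PER_MONTH


def months_needed(target_flops, n_gpus, utilization, peak=H100_PEAK_FLOPS):
    """Months of training required to reach target_flops on n_gpus."""
    return target_flops / (n_gpus * peak * utilization * SECONDS_PER_MONTH)


if __name__ == "__main__":
    # 30K H100s for 1.5 months at an assumed 35% utilization
    print(f"30K H100s, 1.5 months: {training_flops(30_000, 1.5, 0.35):.2e} FLOPs")
    # 100K H100s for 2 months at an assumed (slightly worse) 31% utilization
    print(f"100K H100s, 2 months:  {training_flops(100_000, 2, 0.31):.2e} FLOPs")
    # Total months on 30K H100s needed to reach 1e26 FLOPs
    print(f"Months to 1e26 on 30K: {months_needed(1e26, 30_000, 0.35):.1f}")
```

Under these assumptions the 30K case comes out near 4e25 FLOPs and the 100K case near 1.6e26, and reaching 1e26 on 30K H100s takes a bit under 4 months in total, which is roughly the "2 more months" on top of the first 1.5. Different utilization or peak-throughput assumptions shift all of these proportionally.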
The Memphis datacenter might be operational in some form, but the 100K H100s cluster is not, and I was responding to elifland’s specific claim about “a 100k H100 cluster that was on track to be finished in July”. The point is that scale beyond what you can get from AWS is not going to be available for some time. Journalists repeatedly got this wrong: the claim is that something was operational in July and that the datacenter is planned to have 100K H100s, but it doesn’t follow that 100K H100s were operational in July.
By analogy with Llama-3-405b, Grok-2 started training no later than Mar-Apr 2024 (it needed to finish pre-training and then go through RLHF), so it wasn’t trained using the Memphis datacenter. In its current state, the Memphis datacenter won’t significantly improve on that scale; the bulk of the improvement would need to come from training for more months. If both the 100K H100s and the 150 megawatt substation are ready by the end of 2024, then xAI will start to catch up with OpenAI, which might have been training at that scale since May.
“mobile generators are in use which could power at least a large fraction of the H100s. See discussion here.”
So Grok-3 is probably using these 30K H100s instead of rented compute like Grok-2. This seems to be a wash in terms of scale, and more a way of keeping the 30K H100s in use and getting experience for the subsequent 100K run. Targeting end of 2024 for the Grok-3 release means it finishes pre-training in late 2024, maybe Oct-Nov 2024 (leaving some time for RLHF until end of 2024). That is some evidence for end of 2024 as the time when the 100K H100s get online; otherwise Grok-3 could be trained for longer. As it is, it’s going to get about 1e26 FLOPs. Since Grok-1 was MoE (unlike Llama-3-405b), this has a chance of being better than current SOTA as of Aug 2024, but by the end of 2024 there might already be Claude 3.5 Opus or a new Gemini.
Seems to be fully online as of now (Sep. 2) based on this tweet?
I now think this is false, going by the report from The Information quoted above.