The Memphis datacenter might be operational in some form, but the 100K H100s cluster is not, and I was responding to elifland’s specific claim about “a 100k H100 cluster that was on track to be finished in July”. The point is that scale beyond what you can get from AWS is not going to be available for some time. This is a point journalists repeatedly got wrong: what is claimed is that something became operational in July and that the datacenter is planned to house 100K H100s, but it doesn’t follow that 100K H100s were operational in July.
By analogy with Llama-3-405b, Grok-2 must have started training no later than Mar-Apr 2024 (it needs to finish pre-training and then go through RLHF), so it wasn’t trained using the Memphis datacenter. And in its current state, the Memphis datacenter won’t significantly improve on that scale; the bulk of the improvement would need to come from training for more months. If, by the end of 2024, both the 100K H100s and the 150-megawatt substation are ready, then xAI will start to catch up with OpenAI, which might already have been training at that scale since May.
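As a rough illustration of why run length matters as much as cluster size, here is a minimal back-of-envelope sketch; the peak throughput and utilization figures below are my assumptions (rounded dense BF16 throughput for an H100 and a typical MFU), not reported numbers.

```python
# Back-of-envelope: training compute = chips * peak FLOP/s * utilization * seconds.
# All inputs are illustrative assumptions, not reported figures.

H100_BF16_PEAK = 1e15  # ~1e15 dense BF16 FLOP/s per H100, rounded
MFU = 0.4              # assumed model FLOPs utilization

def training_flops(n_chips: int, days: float, mfu: float = MFU) -> float:
    """Total FLOPs for n_chips GPUs running for `days` days."""
    return n_chips * H100_BF16_PEAK * mfu * days * 86_400

print(f"{training_flops(30_000, 90):.1e}")   # ~9.3e25: 30K chips, 3 months
print(f"{training_flops(30_000, 180):.1e}")  # ~1.9e26: same chips, twice the months
print(f"{training_flops(100_000, 90):.1e}")  # ~3.1e26: the full planned cluster
```

Doubling the run length buys as much compute as doubling the cluster, which is why the remaining months of 2024 matter.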
Mobile generators are in use that could power at least a large fraction of the H100s. See discussion here.
So Grok-3 is probably using these 30K H100s instead of rented compute like Grok-2. This seems to be a wash in terms of scale; it’s more a way of keeping the 30K H100s busy and gaining experience for the subsequent 100K run. Targeting end of 2024 for the Grok-3 release means it finishes pre-training in late 2024, maybe Oct-Nov 2024 (leaving some time for RLHF before the end of 2024), so this is some evidence that the 100K H100s come online around the end of 2024; otherwise Grok-3 could be trained for longer. As it is, it’s going to get about 1e26 FLOPs. Since Grok-1 was MoE (unlike Llama-3-405b), this has a chance of being better than the current SOTA as of Aug 2024, but by the end of 2024 there might already be a Claude 3.5 Opus or a new Gemini.
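The ~1e26 FLOPs figure follows from the same back-of-envelope as above, under my assumed utilization and run length: 30,000 H100s × ~1e15 dense BF16 FLOP/s × ~0.4 MFU ≈ 1.2e19 FLOP/s, and over roughly 100 days (≈8.6e6 seconds) that comes to ~1e26 FLOPs.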