It’s widely believed that OpenAI trained GPT-4 on about 10,000 A100s.
What I can find is 20,000 A100s. With 10K A100s, which peak at about 312e12 FLOP/s in BF16, you’d need about 6 months at 40% utilization to get the rumored 2e25 FLOPs (so this is still plausible). We know Llama-3-405B is 4e25 FLOPs and approximately as smart, and it’s dense; a MoE model can get away with fewer FLOPs for similar capabilities, which supports the 2e25 FLOPs figure given the premise that the original GPT-4 is MoE.
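The arithmetic behind that 6-month figure can be sketched as follows. The inputs are assumptions from the estimate above: 10K A100s at the A100's 312e12 BF16 peak, 40% utilization, and the rumored 2e25 total FLOPs.

```python
# Ballpark check of the GPT-4 training-compute estimate above.
n_gpus = 10_000
peak_flops = 312e12      # A100 BF16 dense peak, FLOP/s
utilization = 0.40       # assumed model FLOPs utilization
target_flops = 2e25      # rumored GPT-4 training compute

seconds = target_flops / (n_gpus * peak_flops * utilization)
months = seconds / (86_400 * 30)
print(f"{months:.1f} months")  # → 6.2 months
```

So 10K A100s for roughly half a year lands right on the rumored figure.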
The average H100 has 80 GB of VRAM
H200s are 141 GB, and there are now MI300Xs with 192 GB. B200s will also have 192 GB.
assuming that each parameter is 32 bits
Training is typically done in BF16, though you need room for gradients in addition to parameters (and for optimizer states, which ZeRO can shard across devices). On the other hand, inference with 8-bit quantization is essentially indistinguishable from full precision.
Recently though, Microsoft and Meta have both moved to acquire more GPUs that put them in the 100,000 range
The word is, next year it’s 500K B200s[1] for Microsoft. And something in the gigawatt range from Google as well.
He says 500K GB200s, but also that it’s 1 gigawatt all told, and that they are 2-3x faster than H100s, so I believe he means 500K B200s. In various places, “GB200” seems to ambiguously refer either to a 2-GPU board with a Grace CPU, or to one of the B200s on such a board.
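A quick power sanity check supports reading it as B200s. Assumptions here: ~1,000 W TDP per B200 (Nvidia's stated figure for the air-cooled part) and a rough all-in facility overhead factor of ~1.8x covering PUE, CPUs, and networking.

```python
# Does 500K B200s line up with "1 gigawatt all told"?
n_gpus = 500_000
gpu_watts = 1_000          # assumed B200 TDP
overhead = 1.8             # assumed all-in facility overhead factor

total_gw = n_gpus * gpu_watts * overhead / 1e9
print(f"{total_gw:.2f} GW")  # → 0.90 GW
```

500K GB200 superchips (two B200s each, plus Grace CPUs) would land closer to 2 GW, so the 1 GW figure fits the B200 reading better.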
Thanks for the clarifications. My naive estimate is obviously just a simplistic ballpark figure using some rough approximations, so I appreciate adding some precision.