From the Rough Notes section of Ajeya's shared scenario: "Meta and Microsoft ordered 150K GPUs each; big H100 backlog. According to Lennart's BOTECs, 50,000 H100s would train a model the size of Gemini in around a month (assuming 50% utilization)."
Just to check my understanding, here’s my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.
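For transparency, here is that same arithmetic as a minimal Python sketch; the 1e15 bf16 FLOP/s per GPU and the 50% utilization are the assumed values from above, not measured figures:

```python
# BOTEC: total training FLOPs for 50k H100s over one month.
num_gpus = 5e4                    # 50,000 H100s
peak_flops_per_gpu = 1e15         # assumed bf16 dense throughput per H100, FLOP/s
utilization = 0.5                 # assumed 50% utilization
seconds_per_month = 3600 * 24 * 30

total_flops = num_gpus * peak_flops_per_gpu * utilization * seconds_per_month
print(f"{total_flops:.3e} FLOPs")  # -> 6.480e+25 FLOPs
```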
This is indeed close to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (the doc itself cites an Epoch estimate of around 9e25 FLOPs). ETA: see the clarification in Eli's reply below.
I'm also curious whether we have any info about the floating-point format used for these training runs: how confident are we that labs are using bf16 rather than fp8?
FYI, at the time that doc was created, Epoch's estimate was 9e25. Now their notebook says 7.7e25, but their webpage says 5e25. Will ask them about it.