On priors I think that Google Deepmind is currently running the biggest training run.
Google has very large datacenters, if measured in megawatts, but they are filled with older TPUs. Maybe those are fine compared to H100s on a FLOP/joule basis though? In BF16, going from A100 (0.3e15 FLOP/s, 400W) to H100 (1e15 FLOP/s, 700W) to B100 (1.8e15 FLOP/s, 700W) notably improves FLOP/joule, but TDP is not disclosed for recent TPUs (and the corresponding share of the rest of the datacenter's power draw needs to be taken into account, which turns the 700W of an H100 into about 1500W). In terms of FLOP/s per chip, only the latest TPU generation, announced in May 2024, matches H100s, and it might take time to install enough of them.
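A minimal sketch of that FLOP/joule arithmetic, using the dense BF16 figures quoted above and treating the ~2.1x chip-to-facility overhead (700W of an H100 becoming roughly 1500W of total datacenter power) as an illustrative assumption rather than a measured value:

```python
# Back-of-the-envelope FLOP/joule comparison using the dense BF16 numbers above.
# The facility overhead factor is an assumption for illustration (it maps 700W
# of H100 TDP to ~1500W of total datacenter power); real overheads vary by site.

chips = {
    "A100": {"flops": 0.3e15, "tdp_w": 400},
    "H100": {"flops": 1.0e15, "tdp_w": 700},
    "B100": {"flops": 1.8e15, "tdp_w": 700},
}

FACILITY_OVERHEAD = 1500 / 700  # assumed chip-TDP -> total-facility-power multiplier

for name, spec in chips.items():
    chip_eff = spec["flops"] / spec["tdp_w"]                        # FLOP per joule at the chip
    facility_eff = spec["flops"] / (spec["tdp_w"] * FACILITY_OVERHEAD)  # FLOP per joule at the facility
    print(f"{name}: {chip_eff:.2e} FLOP/J (chip), {facility_eff:.2e} FLOP/J (facility)")
```

On these assumed numbers an H100 comes out around 1.4e12 FLOP/J at the chip and roughly half that at the facility level; the point is only the relative ordering A100 < H100 < B100, not the absolute values.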
They seem to have big plans for next year, but possibly they are not yet quite ready to be significantly ahead of 100K-H100 clusters.
Thanks, that updates me. I’ve been enjoying your well-informed comments on big training runs, thank you!