The report puts utilization in the main pre-training phase of the 405B at 40% (they use 16384 GPUs). The main low-hanging fruit is that they are using BF16 on H100s, so they only get about 400 teraFLOP/s per GPU where FP8 could roughly double that to 800. But given that their tokens-per-parameter estimation experiments were done with 4000x less compute than the 405B model itself (they get 40 tokens/parameter, similar to Imbue's CARBS, and unlike Chinchilla's 20), they seem to have been in a hurry, so using FP8 would've been too risky.
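To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch of these figures; the H100 peak-FLOP numbers and the 6ND compute estimate are my assumptions, not taken from the report:

```python
# Rough arithmetic behind the claims above. Peak-throughput figures
# for the H100 are assumptions, not values from the report.

BF16_PEAK = 989e12   # assumed H100 dense BF16 peak, FLOP/s
FP8_PEAK  = 1979e12  # assumed H100 dense FP8 peak, FLOP/s (~2x BF16)

utilization = 0.40
achieved_bf16 = utilization * BF16_PEAK  # ~4e14 = ~400 teraFLOP/s per GPU
achieved_fp8  = utilization * FP8_PEAK   # ~8e14, if FP8 kept the same utilization

params = 405e9
tokens = 40 * params                     # 40 tokens/parameter -> ~16e12 tokens
train_flop = 6 * params * tokens         # standard 6ND estimate, ~3.9e25 FLOP

scaling_exp_flop = train_flop / 4000     # ~1e22 FLOP for the scaling-law runs

gpus = 16384
days = train_flop / (gpus * achieved_bf16) / 86400  # ~70 days of wall-clock

print(f"train FLOP ~{train_flop:.1e}, "
      f"scaling-exp FLOP ~{scaling_exp_flop:.1e}, "
      f"~{days:.0f} days at 40% BF16 utilization")
```

At these assumed numbers, 40% of BF16 peak lands right around the 400 teraFLOP/s figure, and dividing the ~4e25 FLOP training run by 4000 puts the scaling-law experiments near 1e22 FLOP.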