But once you’ve done that, the training run itself is still, it seems, in the low nine-figure range, for 3.8x10^25 FLOP, less than the 10^26 threshold in the executive order or SB 1047.
On Llama-3.1-405B pretraining costs:
Even a bit less than that: the high eight-figure range, under 100 million dollars.
The model card (https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) says that the total for all three models (8B, 70B, and 405B) is 39.3 million H100 GPU-hours, with the 405B model alone accounting for 30.84 million GPU-hours; at plausible H100 rental rates, that's clearly less than 100 million dollars.
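As a rough sanity check of that cost claim (the per-GPU-hour rates below are illustrative assumptions, not figures from the model card), the 405B run lands in the high eight figures even at fairly pessimistic prices:

```python
# Napkin cost estimate for Llama-3.1-405B pre-training.
# The GPU-hour figure is from the model card; the $/GPU-hour rates
# are assumptions for illustration, not numbers from Meta.
gpu_hours_405b = 30.84e6  # H100 GPU-hours for the 405B model alone

for rate in (2.0, 3.0):   # assumed $/H100-hour
    cost_millions = gpu_hours_405b * rate / 1e6
    print(f"at ${rate:.0f}/GPU-hour: ~${cost_millions:.0f}M")
# at $2/GPU-hour: ~$62M; at $3/GPU-hour: ~$93M -- high eight figures either way
```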
Interestingly, the GPU utilization seems low, slightly over 25%, if one believes this “on a napkin” computation I asked GPT-4o to assist me with: https://chatgpt.com/share/470b8e20-d99f-48b2-be5e-61d7524892df
It estimates that, at full utilization, the 3.8x10^25 FLOP could be extracted from H100s in approximately 10.6 million GPU-hours (slightly more than a quarter of the total 39.3 million hours for all three models).
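A minimal sketch of that napkin arithmetic, assuming the H100's dense BF16 peak of roughly 989 teraFLOP/s (the exact peak you assume, and whether you divide by the all-models or the 405B-only hour total, moves the implied utilization figure around):

```python
# Reproduce the napkin estimate: how many H100-hours would 3.8e25 FLOP
# take at full BF16 utilization, and what utilization does that imply?
total_flop = 3.8e25        # reported pre-training compute for the 405B
h100_bf16_peak = 989e12    # assumed dense BF16 peak, FLOP/s per H100

ideal_hours = total_flop / h100_bf16_peak / 3600
print(f"ideal GPU-hours: {ideal_hours / 1e6:.1f}M")                       # ~10.7M

print(f"vs 39.3M hours (all three models): {ideal_hours / 39.3e6:.0%}")   # ~27%
print(f"vs 30.84M hours (405B alone):      {ideal_hours / 30.84e6:.0%}")  # ~35%
```

Dividing by the 405B-only hours instead of the all-models total gives roughly 35%, closer to the utilization reported for the main pre-training phase.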
The report puts utilization in the main pre-training phase of the 405B at roughly 40% (on 16,384 GPUs). The main low-hanging fruit is that they train in BF16 on H100s, so they only get about 400 teraFLOP/s per GPU where FP8 could give about 800. But given that their tokens-per-parameter estimation experiments were done with 4,000x less compute than the 405B model itself (they arrive at about 40 tokens/parameter, similar to Imbue's CARBS and unlike Chinchilla's 20), they seem to have been in a hurry, and using FP8 would have been too risky.
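Both numbers check out on a napkin under the standard C ≈ 6·N·D approximation for transformer training compute (the rule of thumb and the peak-throughput figures below are assumptions for illustration, not Meta's exact accounting):

```python
# Check the tokens-per-parameter figure against the reported compute
# using the standard approximation C ~= 6 * N * D.
n_params = 405e9
total_flop = 3.8e25

tokens = total_flop / (6 * n_params)
print(f"implied training tokens: {tokens / 1e12:.1f}T")    # ~15.6T
print(f"tokens per parameter:    {tokens / n_params:.0f}")  # ~39, i.e. roughly 40

# Achieved throughput at ~40% utilization, for BF16 vs FP8
# (assumed H100 dense peaks: ~989 and ~1979 teraFLOP/s).
for name, peak in (("BF16", 989e12), ("FP8", 1979e12)):
    print(f"{name}: ~{0.40 * peak / 1e12:.0f} teraFLOP/s at 40% utilization")
# BF16: ~396 ("only use 400"); FP8: ~792 ("could get 800")
```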