Original GPT-4 is reportedly 2e25 FLOPs. A 100K H100s cluster trains a 2e26 BF16 FLOPs model (at 30% utilization) in 2.5 months. That only costs $600-900 million (at $3-5 per H100-hour); the reported $3 billion suggests more training time. If trained for 8 months at 40% utilization, we get 8e26 FLOPs, which costs at least $1.7 billion (at $3 per H100-hour). More recent GPT-4T or GPT-4o might already have about 1e26 FLOPs in them (20K H100s can get there in 5 months), so if these later GPT-4 variants are taken as the baseline, 8e26 FLOPs could be said to be “about one order of magnitude bigger”.
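For concreteness, here is the arithmetic behind these figures as a small Python sketch, assuming roughly 1e15 dense BF16 FLOP/s per H100 and 730 hours per month (both round assumptions on my part, not quoted specs):

```python
# Back-of-the-envelope compute and cost estimates for a large training run.
# Assumption: ~1e15 dense BF16 FLOP/s per H100 (rounded), 730 hours per month.

H100_BF16_FLOPS = 1e15
HOURS_PER_MONTH = 730

def training_flops(num_gpus: int, months: float, utilization: float) -> float:
    """Total FLOPs delivered by the cluster over the run."""
    seconds = months * HOURS_PER_MONTH * 3600
    return num_gpus * H100_BF16_FLOPS * utilization * seconds

def training_cost(num_gpus: int, months: float, price_per_gpu_hour: float) -> float:
    """Rental cost of the run in dollars."""
    return num_gpus * months * HOURS_PER_MONTH * price_per_gpu_hour

# 100K H100s, 2.5 months, 30% utilization -> ~2e26 FLOPs, ~$0.55-0.9B at $3-5/hr
print(f"{training_flops(100_000, 2.5, 0.30):.1e} FLOPs")
print(f"${training_cost(100_000, 2.5, 3)/1e9:.2f}B - ${training_cost(100_000, 2.5, 5)/1e9:.2f}B")

# 100K H100s, 8 months, 40% utilization -> ~8e26 FLOPs, ~$1.75B at $3/hr
print(f"{training_flops(100_000, 8, 0.40):.1e} FLOPs")
print(f"${training_cost(100_000, 8, 3)/1e9:.2f}B")
```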
nominal training FLOPs (parameter count x training tokens)
Times 6, and it’s the active parameter count: a MoE model can be much bigger without affecting the training FLOPs. So with original GPT-4 at maybe 270B active parameters and 1.8T total parameters, it’s the 270B that enters the training FLOPs estimate (from 2e25 FLOPs, we get an estimate of about 12T tokens for its training dataset).
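The 6ND rule of thumb can be inverted to back out the dataset size; a minimal sketch, assuming the ~270B active parameters and ~2e25 FLOPs figures above (both public estimates, not confirmed numbers):

```python
# Invert FLOPs ~= 6 * N_active * D to estimate training tokens D.
# Assumption: ~270B active parameters and ~2e25 training FLOPs for original GPT-4.

def tokens_from_flops(total_flops: float, active_params: float) -> float:
    """Estimate training tokens from total FLOPs and active parameter count."""
    return total_flops / (6 * active_params)

tokens = tokens_from_flops(2e25, 270e9)
print(f"~{tokens / 1e12:.0f}T training tokens")  # ~12T
```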