The question is whether she’s talking about parameter count, nominal training flops, or actual cost. In general, GPT generations so far have been roughly one order of magnitude apart in parameter count and training cost, and roughly two orders of magnitude apart in nominal training flops (parameter count x training tokens). Since she’s a CFO and that was a financial discussion, I assume she natively thinks in terms of training cost, so the ‘correct’ answer for her is one order of magnitude, not two, and my suspicion is that she’s actually talking in terms of parameter count. So I don’t think she’s warning us of anything; I think she’s just projecting a straight line on a logarithmic plot, i.e. business as usual at OpenAI.
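To make the order-of-magnitude bookkeeping concrete, here is a minimal sketch, assuming (as above) that nominal training flops scale as parameter count times training tokens, and that tokens grow roughly alongside parameters between generations:

```python
# Sketch: why ~1 OOM per generation in parameters can mean ~2 OOM in nominal
# training FLOPs, assuming nominal FLOPs ~ parameter count * training tokens
# and that training tokens also grow ~1 OOM per generation (assumption).
params_growth = 10   # ~1 OOM more parameters per generation (assumption)
tokens_growth = 10   # ~1 OOM more training tokens per generation (assumption)

flops_growth = params_growth * tokens_growth
print(f"Nominal training FLOPs grow ~{flops_growth}x (~2 OOM) per generation")
```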
Original GPT-4 is reportedly 2e25 FLOPs. A 100K-H100 cluster trains a 2e26-FLOP model in BF16 (at 30% utilization) in 2.5 months. That only costs $600-900 million (at $3-5 per H100-hour); the reported $3 billion suggests more training time. If trained for 8 months at 40% utilization, we get 8e26 FLOPs, which costs at least $1.7 billion (at $3 per H100-hour). More recent GPT-4T or GPT-4o might already have about 1e26 FLOPs in them (20K H100s can get there in 5 months), so if these later GPT-4 variants are taken as the baseline, 8e26 FLOPs could be said to be “about one order of magnitude bigger”.
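A rough sanity check on these figures, as a sketch assuming ~1e15 FLOP/s of BF16 peak per H100 and ~730 hours per month; the ~40% utilization for the 20K-H100 case is my assumption, not stated above:

```python
# Back-of-the-envelope compute and rental-cost estimates for the figures above.
H100_BF16_PEAK = 1e15      # FLOP/s per H100 at BF16 peak (approximate, assumption)
HOURS_PER_MONTH = 730

def train_flops(gpus, months, utilization):
    """Total training FLOPs for a cluster over a given duration."""
    return gpus * H100_BF16_PEAK * utilization * months * HOURS_PER_MONTH * 3600

def rental_cost(gpus, months, dollars_per_hour):
    """GPU-hours times an hourly rate."""
    return gpus * months * HOURS_PER_MONTH * dollars_per_hour

# 100K H100s, 2.5 months, 30% utilization -> ~2e26 FLOPs, ~$0.55-0.9B
print(f"{train_flops(100_000, 2.5, 0.30):.1e} FLOPs, "
      f"${rental_cost(100_000, 2.5, 3)/1e9:.2f}-{rental_cost(100_000, 2.5, 5)/1e9:.2f}B")

# 100K H100s, 8 months, 40% utilization -> ~8e26 FLOPs, ~$1.75B at $3/hour
print(f"{train_flops(100_000, 8, 0.40):.1e} FLOPs, "
      f"${rental_cost(100_000, 8, 3)/1e9:.2f}B")

# 20K H100s, 5 months, ~40% utilization (assumption) -> ~1e26 FLOPs
print(f"{train_flops(20_000, 5, 0.40):.1e} FLOPs")
```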
nominal training flops (parameter count x training tokens)
Times 6, and it’s the active parameter count: a MoE model can be much bigger in total parameters without affecting the training FLOPs. So with original GPT-4 at maybe 270B active parameters and 1.8T total parameters, it’s the 270B that enters the training FLOPs estimate (from 2e25 FLOPs, we get an estimate of about 12T tokens for its training dataset size).
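The same numbers as a minimal calculation, a sketch using the standard ~6 × active parameters × tokens approximation; the 270B-active and 2e25-FLOP figures are the rumored/reported values quoted above:

```python
# Back out the training dataset size from the ~6*N*D approximation,
# counting only active parameters (MoE total parameters don't enter).
FLOPS_PER_PARAM_TOKEN = 6     # ~6 FLOPs per active parameter per training token
active_params = 270e9         # rumored original-GPT-4 active parameters (per the estimate above)
total_train_flops = 2e25      # reported original-GPT-4 training compute

tokens = total_train_flops / (FLOPS_PER_PARAM_TOKEN * active_params)
print(f"~{tokens/1e12:.0f}T training tokens")   # -> ~12T
```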