Llama-3-405B is an important anchor for the compute of other models. With 4e25 FLOPs and conservative training techniques it’s about as capable as they are, so the other models probably don’t use much more. If they have better techniques, they need less compute to get similar performance, not more. And they probably didn’t train for more than 6 months. At $2 per H100-hour[1], $3 billion buys 6 months of time on 300K H100s. There are no publicly known training systems this large; the first 100K H100 systems only started appearing in the later part of this year. Thus the training cost figures must include smaller experiments that in aggregate eat more compute than the largest training runs, run on the now-ubiquitous smaller clusters that are also used for inference.
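As a rough check of the $3 billion figure, here is a minimal sketch of the arithmetic. The 730-hour month (8760 / 12) and the function name are assumptions made for illustration; the price, budget, and duration come from the text.

```python
# Back-of-the-envelope: how many H100s does a budget rent for a given duration?
# Assumption (not from the text): one month is 730 hours (8760 hours / 12).
HOURS_PER_MONTH = 730

def gpus_for_budget(budget_usd, months, price_per_gpu_hour):
    """Number of GPUs that can be rented continuously for `months` on `budget_usd`."""
    gpu_hours = budget_usd / price_per_gpu_hour
    return gpu_hours / (months * HOURS_PER_MONTH)

# Figures from the text: $3 billion, 6 months, $2 per H100-hour.
gpus = gpus_for_budget(budget_usd=3e9, months=6, price_per_gpu_hour=2.0)
print(f"{gpus:,.0f} H100s")
```

Under these assumptions the result lands somewhat above 300K H100s, consistent with the rounded figure, and well beyond the 100K H100 systems mentioned above.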
So anchoring to the total number of GPUs is misleading about frontier model training, because most GPUs are used for inference and smaller experiments, and the above estimate shows that figures like $3 billion for training are also poor anchors. If instead we take 20K H100s as the typical scale of the largest clusters from mid 2023 to early 2024, and 4 months as a typical duration of frontier model training, we get $120 million at $2 per H100-hour, or 8e25 dense BF16 FLOPs at 40% compute utilization, only about 2x Llama-3-405B compute. This agrees with Dario Amodei’s claim that as of Jun 2024 the scale of deployed models is about $100 million.
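The $120 million and 8e25 FLOPs figures can be reproduced with a short sketch. The per-GPU peak of about 989 TFLOP/s dense BF16 (the commonly quoted H100 SXM spec) and the 730-hour month are assumptions not stated in the text; the other inputs are taken from it.

```python
# Back-of-the-envelope cost and compute for a 20K H100 run lasting 4 months.
# Assumptions (not from the text):
#   - one month is 730 hours (8760 hours / 12)
#   - H100 peak dense BF16 throughput is ~989e12 FLOP/s (commonly quoted SXM spec)
HOURS_PER_MONTH = 730
H100_PEAK_BF16_FLOPS = 989e12

# Inputs from the text.
NUM_GPUS = 20_000
MONTHS = 4
PRICE_PER_GPU_HOUR = 2.0   # USD
UTILIZATION = 0.40         # fraction of peak actually achieved
LLAMA_3_405B_FLOPS = 4e25

hours = MONTHS * HOURS_PER_MONTH
cost_usd = NUM_GPUS * hours * PRICE_PER_GPU_HOUR
train_flops = NUM_GPUS * H100_PEAK_BF16_FLOPS * UTILIZATION * hours * 3600
ratio = train_flops / LLAMA_3_405B_FLOPS

print(f"cost: ${cost_usd / 1e6:.0f} million")
print(f"compute: {train_flops:.1e} FLOPs ({ratio:.1f}x Llama-3-405B)")
```

With these inputs the cost comes out a little under $120 million and the compute a little over 8e25 FLOPs, about 2x Llama-3-405B. Changing the utilization or the length of a month shifts the result by a few percent but not the order of magnitude.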
For what it’s worth, training the largest models requires building the training system yourself, which makes the market price of renting fewer GPUs from much smaller clusters not that relevant.