Base model scale has only increased maybe 3-5x in the last 2 years, from 2e25 FLOPs (original GPT-4) up to maybe 1e26 FLOPs[1]. So I think to a significant extent the experiment of further scaling hasn’t been run, and the 100K-H100 clusters that have just started training new models in the last few months promise another 3-5x increase in scale, to 2e26-6e26 FLOPs.
possibly have already plateaued a year or so ago
Right, the metrics don’t quite capture how smart a model is, and the models haven’t been getting much smarter for a while now. But it might be simply because they weren’t scaled much further (compared to original GPT-4) in all this time. We’ll see in the next few months as the labs deploy the models trained on 100K H100s (and whatever systems Google has).
[1] 1e26 FLOPs is about 3 months on 30K H100s, or $140 million at $2 per H100-hour, which is plausible, but not something rumored about any specific model. Llama-3-405B is 4e25 FLOPs, but it's not MoE. It could well be that 6e25 FLOPs is the most anyone has trained with among the models deployed so far.
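The cluster arithmetic above can be sanity-checked with a rough sketch. The per-GPU throughput (989e12 dense BF16 FLOP/s, the public H100 SXM spec), the 40% utilization, and reading "3 months" as 90 days are my assumptions, not figures from any lab; the function names are just illustrative.

```python
SECONDS_PER_DAY = 86_400


def training_flops(num_gpus, days, peak_flops_per_gpu=989e12, utilization=0.4):
    """Total training compute for a run, in FLOPs.

    peak_flops_per_gpu: assumed dense BF16 peak for one H100 (FLOP/s).
    utilization: assumed fraction of peak actually achieved during training.
    """
    return num_gpus * days * SECONDS_PER_DAY * peak_flops_per_gpu * utilization


def training_cost_usd(num_gpus, days, usd_per_gpu_hour=2.0):
    """Rental-style cost of the run at a flat price per GPU-hour."""
    return num_gpus * days * 24 * usd_per_gpu_hour


if __name__ == "__main__":
    for gpus in (30_000, 100_000):
        flops = training_flops(gpus, days=90)
        cost = training_cost_usd(gpus, days=90)
        print(f"{gpus} H100s, 90 days: {flops:.1e} FLOPs, ${cost / 1e6:.0f}M")
```

Under these assumptions 30K H100s land a bit under 1e26 FLOPs and 100K H100s land inside the 2e26-6e26 range. The cost for 90 days comes out slightly below $140 million, which would correspond to a run closer to 97 days at the same price.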