‘Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a “latency wall” at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.’ https://epochai.org/blog/data-movement-bottlenecks-scaling-past-1e28-flop
The post argues that there is a latency limit at 2e31 FLOP, and I’ve found it useful to put this scale into perspective.
Current public models such as Llama 3 405B are estimated to have been trained with ~4e25 FLOP, so such a model would require ~500,000x more compute. Since Llama 3 405B was trained on 16,000 H100 GPUs, the model would require 8 billion H100-equivalents, at a cost of roughly $320 trillion at H100 pricing (or ~$100 trillion if we use B200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled out by another factor: the 2e31 limit assumes a training time of only 3 months. If we were willing to build such a system over several years and had the patience to wait an additional 3 years for the training run to complete, the latency limit would be pushed out by another order of magnitude. So at the point where we are bound by the latency limit, either we are investing a significant percentage of world GDP into the project, or we have already reached ASI at a smaller scale of compute and are using it to dramatically reduce compute costs for successor models.
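Spelling out the arithmetic behind those figures as a runnable back-of-envelope sketch (the ~$40k per-H100 unit price is my own assumption, chosen because it is what the $320 trillion figure implies):

```python
# Back-of-envelope for the comparison above. The $40k H100 unit price is an
# assumption (it is what the $320 trillion figure implies), not a quoted fact.

latency_limit_flop = 2e31      # latency wall from the Epoch post
llama3_flop        = 4e25      # estimated Llama 3 405B training compute
llama3_gpus        = 16_000    # H100s used for Llama 3 405B
h100_price_usd     = 40_000    # assumed unit price

compute_ratio   = latency_limit_flop / llama3_flop   # ~5e5
gpu_equivalents = llama3_gpus * compute_ratio        # ~8e9 H100-equivalents
hardware_cost   = gpu_equivalents * h100_price_usd   # ~$3.2e14, i.e. ~$320T

print(f"{compute_ratio:.0e}x compute, {gpu_equivalents:.0e} H100-equivalents, ${hardware_cost:.1e}")
```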
Of course, none of this analysis applies to the earlier data movement limit of 2e28 FLOP, which I think is more relevant and interesting.
An important caveat to the data movement limit:
“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
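To get a rough feel for what the quoted fit implies, here is a minimal evaluation of B = 17.75 D^0.47 at a few dataset sizes (the token counts below are arbitrary illustrative choices, not values from either paper):

```python
# Illustrative only: evaluate the batch-size scaling quoted above,
# B = 17.75 * D**0.47, with B and D both measured in tokens.

def critical_batch_tokens(dataset_tokens: float) -> float:
    return 17.75 * dataset_tokens ** 0.47

for d in (1e12, 1.5e13, 1e15):  # arbitrary example dataset sizes (tokens)
    print(f"D = {d:.1e} tokens -> B = {critical_batch_tokens(d):.1e} tokens")
```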
I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data movement wall is indeed at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW of power required) before we hit the data movement limit.
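As a rough sanity check on the terawatt-scale figure, a sketch under my own assumptions (a 3-month run, ~1e15 FLOP/s peak per H100-class GPU, ~30% utilization, ~1.2 kW per GPU including cooling and networking overhead), none of which come from the post:

```python
# Rough sanity check of the power requirement for a 3e30 FLOP run.
# Every parameter below is an assumption chosen for illustration.

total_flop         = 3e30           # the pushed-out data movement limit
run_seconds        = 90 * 86_400    # ~3-month training run
peak_flops_per_gpu = 1e15           # H100-class peak, order of magnitude
utilization        = 0.30           # assumed effective utilization
watts_per_gpu      = 1_200          # incl. cooling / networking overhead

sustained_flops = total_flop / run_seconds
gpus_needed     = sustained_flops / (peak_flops_per_gpu * utilization)
power_tw        = gpus_needed * watts_per_gpu / 1e12

print(f"~{gpus_needed:.1e} GPUs, ~{power_tw:.1f} TW")
```

This lands around the low end of the range above, i.e. terawatt-scale power either way.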
Speculatively, this might also differentially incentivize (research on generalized) inference scaling, with various potential strategic implications, including for AI safety (current inference scaling methods tend to be tied to CoT and the like, which are quite transparent) and for regulatory frameworks and the proliferation of dangerous capabilities.
Aschenbrenner, in Situational Awareness, predicts that illegible chains of thought will prevail because they are more efficient. I know of one developer claiming to do this (https://platonicresearch.com/), but I would guess there are many.
Under the assumptions here (including Chinchilla scaling laws), depth wouldn’t increase by more than about 3x before the utilization rate starts dropping (because depth would increase with an exponent of about 1/6 of the total increase in FLOP), which seems like great news for the legibility of CoT outputs and the like versus opaque reasoning in models: https://lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
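For concreteness, the rough calculation behind the ~3x figure, assuming Chinchilla-optimal parameters N ∝ C^0.5 and depth ∝ N^(1/3) at a fixed aspect ratio (so depth ∝ C^(1/6)), with the ~4e25 FLOP frontier estimate from the comment above as the starting point:

```python
# Rough check of the ~3x depth growth claim. Assumes N grows as C^0.5
# (Chinchilla) and depth grows as N^(1/3) at fixed aspect ratio,
# so depth grows as C^(1/6).

current_frontier_flop = 4e25   # ~Llama 3 405B training compute (estimate)
data_movement_limit   = 2e28   # where utilization starts falling, per the post

depth_growth = (data_movement_limit / current_frontier_flop) ** (1 / 6)
print(f"depth grows by ~{depth_growth:.1f}x")  # ~2.8x
```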
Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the Chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt
This would imply that the data movement bottleneck is a bit further out.