“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.
An important caveat to the data movement limit:
“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.