This point is semi-correct now, but mostly incorrect for future systems.
A larger model learns faster per data point, which is increasingly important as we move towards AGI. If you want a system that has mostly memorized the internet, then sure, overtraining a small model now makes sense. If you want a system that can rapidly and continuously transfer-learn from minimal amounts of new data, to compete with smart humans, then you probably want something far larger than even the naive[1] Chinchilla optimum.
[1] Naive in the sense that it optimizes only for the total compute cost of training, ignoring future downstream data efficiency.
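For concreteness, here is a minimal sketch of what that naive objective looks like: minimize the fitted Chinchilla loss L(N, D) = E + A/N^α + B/D^β at a fixed training budget C ≈ 6ND FLOPs. The constants are the Hoffmann et al. (2022) parametric fits, and the grid search is purely illustrative; nothing here is anyone's production recipe.

```python
# Sketch of the "naive" Chinchilla optimum: choose (N, D) to minimize the
# fitted pretraining loss at a fixed training budget C ~ 6*N*D FLOPs.
# Constants are the Hoffmann et al. (2022) fits; treat them as illustrative.
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Fitted Chinchilla loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def naive_optimum(C, grid=2000):
    """Grid-search the loss-minimizing (N, D) for a fixed compute budget C."""
    N = np.logspace(7, 13, grid)   # candidate model sizes (params)
    D = C / (6 * N)                # tokens implied by the budget C ~ 6*N*D
    i = np.argmin(loss(N, D))
    return N[i], D[i]

for C in [1e21, 1e23, 1e25]:
    N, D = naive_optimum(C)
    print(f"C={C:.0e}: N~{N:.2e} params, D~{D:.2e} tokens, D/N~{D/N:.0f}")
```

Note what the objective contains: only pretraining loss per unit of training compute. Nothing in it rewards a model for learning quickly from small amounts of new data later, which is exactly the term the argument above says will dominate for future systems.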