This is not what I took from those papers. The scaling laws paper has a figure showing that if you hold data fixed and increase model size, performance improves. By contrast, “you need to show more data points to the larger models” would predict that performance would degrade, because a larger model’s needs would not be met.
Rather, what’s going on is that, at the optimal compute allocation, larger models get shown more data points. The way to maximize performance with a given increase in compute is to allocate a bit more than half of the increased compute to increased model size, and the remainder to increased data.
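To make that split concrete, here is a rough sketch of the arithmetic. It assumes the common approximation that training compute is about 6 × parameters × tokens, and it treats "a bit more than half" as a share of the compute increase in log space. The function names and the `model_share=0.6` default are hypothetical placeholders for illustration, not numbers taken from the papers.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming FLOPs ~ 6 * params * tokens."""
    return 6.0 * n_params * n_tokens


def split_compute_increase(n_params, n_tokens, compute_multiplier, model_share=0.6):
    """Scale model size and data for a compute_multiplier-fold compute increase.

    model_share is the fraction (in log space) of the extra compute that goes
    to model size; the rest goes to data. The 0.6 default is a placeholder
    standing in for "a bit more than half", not a fitted value.
    """
    if not 0.0 <= model_share <= 1.0:
        raise ValueError("model_share must be between 0 and 1")
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    new_params = n_params * compute_multiplier ** model_share
    new_tokens = n_tokens * compute_multiplier ** (1.0 - model_share)
    return new_params, new_tokens


if __name__ == "__main__":
    # Hypothetical starting point: 1B parameters trained on 20B tokens.
    params, tokens = 1e9, 2e10
    new_params, new_tokens = split_compute_increase(params, tokens, 100.0)
    print(f"params grow {new_params / params:.1f}x, tokens grow {new_tokens / tokens:.1f}x")
    print(f"compute grows {training_flops(new_params, new_tokens) / training_flops(params, tokens):.1f}x")
```

Under these assumptions both model size and data grow when compute grows, with model size growing faster. That is the sense in which larger models "get shown more data points" without needing them in order to improve.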
That said, I think figure 4 still overestimates the gains we should expect from increased compute, but for a different reason: the small models in that figure were given “too much data” (they were all given 300B tokens, IIRC) and thus represent inefficient uses of compute. The same amount of compute would have led to more performance if they had increased model size a bit and decreased data. So the “true slope” of the line (the slope it would have if compute had been used optimally, which is what we want to extrapolate) would be slightly smaller.
Thank you both for correcting me; I have removed that section from the post.