Claim 3: There may be fundamental “scaling laws” governing the performance AI systems can achieve as a function of data and computational resources.
I’m personally pretty sympathetic to the idea that there are indeed metrics through which model progress is continuous (both as a function of scale and over the course of training).
That being said: smooth performance along one metric doesn’t necessarily imply smooth downstream performance! (E.g., in your “SGD learns parity close to the computational limit” paper, even though there are smooth progress measures for how small neural networks learn parity, that does not explain away the sharp increase in accuracy. See also the results from the modular addition task.)
In particular, it’s empirically true that smooth progress in log loss does not necessarily imply smooth progress on downstream performance. For example, in both the BIG-Bench and Wei et al.’s “Emergent Abilities of Large Language Models” papers, we see that smooth improvement in cross-entropy loss does not imply continuous, smooth progress in terms of error rate. And though GPT-3 follows the same log-loss scaling curve as GPT-2, I’m not sure anyone would have predicted the suite of new abilities that would arise alongside the decrease in log loss.
(It also doesn’t rule out the existence of better scaffolding or prompting techniques like Chain-of-Thought, which can both significantly improve downstream performance and even change the shape of scaling curves, without additional training.)
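To make that concrete, here’s a toy numerical sketch (not taken from either paper; the loss curve, answer length, and constants are all made up for illustration) of how a smoothly falling per-token cross-entropy can coexist with a sharp-looking jump in exact-match accuracy:

```python
# Toy illustration: per-token cross-entropy falls smoothly with "scale",
# but the downstream metric (exact match over a k-token answer) jumps sharply.
# Both the loss curve and the answer length k are hypothetical choices.
import numpy as np

log_compute = np.linspace(0, 6, 50)                       # hypothetical scale axis
per_token_ce = 2.0 * np.exp(-1.0 * log_compute) + 0.005   # smooth, monotone loss curve
p_correct = np.exp(-per_token_ce)                         # per-token prob. of the right token
k = 20                                                    # answer length for exact-match scoring
exact_match = p_correct ** k                              # downstream "emergent" metric

for lc, ce, em in zip(log_compute[::10], per_token_ce[::10], exact_match[::10]):
    print(f"scale={lc:4.1f}  per-token CE={ce:5.3f}  exact match={em:7.5f}")
```

The loss column improves gradually the whole way, while the exact-match column sits near zero for most of the range and then climbs quickly, simply because the downstream metric compounds per-token errors over the length of the answer.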
It is indeed the case that we sometimes see phase transitions / discontinuous improvements, and this is an area I am very interested in. Note, however, that in graphs such as those in BIG-Bench (though not in our paper), the x-axis is typically something like the log of the number of parameters. So it does seem you pay quite a price to achieve the improvement.
The claim there is not so much about the shape of the laws as about potential (though, as you say, not at all certain) limits on what improvements you can achieve through software alone, without investing more compute and/or data. Some other (very rough) cost calculations are attempted in my previous blog post.
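For a rough sense of what “paying quite a price” can look like, here is a back-of-the-envelope sketch (this is not the calculation from the blog post; the power-law form and every constant below are hypothetical placeholders) of how much extra compute a fixed further drop in loss costs if loss follows a simple power law in compute:

```python
# Back-of-the-envelope sketch: if loss follows L(C) = E + a * C**(-alpha),
# each further fixed reduction in loss costs a multiplicatively larger amount
# of compute. The exponent, constants, and reference budget are made up.
def loss(compute, irreducible=1.7, a=10.0, alpha=0.05):
    """Hypothetical power-law loss as a function of compute."""
    return irreducible + a * compute ** (-alpha)

def compute_needed(target_loss, irreducible=1.7, a=10.0, alpha=0.05):
    """Invert the power law: compute required to reach a given loss."""
    return (a / (target_loss - irreducible)) ** (1.0 / alpha)

base = 1e21  # purely illustrative reference budget (FLOPs)
for delta in [0.0, 0.1, 0.2, 0.3]:
    target = loss(base) - delta
    needed = compute_needed(target)
    print(f"loss {target:.3f} -> {needed:.2e} FLOPs ({needed / base:,.0f}x the reference budget)")
```

With these made-up numbers, shaving 0.1 off the loss costs roughly an order of magnitude more compute, and each further 0.1 costs more than another order of magnitude on top of that.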
Yeah, I agree that a lot of the “phase transitions” look more discontinuous than they actually are due to the log on the x axis — the OG grokking paper definitely commits this sin, for example.
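As a minimal illustration of the log-axis effect (the accuracy curve and its constants below are invented, not taken from the grokking paper or BIG-Bench):

```python
# Minimal sketch: the same hypothetical accuracy curve, sampled at log-spaced
# parameter counts, looks like a sharp step near 1e9 parameters, even though
# each step to the right means a ~10x increase in parameters (and roughly compute).
import numpy as np

def accuracy(n_params):
    """Hypothetical downstream accuracy as a function of parameter count."""
    return 1.0 / (1.0 + (1e9 / n_params) ** 2)

for n in np.logspace(6, 12, num=7):  # 1e6 ... 1e12 parameters
    print(f"params={n:9.0e}  accuracy={accuracy(n):.3f}")
```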
(I think there’s also another disagreement here about how close humans are to this natural limit.)