It is indeed the case that we sometimes see phase transitions / discontinuous improvements, and this is an area I am very interested in. Note, however, that in graphs such as those in BIG-Bench (though not in our paper), the x-axis is typically something like the log of the number of parameters, so each step along it is a multiplicative increase in model size. So it does seem you pay quite a price to achieve improvement.
The claim there is not so much about the shape of the laws but rather about potential (though, as you say, not at all certain) limits on the improvements you can achieve through software alone, without investing more compute and/or data. Some other (very rough) cost calculations are attempted in my previous blog post.
Yeah, I agree that a lot of the “phase transitions” look more discontinuous than they actually are because of the log scale on the x-axis. The OG grokking paper definitely commits this sin, for example.
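To put a rough number on the log-axis point, here is a minimal sketch. The function name `transition_span` and the numbers are purely illustrative and not taken from any paper. It converts the width of a transition as it appears on a log10 x-axis into actual parameter counts.

```python
import math
from typing import Tuple


def transition_span(center_params: float, width_decades: float) -> Tuple[float, float]:
    """Return (start, end) parameter counts for a transition that occupies
    `width_decades` units on a log10 x-axis, centered at `center_params`.

    Hypothetical helper for illustration only.
    """
    half = width_decades / 2
    center_log = math.log10(center_params)
    return 10 ** (center_log - half), 10 ** (center_log + half)


# Illustrative assumption: a jump that looks one tick wide on a log10 axis,
# centered at a 1e9-parameter model.
start, end = transition_span(1e9, 1.0)
ratio = end / start   # multiplicative growth in model size across the jump
extra = end - start   # additive growth in parameter count across the jump

print(f"start={start:.3g}, end={end:.3g}, ratio={ratio:.3g}, extra={extra:.3g}")
```

Under these assumptions, a jump that looks one tick wide on the plot corresponds to a tenfold increase in parameters, so it is spread over billions of additional parameters even though it looks abrupt.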
(I think there’s also another disagreement here about how close humans are to this natural limit.)