Sigmoids don’t accurately extrapolate the scaling behavior(s) of the performance of artificial neural networks.
Use a Broken Neural Scaling Law (BNSL) in order to obtain accurate extrapolations:
https://arxiv.org/abs/2210.14891
https://arxiv.org/pdf/2210.14891.pdf
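For reference, a sketch of the BNSL functional form (eq. 1 of the paper) in code. This is written from the equation for illustration, not taken from any released fitting code, and the break-picking and optimization details are in the paper's Appendix A.6:

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """Broken Neural Scaling Law, sketched from eq. 1 of arXiv:2210.14891.

    a       -- limiting value of y as x -> infinity
    b, c0   -- scale and exponent of the first power-law segment
    c, d, f -- per-break sequences: change in exponent, break location, break sharpness
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in zip(c, d, f):
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y
```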
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:
Instead of having linear performance on the y-axis, switch to something like log(max_performance - actual_performance). (So that we get a log-log plot.)
Then for each series of data points, look for the largest n such that the last n data points are roughly on a line. (I.e. identify the last power law segment.)
Then to extrapolate into the future, project that line forward. (I.e. fit a power law to the last power law segment and project it forward.)
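To make that concrete, here is a rough sketch of what I mean. The helper below is my own toy code, not anything from the paper, and it glosses over how to pick the segment length automatically by just taking a fixed tail of points:

```python
import numpy as np

def extrapolate_last_segment(x, perf, max_perf, n_tail=5):
    """Fit a power law to the last n_tail points of the gap to max performance
    and project it forward (toy sketch of the recipe described above)."""
    log_x = np.log(np.asarray(x, dtype=float))
    log_gap = np.log(max_perf - np.asarray(perf, dtype=float))  # log(max - actual)

    # Treat the last n_tail points as the final power-law segment and fit a
    # straight line to them in log-log space.
    slope, intercept = np.polyfit(log_x[-n_tail:], log_gap[-n_tail:], deg=1)

    def predict(x_future):
        # Project the line forward, then map back to the original performance scale.
        gap = np.exp(intercept + slope * np.log(np.asarray(x_future, dtype=float)))
        return max_perf - gap

    return predict
```

Usage would be something like predict = extrapolate_last_segment(compute, accuracy, max_perf=1.0) and then predict(future_compute).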
That description misses the cases where BNSL-fitting would predict a slow, smooth shift from one power law to another, with that gradual shift continuing into the future. I don’t know how important that is. Curious for your intuition about whether that matters, and/or other reasons why my description above is or isn’t reasonable.
When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the “sigmoid” forms a distinct power law segment, and fit a power law to the points with >~50% performance (or less than that if there are too few points with >50% performance). Which maybe suggests that the claim “BNSL does better” corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the “sigmoid”) isn’t informative for how fast they converge to ~maximum performance (top part of the “sigmoid”)? That seems plausible.
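In code terms that would just change which points get fed to the fit, something like this (again my own sketch, reusing the toy helper above and assuming performance is measured on a 0-1 scale):

```python
import numpy as np

# Keep only the "top half of the sigmoid": points with > ~50% performance,
# falling back to the last few points if too few clear that bar.
mask = perf > 0.5
if mask.sum() < 3:
    mask = np.arange(len(perf)) >= len(perf) - 3

# Fit a single power law to the remaining gap-to-maximum and project it forward.
predict = extrapolate_last_segment(x[mask], perf[mask], max_perf=1.0,
                                   n_tail=int(mask.sum()))
```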
We describe how to go about fitting a BNSL to yield best extrapolation in the last paragraph of Appendix Section A.6 “Experimental details of fitting BNSL and determining the number of breaks” of the paper:
https://arxiv.org/pdf/2210.14891.pdf#page=13