Sigmoids don’t accurately extrapolate the scaling behavior(s) of the performance of artificial neural networks.
Use a Broken Neural Scaling Law (BNSL) in order to obtain accurate extrapolations:
https://arxiv.org/abs/2210.14891
https://arxiv.org/pdf/2210.14891.pdf
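For reference, a sketch of the BNSL functional form (eq. 1 of the paper) in code. This is written from the equation for illustration, not taken from any released fitting code, and the break-picking and optimization details are in the paper's Appendix A.6:

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """Broken Neural Scaling Law, sketched from eq. 1 of arXiv:2210.14891.

    a       -- limiting value of y as x -> infinity
    b, c0   -- scale and exponent of the first power-law segment
    c, d, f -- per-break sequences: change in exponent, break location, break sharpness
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in zip(c, d, f):
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y
```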
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:
Instead of having linear performance on the y-axis, switch to something like log(max_performance - actual_performance). (So that we get a log-log plot.)
Then for each series of data points, look for the largest n such that the last n data points are roughly on a line. (I.e. identify the last power law segment.)
Then to extrapolate into the future, project that line forward. (I.e. fit a power law to the last power law segment and project it forward.)
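To make that concrete, here is a rough sketch of what I mean. The helper below is my own toy code, not anything from the paper, and it glosses over how to pick the segment length automatically by just taking a fixed tail of points:

```python
import numpy as np

def extrapolate_last_segment(x, perf, max_perf, n_tail=5):
    """Fit a power law to the last n_tail points of the gap to max performance
    and project it forward (toy sketch of the recipe described above)."""
    log_x = np.log(np.asarray(x, dtype=float))
    log_gap = np.log(max_perf - np.asarray(perf, dtype=float))  # log(max - actual)

    # Treat the last n_tail points as the final power-law segment and fit a
    # straight line to them in log-log space.
    slope, intercept = np.polyfit(log_x[-n_tail:], log_gap[-n_tail:], deg=1)

    def predict(x_future):
        # Project the line forward, then map back to the original performance scale.
        gap = np.exp(intercept + slope * np.log(np.asarray(x_future, dtype=float)))
        return max_perf - gap

    return predict
```

Usage would be something like predict = extrapolate_last_segment(compute, accuracy, max_perf=1.0) and then predict(future_compute).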
That description misses the cases where BNSL-fitting would predict a slow, smooth shift from one power law to another, with that gradual shift continuing into the future. I don’t know how important that is. Curious for your intuition about whether that matters, and/or other reasons why my description above is or isn’t reasonable.
When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the “sigmoid” forms a distinct power law segment, and fit a power law to the points with >~50% performance (or less than that if there are too few points with >50% performance). Which maybe suggests that the claim “BNSL does better” corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the “sigmoid”) isn’t informative for how fast they converge to ~maximum performance (top part of the “sigmoid”)? That seems plausible.
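In code terms that would just change which points get fed to the fit, something like this (again my own sketch, reusing the toy helper above and assuming performance is measured on a 0-1 scale):

```python
import numpy as np

# Keep only the "top half of the sigmoid": points with > ~50% performance,
# falling back to the last few points if too few clear that bar.
mask = perf > 0.5
if mask.sum() < 3:
    mask = np.arange(len(perf)) >= len(perf) - 3

# Fit a single power law to the remaining gap-to-maximum and project it forward.
predict = extrapolate_last_segment(x[mask], perf[mask], max_perf=1.0,
                                   n_tail=int(mask.sum()))
```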
We describe how to go about fitting a BNSL to yield best extrapolation in the last paragraph of Appendix Section A.6 “Experimental details of fitting BNSL and determining the number of breaks” of the paper:
https://arxiv.org/pdf/2210.14891.pdf#page=13