Thinking back to the “inconsistency” from the Kaplan et al papers...
In Appendix E of the new paper, we see the loss-vs-compute frontier start to “bend” from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
I suspect this bending is the transition from the faster “L(C) law” to the slower “L(D) law.”
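To picture what that transition would look like, here's a toy sketch. Everything in it is invented (exponents, constants, crossover point); it just encodes the qualitative story of a fast bound giving way to a slower one on log-log axes:

```python
import numpy as np

# Toy picture of the bend: every constant here is invented for illustration.
# Idea: the achievable loss is lower-bounded both by a fast "L(C)"-style law
# and by a slower "L(D)"-style law; the frontier follows whichever bound binds.
C = np.logspace(18, 26, 200)               # compute (arbitrary units)

L_compute_law = 10.0 * (1e18 / C) ** 0.08  # fast law (placeholder exponent)
L_data_law = 5.0 * (1e18 / C) ** 0.02      # slow law (placeholder exponent)

frontier = np.maximum(L_compute_law, L_data_law)

# On log-log axes, `frontier` is a steep straight line at small C that bends
# onto the shallower line once the slow law takes over (around C ~ 1e23 here).
```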
A brief recap of that below:
Adding more params can help in two ways: it makes your model’s loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
As models get bigger, the first effect dies off: the loss curves converge to a fixed shape, rather than getting ever steeper. The second effect keeps going, but on its own, the overall rate of return is lower.
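A minimal sketch of the big-model regime, assuming a Kaplan-style separable form L(N, S) = (N_c/N)^α_N + (S_c/S)^α_S. The exponents are roughly what I remember from Kaplan et al., but treat every constant as a placeholder:

```python
# Toy sketch of the big-model regime (constants are placeholders, not a fit):
# a Kaplan-style separable form L(N, S) = (N_c / N)**alpha_N + (S_c / S)**alpha_S,
# where the training-curve term has a fixed shape (effect 1 has died off) and
# extra params only lower the floor (effect 2).

def loss_floor(n_params, n_c=8.8e13, alpha_n=0.076):
    """Asymptotic minimum: the only thing more params buy you in this regime."""
    return (n_c / n_params) ** alpha_n

def toy_loss(n_params, step, s_c=2.1e3, alpha_s=0.76):
    """Floor set by N, plus an N-independent, fixed-shape approach term."""
    return loss_floor(n_params) + (s_c / step) ** alpha_s

for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}: floor={loss_floor(n):.2f}, "
          f"loss after 10k steps={toy_loss(n, 1e4):.2f}")
```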
Presumably, the learning rate issue in Kaplan et al. also affected their estimated L(D) law.
The issue made Kaplan et al underestimate optimal model performance. The underestimate was worst when considering models for which the optimal number of training steps was small.
The L(D) law came from early stopping experiments. The early stopping step is lower for smaller data sizes.
So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values. Thus the estimated L(D) curve declines faster than the true L(D) curve.
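Here's a toy numerical version of that argument (all numbers made up): take a “true” L(D) power law, add a penalty that is largest at small D, standing in for the early-stopping artifact, and refit. The fitted exponent comes out steeper than the true one:

```python
import numpy as np

# Every number below is made up; the point is only the direction of the bias.
true_alpha_d = 0.095                        # "true" L(D) exponent (placeholder)
D = np.array([1e7, 1e8, 1e9, 1e10, 1e11])   # dataset sizes
true_loss = 5.0 * (1e7 / D) ** true_alpha_d

# Artifact: small-D runs stop early and look worse; the penalty fades as D grows.
# (The penalty's functional form is invented purely for illustration.)
penalty = 0.3 * (1e7 / D) ** 0.2
observed_loss = true_loss + penalty

def fitted_exponent(loss):
    """Fit log L = log c - alpha * log D and return alpha."""
    slope, _ = np.polyfit(np.log(D), np.log(loss), 1)
    return -slope

print(f"true exponent:              {fitted_exponent(true_loss):.3f}")
print(f"exponent fit with artifact: {fitted_exponent(observed_loss):.3f}  (steeper)")
```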
If this is correct, then L(D) improves more slowly with data than we had believed.
Note that this does not contradict the “use more data!” result from the paper; that result is about the relative rate at which N and D affect L(N, D).