Thinking back to the “inconsistency” from the Kaplan et al papers...
In Appendix E of the new paper, we see the loss-vs-compute frontier start to “bend” from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
I suspect this bending is the transition from the faster “L(C) law” to the slower “L(D) law.”
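To picture what that transition would look like, here's a toy sketch. Everything in it is invented (exponents, constants, crossover point); it just encodes the qualitative story of a fast bound giving way to a slower one on log-log axes:

```python
import numpy as np

# Toy picture of the bend: every constant here is invented for illustration.
# Idea: the achievable loss is lower-bounded both by a fast "L(C)"-style law
# and by a slower "L(D)"-style law; the frontier follows whichever bound binds.
C = np.logspace(18, 26, 200)               # compute (arbitrary units)

L_compute_law = 10.0 * (1e18 / C) ** 0.08  # fast law (placeholder exponent)
L_data_law = 5.0 * (1e18 / C) ** 0.02      # slow law (placeholder exponent)

frontier = np.maximum(L_compute_law, L_data_law)

# On log-log axes, `frontier` is a steep straight line at small C that bends
# onto the shallower line once the slow law takes over (around C ~ 1e23 here).
```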
A brief recap of that below:
Adding more params can help in two ways: it makes your model’s loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
As models get bigger, the first effect dies off: the loss curves converge to a fixed shape, rather than getting ever steeper. The second effect keeps going, but on its own, the overall rate of return is lower.
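A minimal sketch of the big-model regime, assuming a Kaplan-style separable form L(N, S) = (N_c/N)^α_N + (S_c/S)^α_S. The exponents are roughly what I remember from Kaplan et al., but treat every constant as a placeholder:

```python
# Toy sketch of the big-model regime (constants are placeholders, not a fit):
# a Kaplan-style separable form L(N, S) = (N_c / N)**alpha_N + (S_c / S)**alpha_S,
# where the training-curve term has a fixed shape (effect 1 has died off) and
# extra params only lower the floor (effect 2).

def loss_floor(n_params, n_c=8.8e13, alpha_n=0.076):
    """Asymptotic minimum: the only thing more params buy you in this regime."""
    return (n_c / n_params) ** alpha_n

def toy_loss(n_params, step, s_c=2.1e3, alpha_s=0.76):
    """Floor set by N, plus an N-independent, fixed-shape approach term."""
    return loss_floor(n_params) + (s_c / step) ** alpha_s

for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}: floor={loss_floor(n):.2f}, "
          f"loss after 10k steps={toy_loss(n, 1e4):.2f}")
```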
Presumably, the learning rate issue in Kaplan et al. also affected their estimated L(D) law.
The issue made Kaplan et al underestimate optimal model performance. The underestimate was worst when considering models for which the optimal number of training steps was small.
The L(D) law came from early stopping experiments. The early stopping step is lower for smaller data sizes.
So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values. Thus the estimated L(D) curve declines faster than the true L(D) curve.
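Here's a toy numerical version of that argument (all numbers made up): take a “true” L(D) power law, add a penalty that is largest at small D, standing in for the early-stopping artifact, and refit. The fitted exponent comes out steeper than the true one:

```python
import numpy as np

# Every number below is made up; the point is only the direction of the bias.
true_alpha_d = 0.095                        # "true" L(D) exponent (placeholder)
D = np.array([1e7, 1e8, 1e9, 1e10, 1e11])   # dataset sizes
true_loss = 5.0 * (1e7 / D) ** true_alpha_d

# Artifact: small-D runs stop early and look worse; the penalty fades as D grows.
# (The penalty's functional form is invented purely for illustration.)
penalty = 0.3 * (1e7 / D) ** 0.2
observed_loss = true_loss + penalty

def fitted_exponent(loss):
    """Fit log L = log c - alpha * log D and return alpha."""
    slope, _ = np.polyfit(np.log(D), np.log(loss), 1)
    return -slope

print(f"true exponent:              {fitted_exponent(true_loss):.3f}")
print(f"exponent fit with artifact: {fitted_exponent(observed_loss):.3f}  (steeper)")
```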
If this is correct, then L(D) improves more slowly with data than we had believed.
Note that this does not contradict the “use more data!” result from the paper; that result is about the relative rate at which N and D affect L(N, D).