But scaling today’s ML systems cannot solve the small data problem because historical validation is an invalid measure of performance when long-tailed outcomes are involved. Big data approaches implicitly rely on historical validation.
The scaling hypothesis says “whatever algorithm you think can solve the small data problem, something analogous will eventually be learned by a large enough neural net with enough data + compute, because solving the small data problem is useful for loss”.
Importantly, you don’t solve small data problems by running gradient descent on them. You solve them by taking your big pretrained neural network, providing the small data problem as an input to that network, and letting the forward passes of the network solve the problem, which works because those forward passes are executing algorithms similar to <the ones which actually work>.
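A minimal sketch of that distinction, using GPT-2 via the Hugging Face transformers library purely as a stand-in for a much larger model (the model choice and the prompt are my illustrative assumptions, not anything from this thread):

```python
# Sketch: the small data problem enters as *input* to a frozen pretrained model;
# no gradient descent touches the weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for a much larger model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = (
    "Look carefully for the pattern, and then choose which pair of numbers comes next.\n"
    "42 40 38 35 33 31 28\n"
    "A. 25 22  B. 26 23  C. 26 24  D. 25 23  E. 26 22\n"
    "Answer: Option"
)

inputs = tokenizer(prompt, return_tensors="pt")
# The "solving", to whatever extent it happens, happens inside these forward passes.
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```

GPT-2 itself is far too small to reliably get this right; the point is only the mechanism: the few-shot problem is consumed as context, and whatever comes out is produced by forward passes of fixed weights.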
What do you mean by “solving the small data problem is useful for loss”?
If you want to e.g. predict text on the Internet, you can do a better job of it if you can solve small data problems than if you can’t.
For example, in the following text (which I copied from here):
“Look carefully for the pattern, and then choose which pair of numbers comes next.
42 40 38 35 33 31 28
A. 25 22
B. 26 23
C. 26 24
D. 25 23
E. 26 22
Answer & Explanation:
Answer: Option”
You will do a better job at predicting the next token if you can learn the pattern from the given sequence of 7 numbers.
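For concreteness, the pattern as I read it is a repeating difference cycle of -2, -2, -3, which points to option C; a quick arithmetic check (the “repeating cycle” reading is my own assumption, not something spelled out in the thread):

```python
# Check the sequence pattern: the differences appear to cycle -2, -2, -3
# (my reading of the puzzle, not stated in the thread).
seq = [42, 40, 38, 35, 33, 31, 28]
diffs = [b - a for a, b in zip(seq, seq[1:])]
print(diffs)  # [-2, -2, -3, -2, -2, -3]

cycle = [-2, -2, -3]
first = seq[-1] + cycle[len(diffs) % 3]       # 28 - 2 = 26
second = first + cycle[(len(diffs) + 1) % 3]  # 26 - 2 = 24
print(first, second)  # 26 24 -> option C
```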
This is a very, very small benefit in absolute terms, but once you get to very, very large models, that is the sort of thing they end up learning.
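As a back-of-the-envelope illustration of how small (the numbers here are entirely my own, not from the thread): getting the answer token right instead of guessing uniformly over the five options saves about ln(5) ≈ 1.6 nats on that one token, which washes out to essentially nothing once averaged over a pretraining corpus:

```python
import math

# Rough illustration (my assumptions): one puzzle's answer token, and a corpus
# of ~300B tokens (roughly GPT-3-scale pretraining data).
per_token_gain = math.log(5)            # ~1.61 nats saved vs. a uniform guess over 5 options
corpus_tokens = 300e9
print(per_token_gain)                   # ~1.61
print(per_token_gain / corpus_tokens)   # ~5.4e-12 nats off the average loss
```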
I expect something similar will be true for whichever small-data problems you have in mind (though they may require models with longer context than GPT-3 has).