What do you mean by “solving the small data problem is useful for loss”?
If you want to, e.g., predict text on the Internet, you will do a better job of it if you can solve small-data problems than if you can’t.
For example, in the following text (which I copied from here):
“Look carefully for the pattern, and then choose which pair of numbers comes next.
42 40 38 35 33 31 28
A. 25 22
B. 26 23
C. 26 24
D. 25 23
E. 26 22
Answer & Explanation:
Answer: Option”
You will do a better job at predicting the next token if you can learn the pattern from the given sequence of 7 numbers.
This is a very, very small benefit in absolute terms, but once you get to very, very large models, that is the sort of thing the model learns.
I expect a similar thing will be true for whichever small-data problems you have in mind (though they may require models with a longer context window than GPT-3 has).
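To make the number-sequence example concrete, here is a rough Python sketch of what "learning the pattern from 7 numbers" amounts to. It is only an illustration, not a claim about what a language model computes internally: the helper `next_terms` and the `options` dictionary are hypothetical names I made up, and the rule it assumes (the differences between consecutive terms repeat with the shortest period that fits) is just one simple hypothesis that happens to fit this puzzle.

```python
import math

def next_terms(seq, count):
    """Extend seq, assuming its consecutive differences repeat with the
    shortest period consistent with the data (needs at least two terms)."""
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    # Shortest period p such that every difference equals the one p steps earlier.
    period = next(
        p for p in range(1, len(diffs) + 1)
        if all(diffs[i] == diffs[i - p] for i in range(p, len(diffs)))
    )
    out = list(seq)
    for _ in range(count):
        # The difference leading to the next term has index len(out) - 1.
        out.append(out[-1] + diffs[(len(out) - 1) % period])
    return out[len(seq):]

sequence = [42, 40, 38, 35, 33, 31, 28]
options = {
    "A": [25, 22],
    "B": [26, 23],
    "C": [26, 24],
    "D": [25, 23],
    "E": [26, 22],
}

predicted = next_terms(sequence, 2)
answer = next(label for label, pair in options.items() if pair == predicted)
print(predicted, answer)

# A predictor with no idea which option is right, spreading probability
# evenly over the 5 options, pays ln(5) nats on the answer token.
# A predictor that has solved the pattern can pay close to 0 on it.
uniform_loss = math.log(len(options))
print(f"loss on the answer token when guessing uniformly: {uniform_loss:.2f} nats")
```

The differences here are -2, -2, -3, repeating, so the sketch continues the sequence with 26 and 24 and picks option C. The saving of roughly 1.6 nats applies to a single token, which is why the benefit is so small once averaged over a whole corpus.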