Predictions for GPT-N
Regarding GPT-3, there is some discussion about whether growing the model would transform it into an Oracle AI. I looked into the benchmark results (Appendix H in the paper) to see if we can predict something useful from the actual measurements.
Method: The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each in zero-, one-, and few-shot settings. In each scenario, there are 8 model sizes. I looked at how the results scale with model size. With only 8 measurements, any prediction carries a large uncertainty. Formally, one would test the trend function using Bayesian model selection between a linear and, e.g., a polynomial model. I did this for a few benchmarks and then eyeballed the rest. So, please take the following as an indication only.
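As a rough sketch of what such a model comparison could look like: the snippet below fits a straight line and a quadratic to score versus log10(parameters) and compares them with the Bayesian information criterion (BIC). BIC is only a cheap approximation to full Bayesian model selection and is swapped in here for brevity; it is not necessarily the procedure used for this post. The function name `bic_polyfit` and all score values are hypothetical placeholders, not numbers from the paper.

```python
import numpy as np

def bic_polyfit(x, y, degree):
    """BIC of a least-squares polynomial fit, assuming Gaussian residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    k = degree + 1  # number of fitted coefficients
    return n * np.log(rss / n) + k * np.log(n)

# Hypothetical placeholder scores for 8 model sizes (NOT taken from the paper).
log10_params = np.linspace(8.0, 11.0, 8)
score = np.array([10.0, 14.5, 17.8, 23.1, 26.9, 31.2, 36.0, 39.7])

bic_linear = bic_polyfit(log10_params, score, degree=1)
bic_quadratic = bic_polyfit(log10_params, score, degree=2)

# Lower BIC is preferred; a gap of only a few units is weak evidence.
print(f"BIC linear: {bic_linear:.2f}, BIC quadratic: {bic_quadratic:.2f}")
```

With only 8 points, the penalty term rarely separates the two models decisively, which is one reason to treat the classifications below as indicative.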
Disclaimer: The smallest GPT-3 model has 10^8 parameters, the largest 10^11. That’s a span of 3 orders of magnitude. Extrapolating this over many more orders of magnitude is dangerous. Thus, take these numbers only as an indication.
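To illustrate why extrapolation is risky, here is a minimal sketch using the textbook standard error of a prediction from a straight-line fit. The error grows with the distance of the query point from the centre of the measured range. It assumes the trend really is linear in log10(parameters) with independent Gaussian noise, which is itself uncertain. The function name `prediction_std_error` and the scores are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def prediction_std_error(x, y, x_new):
    """Standard error of a predicted new observation from a straight-line fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # residual standard deviation
    sxx = np.sum((x - x.mean()) ** 2)              # spread of the measured x values
    return s * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / sxx)

# Hypothetical placeholder scores for 8 model sizes (NOT taken from the paper).
log10_params = np.linspace(8.0, 11.0, 8)
score = np.array([10.0, 14.5, 17.8, 23.1, 26.9, 31.2, 36.0, 39.7])

# Compare the centre of the data, its upper edge, and a far extrapolation.
for x_new in (9.5, 11.0, 16.0):
    err = prediction_std_error(log10_params, score, x_new)
    print(f"log10(params) = {x_new}: standard error = {err:.2f}")
```

This only captures noise around an assumed straight line. If the true trend bends beyond 10^11 parameters, the real error is larger than this estimate.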
Results. For the following tests, I find an asymptotic trend. Scaling the model will apparently not yield fantastic results for:
HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, QuAC, RACE, CB, ReCoRD, WiC
Translations (though the level description is unclear).
In the following tests, it is unclear if the trend is asymptotic or better than that:
SAT: Could be linear, could be asymptotic. If linear, it would reach 100% at 10^16 parameters.
StoryCloze, Winograd, Winogrande, SQuADv2, DROP, COPA.
These tests show linear scaling:
TriviaQA (10^13 parameters estimated to reach 100%)
BoolQ (10^15)
MultiRC (10^16)
ARC (10^16)
SuperGLUE (10^18)
WSC (10^20)
WebQs (10^21)
Cycled (10^23)
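The parameter estimates in this list come from extending a fitted line until it reaches 100%. A minimal sketch of that arithmetic follows, assuming "linear" means linear in log10(parameters); the post does not state this explicitly. The function name `log10_params_at_target` and the scores are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def log10_params_at_target(log10_params, scores, target=100.0):
    """Solve slope * x + intercept = target for x, using a least-squares line."""
    slope, intercept = np.polyfit(log10_params, scores, 1)
    if slope <= 0:
        raise ValueError("score does not increase with model size; no crossing")
    return (target - intercept) / slope

# Hypothetical placeholder scores for 8 model sizes (NOT taken from the paper).
log10_params = np.linspace(8.0, 11.0, 8)
score = np.array([10.0, 14.5, 17.8, 23.1, 26.9, 31.2, 36.0, 39.7])

crossing = log10_params_at_target(log10_params, score)
print(f"Line reaches 100% at roughly 10^{crossing:.1f} parameters")
```

The result is very sensitive to the fitted slope: a small change in slope shifts the crossing point by orders of magnitude, so each exponent should be read as a ballpark figure.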
Some tests scale neither linearly nor asymptotically:
Symbol: Near exponential (10^12)
Arithmetic: Exponential; one-digit composite may achieve 100% at 10^14
Reversed: Near exponential (10^16)
Anagrams: Polynomial (10^19)
ANLI: stepped, unclear
RTE: stepped, unclear
Summary: About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Would going to, e.g., 10^16 parameters make an Oracle AI? Probably that is not sufficient, but I’m interested in hearing your opinion!