Predictions for GPT-N

Regarding GPT-3, there is some discussion about whether growing the model would transform it into an Oracle AI. I looked into the benchmark results (Appendix H in the paper) to see if we can predict something useful from the actual measurements.

Method: The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each for zero/one/few shot. In each scenario, there are 8 model sizes. I looked at how results scale with model size. With only 8 measurements, there is a large associated uncertainty for predictions. Formally, one would test the trend function using Bayesian model selection between a linear model and, e.g., a polynomial one. I did this for a few benchmarks and then eyeballed the rest. So, please take the following as an indication only.
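
To make the model-selection step concrete, here is a minimal sketch. It assumes that "linear" means linear in $\log_{10}$ of the parameter count, and it uses the Bayesian information criterion (BIC) as a rough stand-in for a full Bayesian model comparison, so it is not necessarily the exact procedure used for this post. The function names are made up for illustration, and the scores are synthetic placeholders, not values from the paper.

```python
import numpy as np

# Approximate parameter counts of the 8 GPT-3 model sizes (125M to 175B).
PARAMS = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])
LOG_N = np.log10(PARAMS)


def bic_for_degree(log_n, score, degree):
    """BIC of a least-squares polynomial fit in log10(parameters)."""
    coeffs = np.polyfit(log_n, score, degree)
    residuals = score - np.polyval(coeffs, log_n)
    n = len(score)
    k = degree + 1  # number of fitted coefficients
    rss = max(float(np.sum(residuals ** 2)), 1e-12)  # avoid log(0) on exact fits
    return n * np.log(rss / n) + k * np.log(n)


def preferred_degree(log_n, score, degrees=(1, 2)):
    """Return the polynomial degree with the lowest BIC."""
    return min(degrees, key=lambda d: bic_for_degree(log_n, score, d))


# SYNTHETIC scores, not values from the paper: a straight line and a curved
# trend, each with a small deterministic wiggle standing in for noise.
wiggle = 0.5 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
linear_scores = 10 + 8 * (LOG_N - 8) + wiggle
curved_scores = 10 + 3 * (LOG_N - 8) ** 2 + wiggle

print("straight-line data, preferred degree:", preferred_degree(LOG_N, linear_scores))
print("curved data, preferred degree:", preferred_degree(LOG_N, curved_scores))
```

With only 8 points, the BIC penalty for the extra coefficient is small, so a preference for the more complex model should be read cautiously.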

Disclaimer: The smallest model for GPT-3 has $10^{8}$ parameters, the largest $10^{11}$. That’s a span of 3 orders of magnitude. Extrapolating this over many more orders of magnitude is dangerous. Thus, take these numbers only as an indication.
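
One rough way to see how fragile such an extrapolation is: refit the trend with each of the 8 points left out in turn and watch how far the predicted 100% crossing moves. The sketch below assumes a trend that is linear in $\log_{10}$ of the parameter count; the helper name is hypothetical and the scores are synthetic placeholders, not measurements from the paper.

```python
import numpy as np

# Approximate parameter counts of the 8 GPT-3 model sizes (125M to 175B).
PARAMS = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])
LOG_N = np.log10(PARAMS)


def leave_one_out_crossings(log_n, score, target=100.0):
    """log10(parameters) at which each leave-one-out linear fit reaches target."""
    crossings = []
    for i in range(len(log_n)):
        keep = np.arange(len(log_n)) != i  # boolean mask dropping point i
        slope, intercept = np.polyfit(log_n[keep], score[keep], 1)
        crossings.append((target - intercept) / slope)
    return np.array(crossings)


# SYNTHETIC scores, not values from the paper: a linear trend plus a
# deterministic wiggle of +/- 2 points standing in for measurement noise.
wiggle = 2.0 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
scores = 20 + 10 * (LOG_N - 8) + wiggle

crossings = leave_one_out_crossings(LOG_N, scores)
print(f"100% crossing ranges from 10^{crossings.min():.1f} "
      f"to 10^{crossings.max():.1f} parameters")
```

The spread from dropping a single point is only a lower bound on the real uncertainty, since it says nothing about whether the linear form itself holds outside the measured range.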

Results: For the following tests, I find an asymptotic trend. Scaling up the model will apparently not yield fantastic results for:

HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, QuAC, RACE, CB, ReCoRD, WiC
Translations (but the level description is unclear)

In the following tests, it is unclear if the trend is asymptotic or better than that:

SAT: Could be linear, could be asymptotic. If linear, it will achieve 100% at $10^{16}$ parameters.
StoryCloze, Winograd, Winogrande, SQuADv2, DROP, COPA.

These tests show a linear scaling:

TriviaQA ($10^{13}$ parameters estimated to achieve 100%)
BoolQ ($10^{15}$)
MultiRC ($10^{16}$)
ARC ($10^{16}$)
SuperGLUE ($10^{18}$)
WSC ($10^{20}$)
WebQs ($10^{21}$)
Cycled ($10^{23}$)
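
The parameter estimates in this list come from extending a fitted line until it reaches 100%. A minimal sketch of that calculation follows, assuming the score is linear in $\log_{10}$ of the parameter count. The function name is hypothetical and the example scores are synthetic, chosen so that the answer is known in advance; they are not taken from the paper.

```python
import numpy as np

# Approximate parameter counts of the 8 GPT-3 model sizes (125M to 175B).
PARAMS = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])
LOG_N = np.log10(PARAMS)


def log10_params_at_target(log_n, score, target=100.0):
    """Solve intercept + slope * log10(N) = target for log10(N)."""
    slope, intercept = np.polyfit(log_n, score, 1)
    if slope <= 0:
        raise ValueError("score does not increase with model size")
    return (target - intercept) / slope


# SYNTHETIC scores, not values from the paper: 20% at 10^8 parameters,
# gaining 10 percentage points per order of magnitude.
scores = 20 + 10 * (LOG_N - 8)

exponent = log10_params_at_target(LOG_N, scores)
print(f"linear trend reaches 100% near 10^{exponent:.0f} parameters")
```

Because the slope sits in the denominator, small changes in the fitted slope shift the estimated exponent a lot, which is one reason the larger exponents in the list are especially uncertain.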

Some tests scale neither linearly nor asymptotically:

Symbol: Near exponential ($10^{12}$)
Arithmetic: Exponential; one-digit composite may achieve 100% at $10^{14}$
Reversed: Near exponential ($10^{16}$)
Anagrams: Polynomial ($10^{19}$)
ANLI: stepped, unclear
RTE: stepped, unclear

Summary: About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Would going to, e.g., $10^{16}$ parameters make an Oracle AI? Probably that is not sufficient, but I’m interested in hearing your opinion!