Planned summary for the Alignment Newsletter:

This post describes the author’s insights from extrapolating the performance of GPT on the benchmarks presented in the <@GPT-3 paper@>(@Language Models are Few-Shot Learners@). The author compares cross-entropy loss (which measures how good a model is at predicting the next token) with benchmark performance normalized so that 0 corresponds to random performance and 1 to the maximum possible performance. Since <@previous work@>(@Scaling Laws for Neural Language Models@) has shown that cross-entropy loss scales smoothly with model size, data, and training compute (FLOP), we can then look at the overall relationship between those inputs and benchmark performance.
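To make the comparison concrete, here is a minimal Python sketch (not from the post; the benchmark, baseline accuracy, and power-law constants below are illustrative assumptions) of how raw benchmark accuracy can be rescaled between random and maximum performance, and how cross-entropy loss can be modeled as a smooth power law in training compute:

```python
import numpy as np

def normalize(accuracy, random_acc, max_acc=1.0):
    """Rescale raw accuracy so that 0 = random guessing and 1 = maximum performance."""
    return (accuracy - random_acc) / (max_acc - random_acc)

# Example: a 4-way multiple-choice benchmark where a model scores 55% (made-up numbers).
print(normalize(0.55, random_acc=0.25))  # -> 0.4 of the way from random to perfect

def loss_from_compute(compute, c0=1e8, alpha=0.05):
    """Illustrative power law relating cross-entropy loss to training compute,
    in the spirit of the scaling-laws paper; c0 and alpha are placeholder values."""
    return (c0 / compute) ** alpha

print(loss_from_compute(1e10))  # loss under the placeholder fit
```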
The author finds that most of the benchmarks scale smoothly and similarly with respect to cross-entropy loss. Three exceptions are arithmetic, scramble (tasks that require unscrambling the letters of a word to recover the original word), and ANLI (a benchmark generated adversarially against transformer-based language models), which don’t improve until the very end of the cross-entropy loss range. The author fits linear and s-shaped curves to these relationships (a rough illustration of such a fit follows the list below), and guesses that:
- Performance improvements are likely to slow down closer to maximum performance, making s-curves a better progress estimate than linear curves.
- Machine learning models may use very different reasoning from humans to get good performance on a given benchmark, so human-level performance on any single benchmark would likely not be impressive, but human-level performance on almost all of them with few examples might be.
- We might care about the point where we can achieve human-level performance on all tasks with a 1 token “horizon length”, i.e. all tasks where just 1 token is enough of a training signal to understand how a change in the model affects its performance. (See <@this AI timelines report draft@>(@Draft report on AI timelines@) for more on horizon length.) Achieving this milestone is likely to be _more_ difficult than getting to human-level performance on these benchmarks, but since scaling up GPT is just one way to do these tasks, the raw number of parameters required for this milestone could just as well be _less_ than the number of parameters that GPT needs to beat the benchmarks.
- Human-level performance on these benchmarks would likely be enough to automate lots of particular short horizon length tasks, such as customer service, PA and RA work, and writing routine sections of code.
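As a rough illustration of the linear vs. s-curve fits mentioned above (this is not the author’s code; the data points and parameter choices are made up), one could fit both shapes to normalized performance as a function of cross-entropy loss and compare their extrapolations:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (loss, normalized performance) points for one benchmark; lower loss is better.
loss = np.array([3.0, 2.6, 2.3, 2.1, 1.95, 1.85])
perf = np.array([0.05, 0.12, 0.25, 0.45, 0.62, 0.72])

def linear(x, a, b):
    return a * x + b

def s_curve(x, k, x0):
    # Logistic curve in loss space: performance rises toward 1 as loss falls below x0.
    return 1.0 / (1.0 + np.exp(k * (x - x0)))

lin_params, _ = curve_fit(linear, loss, perf)
sig_params, _ = curve_fit(s_curve, loss, perf, p0=[5.0, 2.0])

# Extrapolate both fits to a lower loss than any observed model achieves.
target_loss = 1.6
print("linear extrapolation: ", linear(target_loss, *lin_params))
print("s-curve extrapolation:", s_curve(target_loss, *sig_params))
```

The two fits agree reasonably well on the observed range but diverge when extrapolated, which is exactly the linear-vs-s-curve question the author is highlighting.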
The author augments his s-curves graph with reference lines for certain data, FLOP, and parameter levels, including the number of words in Common Crawl, the number of FLOPs that could currently be bought for $1B, the point where reading or writing one word would cost 1 cent, and the number of parameters in a transformative model according to <@this AI timelines report draft@>(@Draft report on AI timelines@). (I recommend looking at the graph of these references to see their relationship to the benchmark trends.)
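For intuition, here is the kind of back-of-the-envelope arithmetic that could produce such reference lines; the price-performance figure and the 2-FLOP-per-parameter-per-token approximation below are assumptions for illustration, not numbers taken from the post or the report:

```python
# Back-of-the-envelope sketch of the kind of reference lines described above.
# Every constant here is an illustrative assumption, not a figure from the post.

FLOP_PER_DOLLAR = 1e17   # assumed current price-performance of compute (illustrative)
BUDGET = 1e9             # the $1B budget mentioned in the post

print(f"FLOP purchasable for $1B (assumed prices): {FLOP_PER_DOLLAR * BUDGET:.1e}")

def cost_per_word(num_params, flop_per_dollar=FLOP_PER_DOLLAR):
    """Dollar cost of generating one word, using the rough rule of thumb that a
    forward pass costs about 2 * num_params FLOP per token (treating 1 word ~ 1 token)."""
    return 2 * num_params / flop_per_dollar

# Model size at which writing one word would cost one cent, under these assumptions.
params_at_one_cent = 0.01 * FLOP_PER_DOLLAR / 2
print(f"Parameters where one word costs 1 cent (assumed prices): {params_at_one_cent:.1e}")
print(f"Check: {cost_per_word(params_at_one_cent):.4f} dollars per word")
```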
Overall, the author concludes that:
- GPT-3’s performance is in line with the smooth benchmark trends predicted by smaller models. It sharply increases performance on arithmetic and scramble tasks, which the author thinks is because the tasks are ‘narrow’ in that they are easy once you understand their one trick. The author now finds it less likely that a small amount of scaling will suddenly lead to a large jump in performance on a wide range of tasks.
- Close to optimal performance on these benchmarks seems like it’s at least ~3 orders of magnitude of compute away ($1B at current prices). The author thinks that, more likely than not, we’d get there after increasing the training FLOP by ~5-6 orders of magnitude ($100B-$1T at current prices, $1B-$10B given estimated hardware and software improvements over the next 5-10 years). The author thinks this would probably not be enough to be transformative, but thinks we should prepare for that eventuality anyway.
- The number of parameters estimated for human-equivalent performance on these benchmarks (~1e15) is close to the median number of parameters given in <@this AI timelines report draft@>(@Draft report on AI timelines@), which is generated via a comparison to the computation done in the human brain.
Planned opinion:
Ask and ye shall receive! In my <@last summary@>(@Scaling Laws for Autoregressive Generative Modeling@), I mentioned that I was uncertain about how cross-entropy loss translates to transformative progress that we care about, and here is an excellent post exploring just that question. I’m sure I’ll end up referencing this many times in the future.
The post discusses both what benchmarks might suggest for forecasting “human equivalence” and how benchmarks might relate to economic value via concrete task automation. I agree with the tasks the author suggests for the latter, and continuing my “opinions as calls for more work” trend, I’d be interested in seeing even more work on this—i.e. attempts to decompose tasks into a set of concrete benchmark performances which would be sufficient for economically valuable automation. This comment thread discusses whether current benchmarks are likely to capture a substantial portion of what is necessary for economic value, given that many jobs end up requiring a diverse portfolio of skills and reasoning ability. It seems plausible to me that AI-powered automation will be “discontinuous” in that a lot of it will be unlocked only when we have a system that’s fairly general.
It seems quite noteworthy that the parameter estimates here and in the AI timelines report draft are close together, even though one is anchored to human-level benchmark performance, and the other is anchored to brain computation. That updates me in the direction of those numbers being in the right range for human-like abilities.
People interested in this post may also be interested in [BIG-bench](https://github.com/google/BIG-Bench), a project to crowdsource the mother of all benchmarks for language models.