Planned summary for the Alignment Newsletter:

The biggest <@GPT-2 model@>(@Better Language Models and Their Implications@) had 1.5 billion parameters, and since its release people have trained language models with up to 17 billion parameters. This paper reports results for GPT-3, whose largest model has _175 billion_ parameters, a 10x increase over the previous largest language model. To get the obvious out of the way, it sets a new state of the art (SOTA) on zero-shot language modeling (evaluated only on the Penn Treebank, as the other evaluation sets were accidentally included in their training data).
The primary focus of the paper is on analyzing the _few-shot learning_ capabilities of GPT-3. In few-shot learning, after an initial training phase, models are presented at test time with a small number of examples of a new task, and must then execute that task on new inputs. Such problems are usually solved using _meta-learning_ or _finetuning_; for example, at test time MAML takes a few gradient steps on the new examples to produce a model finetuned for the test task. In contrast, the key hypothesis with GPT-3 is that language is so diverse that doing well on it already requires adaptation to the input, and so the learned language model will _already be a meta-learner_. This implies that they can simply “prime” the model with examples of a task they care about, and the model can _learn_ what task is supposed to be performed, and then perform that task well.
For example, consider the task of generating a sentence using a newly made-up word whose meaning has been explained. In one notable example, the prompt for GPT-3 is:
_A “whatpu” is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:_ _We were traveling in Africa and we saw these very cute whatpus._ _To do a “farduddle” means to jump up and down really fast. An example of a sentence that uses the word farduddle is:_
Given this prompt, GPT-3 generates the following output:
_One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles._
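To make the priming interface concrete, here is a minimal sketch using the publicly available GPT-2 via the HuggingFace transformers library. This is my own illustration rather than anything from the paper: GPT-3 itself is not publicly downloadable, and a model as small as GPT-2 will not reproduce the completion above.

```python
# Few-shot "priming": the task is specified entirely in the prompt,
# with no gradient updates or finetuning. GPT-2 stands in for GPT-3.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    'A "whatpu" is a small, furry animal native to Tanzania. An example '
    "of a sentence that uses the word whatpu is: We were traveling in "
    "Africa and we saw these very cute whatpus. "
    'To do a "farduddle" means to jump up and down really fast. An '
    "example of a sentence that uses the word farduddle is:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,   # the paper's samples are stochastic
    top_p=0.9,        # assumed decoding settings, not taken from the paper
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the continuation, not the prompt itself.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```

The point of the interface is that “training” on the new task happens entirely inside the forward pass: swapping in a differently primed prompt changes the task without touching the weights.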
The paper tests on several downstream tasks for which benchmarks exist (e.g. question answering), and reports zero-shot, one-shot, and few-shot performance on all of them. On some tasks, the few-shot version sets a new SOTA, _despite not being finetuned using the benchmark’s training set_; on others, GPT-3 lags considerably behind finetuning approaches.
The paper also consistently shows that few-shot performance increases as the number of parameters increases, and that the rate of increase is faster than the corresponding rate for zero-shot performance. While they don’t outright say it, we might take this as suggestive evidence that as models get larger, they are more incentivized to learn “general reasoning abilities”.
The most striking example of this is in arithmetic, where the six smallest models (up to 6.7 billion parameters) perform poorly (< 20% on 2-digit addition), the next model (13 billion parameters) jumps to > 50% on 2-digit addition and subtraction, and the final model (175 billion parameters) achieves > 80% on 3-digit addition and subtraction and a perfect 100% on 2-digit addition (all in the few-shot regime). They explicitly search the training set for their test problems and find very few examples, suggesting that the model really is learning “how to do addition”; moreover, when the model is incorrect, its errors tend to be mistakes like “forgetting to carry a 1”.
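As an illustration of what the few-shot arithmetic evaluation involves, here is a sketch that builds a k-shot addition prompt and grades the model by exact match. The prompt wording and the `complete` placeholder are my assumptions, not the paper’s exact setup.

```python
import random

def addition_prompt(k: int, digits: int, rng: random.Random):
    """Build a k-shot addition prompt ending in an unanswered query,
    and return it along with the expected answer string."""
    problems = []
    for _ in range(k + 1):
        a = rng.randrange(10 ** (digits - 1), 10 ** digits)
        b = rng.randrange(10 ** (digits - 1), 10 ** digits)
        problems.append((f"Q: What is {a} plus {b}?\nA:", str(a + b)))
    demos, (query, answer) = problems[:-1], problems[-1]
    prompt = "\n".join(f"{q} {ans}" for q, ans in demos) + "\n" + query
    return prompt, answer

rng = random.Random(0)
prompt, answer = addition_prompt(k=5, digits=2, rng=rng)
print(prompt)   # five solved examples followed by one unanswered query

# Scoring is exact match on the completion; `complete` is a placeholder
# for a call to whichever language model is being evaluated.
# correct = complete(prompt).strip() == answer
```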
On broader impacts, the authors discuss potential misuse, fairness and bias concerns, and energy usage, and say about what you’d expect. One interesting note: “To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed.” They found significant discussion of misuse, but no successful deployments. They also consulted with professional threat analysts about the possibility of well-resourced actors misusing the model. According to the paper: “The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for ‘targeting’ or ‘controlling’ the content of language models are still at a very early stage.”
Planned opinion:
For a long time, I’ve heard people quietly hypothesizing that with a sufficient diversity of tasks, regular gradient descent could lead to general reasoning abilities that allow for quick adaptation to new tasks. GPT-3 is a powerful demonstration of this hypothesis.
One critique is that GPT-3 still takes far too long to “identify” a task—why does it need 50 examples of addition in order to figure out that what it should do is addition? Why isn’t 1 sufficient? It’s not like there are a bunch of other conceptions of “addition” that need to be disambiguated. I’m not sure what’s going on mechanistically, but we can infer from the paper that as language models get larger, the number of examples needed to achieve a given level of performance goes down, so it seems like there is some “strength” of general reasoning ability that goes up. Still, it would be really interesting to figure out mechanistically how the model is “reasoning”.
This also provides some empirical evidence in support of the threat model underlying <@inner alignment concerns@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@): those concerns are predicated on neural nets that implicitly learn to optimize. (To be clear, I think it provides empirical support for neural nets learning to “reason generally”, not for neural nets implicitly learning to “perform search” in pursuit of a “mesa objective”; see also <@Is the term mesa optimizer too narrow?@>.)