To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion parameter model improving by over 10% compared to the 13 billion parameter model.
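For concreteness, here is a minimal sketch of how a multiple-choice task like this can be scored with a language model: compare the log-likelihood each option gets as a continuation of the stem and pick the highest. This uses GPT-2 via HuggingFace transformers as a stand-in for GPT-3, and glosses over the exact prompt format and any length-normalization the paper applies:

```python
# Minimal sketch: score each multiple-choice option by the total
# log-probability the LM assigns to it as a continuation of the stem.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of log-probabilities of the option's tokens given the context."""
    ctx_ids = tokenizer.encode(context)
    opt_ids = tokenizer.encode(option)
    input_ids = torch.tensor([ctx_ids + opt_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at position i predict token i+1; score only the option tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = len(ctx_ids) - 1
    target = torch.tensor(opt_ids)
    return logprobs[start:, :].gather(1, target.unsqueeze(1)).sum().item()

context = "audacious is to boldness as"
# Leading spaces keep GPT-2's BPE tokenization clean.
options = [" sanctimonious is to hypocrisy", " anonymous is to identity",
           " remorseful is to misdeed", " deleterious is to result",
           " impressionable is to temptation"]
best = max(options, key=lambda o: option_logprob(context, o))
print(best)
```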
This seems like a data point in favor of Yudkowsky’s old argument about crossing the human range. I wonder what the standard deviation is for humans answering SAT questions like this; I would guess it is something like 10 percentage points (though probably with a non-normal distribution?). So in this case at least, it looks like all they had to do to get a human-standard-deviation of improvement was add another order of magnitude of compute: the jump from the 13-billion-parameter model to the 175-billion-parameter model is over 10 points, i.e. roughly one such standard deviation.
On the other hand, this still looks more like a service than part of a path towards general intelligence, even if it’s a very broad, flexible, and fairly general service. For example, I don’t expect GPT-3 to come up with things to do on its own, only to do the things it is asked to (although I’m sure there’s some interesting wandering that can happen by applying GPT-3 recursively; a toy sketch below).
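To make the “recursive” idea concrete, here is a toy sketch of feeding the model’s own continuation back in as the next prompt. Nothing here is GPT-3-specific: `complete()` uses GPT-2 via HuggingFace transformers as a stand-in, and the seed prompt is an arbitrary example:

```python
# Toy sketch: let the model wander by repeatedly extending its own output.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def complete(prompt: str, max_new_tokens: int = 40) -> str:
    out = generator(prompt, max_new_tokens=max_new_tokens,
                    do_sample=True, pad_token_id=50256)
    return out[0]["generated_text"]  # includes the prompt plus continuation

prompt = "Some interesting things an AI assistant could try next:\n1."
for step in range(3):
    prompt = complete(prompt)  # each round extends the previous output
    print(f"--- step {step} ---\n{prompt}\n")
```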
One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI. (https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/)
As Schmidhuber put it: “one model to rule them all”. Cross-modal learning ought to be much more efficient and give even more human-like reasoning, e.g. https://arxiv.org/abs/1912.02315. GPT-3 is a text-only self-supervised world-model; being unimodal (so no visual transfer from SimCLR or other recent highly-successful image self-supervision) and not benefiting from any RL loops, it has a lot of weaknesses, but it’s a start.
Between the continued scaling, how scaling/pretraining produces ever more human-like systems in terms of performance/adversarial-examples, cross-modal learning, transfer learning working in RL, self-supervised learning suddenly crushing it, the potential of brain imitation learning, the next decade is very exciting indeed (contra predictions that DL will surely top out any time—real soon now, just you wait and see). One can easily imagine a multi-headed architecture where a multimodal GPT-3-like module, trained by self-supervised learning on large text and image and video datasets (like VideoBERT), feeds into a trunk with modules for ALE, DMLab, Dactyl robot arm etc, doing per-task MuZero-style policy-learning+planning, collecting new experience which is fed back into the self-supervised model, enabling it to do zero-shot tasks in games or robotics or text generation from video or text inputs, learning extremely sample-efficiently (and the more so the more tasks it trains on)...
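As a very rough sketch of that multi-headed idea (PyTorch; every module and dimension here is a hypothetical toy stand-in): a shared self-supervised trunk pools projected text and image features into one representation, and per-task heads emit the policy/value outputs a MuZero-style planner could consume:

```python
# Hedged sketch: shared multimodal trunk + per-task policy/value heads.
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    """Stand-in for a large multimodal self-supervised model (VideoBERT-like)."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.text_proj = nn.LazyLinear(d_model)   # placeholder text encoder
        self.image_proj = nn.LazyLinear(d_model)  # placeholder image encoder
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, text_feats, image_feats):
        tokens = torch.cat([self.text_proj(text_feats),
                            self.image_proj(image_feats)], dim=1)
        return self.mixer(tokens).mean(dim=1)  # pooled shared representation

class TaskHead(nn.Module):
    """Per-task policy/value head (one each for ALE, DMLab, Dactyl, ...)."""
    def __init__(self, d_model: int, n_actions: int):
        super().__init__()
        self.policy = nn.Linear(d_model, n_actions)
        self.value = nn.Linear(d_model, 1)

    def forward(self, h):
        return self.policy(h), self.value(h)

trunk = SharedTrunk()
heads = nn.ModuleDict({"ale": TaskHead(512, 18), "dmlab": TaskHead(512, 15)})
h = trunk(torch.randn(2, 10, 300), torch.randn(2, 49, 2048))  # dummy features
logits, value = heads["ale"](h)
```

New experience collected by each head would then be fed back into the trunk’s self-supervised training, closing the loop described above.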
We are increasingly limited by researchers’ ability to actually write and tweak and integrate these darn things.
The obvious thing to do here is to plug it into a DRL agent, something like learning from instructions or from game manuals: NetHack was recently packaged up as a learning environment, so imagine finetuning GPT-3 on the NetHack wiki and then providing text embeddings from GPT-3 to MuZero or Agent57 etc.
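A hedged sketch of that wiring (GPT-2 via HuggingFace transformers standing in for a NetHack-wiki-finetuned GPT-3; observation and action dimensions are arbitrary placeholders): embed advice text with a frozen LM and condition the policy network on it:

```python
# Sketch: frozen-LM text embeddings concatenated into a DRL policy's input.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2Model.from_pretrained("gpt2").eval()  # frozen text encoder

def embed_text(text: str) -> torch.Tensor:
    ids = torch.tensor([tokenizer.encode(text)])
    with torch.no_grad():
        return lm(ids).last_hidden_state.mean(dim=1)  # [1, 768] mean-pooled

class TextConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int, text_dim: int = 768, n_actions: int = 23):
        super().__init__()  # n_actions is an arbitrary placeholder
        self.net = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions))

    def forward(self, obs, text_emb):
        return self.net(torch.cat([obs, text_emb], dim=-1))  # action logits

policy = TextConditionedPolicy(obs_dim=128)
advice = embed_text("A cockatrice corpse turns you to stone unless you wear gloves.")
logits = policy(torch.randn(1, 128), advice)
```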
Doesn’t seem too hard. Here’s a DeepMind example tweeted about today: https://arxiv.org/abs/2005.09382 (videos).