I think this is a very interesting discussion, and I enjoyed your exposition. However, the piece fails to engage with the technical details or existing literature, to its detriment.
Take your first example, “Tricking GPT-3”. GPT is not: give someone a piece of paper and ask them to finish it. GPT is: you sit behind one-way glass watching a man at a typewriter. After every key he presses, you are given a chance to press a key on an identical typewriter of your own. If the typewriter-man’s next press does not match your prediction, you get an electric shock. You always predict every keystroke, even before he starts typing.
In this situation, would a human really do better? They might well begin a “proper continuation” after rule 3, only to receive a nasty shock when the typist continues “4. ”. Surely by rule 11, a rule 12 is one’s best guess? And recall that GPT in its auto-regressive generation mode experiences text in exactly the same way as when simply predicting; there is no difference in its operation, only in how we interpret that operation. So after 12 should come 13, 14… There are several other issues with the prompt, but this is the most egregious.
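To make the “no difference in operation” point concrete, here is a minimal sketch of greedy autoregressive decoding. The `next_token_logits` function is a hypothetical stand-in of mine for a single forward pass of the model (random here, purely so the sketch runs); generation is literally the prediction step in a loop, with the model’s own guess appended to the context.

```python
import numpy as np

VOCAB_SIZE = 50257  # GPT-2/3 BPE vocabulary size
rng = np.random.default_rng(0)

def next_token_logits(tokens: list[int]) -> np.ndarray:
    """Hypothetical stand-in for one forward pass of the model: given the
    context so far, return one score per vocabulary item. (Random here,
    only to keep the sketch runnable.)"""
    return rng.standard_normal(VOCAB_SIZE)

def predict(tokens: list[int]) -> int:
    """Prediction mode: the model's single best guess at the next token."""
    return int(np.argmax(next_token_logits(tokens)))

def generate(prompt_tokens: list[int], n_steps: int) -> list[int]:
    """Generation mode: the same prediction call in a loop, with each guess
    appended to the context. The model's operation is identical; only our
    interpretation of the output changes."""
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        tokens.append(predict(tokens))
    return tokens
```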
As for Winograd, the problem of surface associations mimicking deeper understanding is well known. All testing today is done on WinoGrande, which is strongly debiased and even adversarially mined (see in particular figure 1 on page 4). GPT-3’s 0-shot score (70%) is well below human level (94%) but also well above chance (50%). For comparison, BERT (340 million parameters) scores 50.2% 0-shot.
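For concreteness, 0-shot scoring on WinoGrande is typically done by substituting each candidate into the blank and keeping whichever completed sentence the model assigns higher likelihood; chance is therefore 50%. The sketch below illustrates this, with `sentence_logprob` as a hypothetical scoring call standing in for a real model, and a classic Winograd-style item paraphrased as the example.

```python
def sentence_logprob(text: str) -> float:
    """Hypothetical: total log-probability the model assigns to `text`,
    i.e. a sum of per-token log-probs from one forward pass."""
    raise NotImplementedError

def answer_winogrande(sentence_with_blank: str, option1: str, option2: str) -> str:
    """Fill the blank ('_') with each option and keep whichever completed
    sentence the model finds more likely."""
    candidates = [option1, option2]
    scores = [sentence_logprob(sentence_with_blank.replace("_", c)) for c in candidates]
    return candidates[scores.index(max(scores))]

# Example item (paraphrased in the WinoGrande style):
# answer_winogrande("The trophy didn't fit in the suitcase because _ was too big.",
#                   "the trophy", "the suitcase")
```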
There are also cases, like multiplication, where GPT-3 unequivocally extracts a deeper “world model”, demonstrating that a pure language model is at least capable of doing so.
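As a sketch of how that kind of claim gets checked empirically (not the original experiment; `complete` here is a hypothetical prompt-completion call of mine), one compares the model’s answers on random digit pairs, which are unlikely to appear verbatim in training data, against the true products:

```python
import random

def complete(prompt: str) -> str:
    """Hypothetical prompt-completion call to the language model."""
    raise NotImplementedError

def multiplication_accuracy(n_trials: int = 100, digits: int = 2) -> float:
    """Ask for products of random numbers; memorised surface associations
    alone should not generalise to pairs the model has never seen."""
    random.seed(0)
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        answer = complete(f"Q: What is {a} times {b}?\nA:")
        # In practice one would parse the completion more carefully.
        correct += answer.strip().startswith(str(a * b))
    return correct / n_trials
```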
Of course, all of this is likely to be moot! Since GPT-3’s release, a primary focus of research has been multimodality, which provides just the sort of grounding you desire. It’s very difficult to argue that CLIP, for instance, doesn’t know what an avocado looks like, or that these multimodal agents from DeepMind aren’t grounded as they follow natural-language instructions (video; the top text is the received instruction).
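To illustrate what “knowing what an avocado looks like” means operationally, here is a hedged sketch of zero-shot classification with the openai/CLIP package; the image path and label set are placeholders of mine, not from the original discussion.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path and label set, for illustration only.
image = preprocess(Image.open("avocado.jpg")).unsqueeze(0).to(device)
labels = ["an avocado", "an armchair", "a dog", "a typewriter"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each caption, softmaxed.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```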
In all, I find the grounding literature interesting, but I remain unconvinced that it puts any limits on the capabilities of even the simplest unimodal, unagentic models (unlike, say, the causality literature).
The multiplication example is good, and I should have thought about it and worked it into the post.