I don’t believe rhymes are an example of a failure to plan. They are a clearcut case of BPE problems.
They follow the same patterns as other BPE problems: they work on the most common (memorized) instances and degrade rapidly with rarity; the relevant information cannot be correctly represented by BPEs; the tasks are inherently simple, yet GPT-3 performs really badly despite human-like performance on almost identical tasks (like non-rhyming poetry or non-pun-based humor); and they have improved minimally over GPT-2. With rhymes, it’s even more clearly not a planning problem because Peter Vessenes (I think) set up a demo problem on the Slack where the task was merely to select the rhyming word for a target word out of a prespecified list of possible rhymes; in line with the BPE explanation, GPT-3 could correctly select short common rhyme pairs, and then fell apart as soon as you used rarer words. Similarly, I found little gain from prespecifying rhymes. The problem is not that GPT-3 can’t plan good rhymes; the problem is that GPT-3 doesn’t know what words rhyme, period.
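To make the tokenization point concrete, here is a minimal sketch, assuming the Hugging Face transformers package and the GPT-2 BPE vocabulary (which GPT-3 reuses): rhyme pairs map to token sequences that share no phonetic structure at all.

    # Inspect how the GPT-2/GPT-3 BPE vocabulary splits words: common words become single
    # opaque tokens, rarer words shatter into pieces, and nothing encodes pronunciation.
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    pairs = [("cat", "hat"), ("day", "weigh"), ("enclosure", "composure")]
    for a, b in pairs:
        # The leading space matters: mid-sentence words carry a 'Ġ' space marker.
        print(a, tok.tokenize(" " + a), "vs", b, tok.tokenize(" " + b))
    # Whether two words rhyme is simply not represented at the level GPT-3 sees.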
As far as planning goes, next-token prediction is entirely consistent with implicit planning. During each forward pass, GPT-3 probably has plenty of slack computation going on as tokens will differ widely in their difficulty while GPT-3’s feedforward remains a fixed-size computation; just as GPT-3 is always asking itself what sort of writer wrote the current text, so it can better imitate the language, style, format, structure, knowledge limitations or preferences* and even typos, it can ask what the human author is planning, the better to predict the next token. That it may be operating on its own past completions and there is no actual human author is irrelevant—because pretending really well to be an author who is planning equals being an author who is planning! (Watching how far GPT-3 can push this ‘as if’ imitation process is why I’ve begun thinking about mesa-optimizers and what ‘sufficiently advanced imitation’ may mean in terms of malevolent sub-agents created by the meta-learning outer agent.)
* Matt Brockman on the API Slack enjoys experimenting with prompts like “The AI hates cheese and will never talk about cheese” and seeing if he can get the “AI” in the dialogue to talk about cheese anyway.
Funny thing about BPEs: GPT-3 has to know about the individual letters, because I taught it how to spell both real words and nonsense words. (Prompts in bold, including two where I edited GPT-3’s responses.)
The students were ready to test their spelling.
The teacher stood at the front of the class. “Adam, please spell PARTY.”
Adam replied, “Party. P-A-R-T-Y.”
The teacher said, “Correct. Beatrice, please spell THROUGH.”
Beatrice replied, “Through. T-H-O-R-O-U-G-H.”
The teacher said, “Incorrect. Through is spelled T-H-R-O-U-G-H. Carlos, please spell SPELL.”
Carlos replied, “Spell. S-P-E-L-L.”
The teacher said, “Correct. Daphne, please spell NUCLEAR.”
Daphne replied, “Nuclear. N-U-C-L-E-A-R.”
The teacher said, “Correct. Adam, please spell INFINITE.”
Adam replied, “Infinite. I-N-F-I-N-A-T-E.”
The teacher replied, “Incorrect. Infinite is spelled I-N-F-I-N-I-T-E. Beatrice, please spell BALLOON.”
Beatrice replied, “Balloon. B-A-L-L-O-O-N.”
The teacher replied, “Correct. Carlos, please spell ENCLOSURE.”
Carlos replied, “Enclosure. I-N-C-L-O-S-U-R-E.”
The teacher replied, “Incorrect. Enclosure is spelled E-N-C-L-O-S-U-R-E. Daphne, please spell ELECTRON.”
Daphne replied, “Electron. E-L-E-C-T-R-O-N.”
The teacher replied, “Correct. Adam, please spell CAT.”
Adam replied, “Cat. C-A-T.”
The teacher said, “Correct. Carlos, please spell MACHINE.”
Carlos replied, “Machine. M-A-C.”
The teacher said, “Incorrect. Machine is spelled M-A-C-H-I-N-E. Beatrice, please spell SCIENTIST.”
Beatrice replied, “Scientist. S-C-I-E-N-T-I-S-T.”
The teacher said, “Correct. Daphne, please spell ASTRONOMER.
Daphne replied, “Astronomer. A-S-T-R-O-N-O-M-E-R.”
The teacher said, “Correct. Adam, please spell UNIVERSE.
Adam replied, “Universe. U-N-I-V-E-R-S-E.”
The teacher said, “Correct. Carlos, please spell FLARBLE.
Carlos replied, “Flarble. F-L-A-R-B-L-E.
----------------
You’ve done much more advanced stuff, of course.
Sure. It’s seen plenty of individual letters (letters have their own BPEs as fallbacks if longer BPEs don’t capture them, AFAIK). Stuff like my acrostics demonstration relies on the fact that GPT-3 has knowledge of letters and can, with some difficulty, manipulate them for various tasks.
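The letter-level fallback is easy to see with the same tokenizer sketch as above: a nonsense word like FLARBLE still tokenizes, just into smaller pieces, and a hyphen-separated spelling surfaces the letters as (mostly) individual tokens.

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    print(tok.tokenize(" FLARBLE"))        # no whole-word token exists, so the BPE falls
                                           # back to shorter sub-word / byte-level pieces
    print(tok.tokenize(" F-L-A-R-B-L-E"))  # spelled out, the letters show up as (mostly)
                                           # separate tokens the model can work with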
(Reply to gwern’s comment but not only addressing gwern.)
Concerning the planning question:
I agree that next-token prediction is consistent with some sort of implicit planning of multiple tokens ahead, though I would phrase it a bit differently. Also, “implicit” is doing a lot of work here.
(Please someone correct me if I say something obviously wrong or silly; I do not know how GPT-3 works, but I will try to say something about how it works after reading some sources [1].)
The bigger point about planning, though, is that the GPTs are getting feedback on one word at a time in isolation. It’s hard for them to learn not to paint themselves into a corner.
To recap what I have thus far gathered from [1]: GPT-3-like transformers are trained by a regimen where the loss function evaluates the prediction error of the next word in the sequence, given the previous words. However, I am less sure one can say they do this in isolation. During training (by SGD, I figure?), the transformer decoder layers have (i) access to the previous words in the sequence, and (ii) both the attention and feedforward parts of each transformer layer have weights (which are being trained) to compute the output predictions. Also, (iii) the GPT transformer architecture considers all words in each training sequence, left to right, masking the future. And this is done for many meaningful Common Crawl sequences, though the exact same sequences won’t repeat.
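A minimal PyTorch-style sketch of that training signal, with model as a hypothetical stand-in for a GPT-like causal transformer: every position in a training sequence is scored in the same pass, and each prediction is conditioned on the full left context rather than on a single previous word in isolation.

    import torch.nn.functional as F

    def next_token_loss(model, tokens):
        # tokens: (batch, seq_len) integer ids from one training sequence
        logits = model(tokens)               # (batch, seq_len, vocab); causal masking inside
                                             # the model hides all future positions
        pred = logits[:, :-1, :]             # the prediction at position i sees tokens 1..i
        target = tokens[:, 1:]               # ...and is scored against token i+1
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))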
So it sounds a bit trivial that GPT’s trained weights allow “implicit planning”: if, given a sequence of words w_1 to w_(i-1), GPT would output word w for position i, this is because a trained GPT model (loosely speaking, abstracting away many details I don’t understand) “dynamically encodes” many plausible “word paths” to word w, and [w_1 … w_(i-1)] is such a path; by iteration, it also encodes many word paths from w to other words w’, where some words are likelier to follow w than others. The representations in the stack of attention and feedforward layers allow it to generate text much better than, e.g., a good old Markov chain. “Self-attending” to some higher-level representation that allows it to generate text in a particular prose style seems a lot like a kind of plan. And GPT generating text that is then fed back to it as input, which it can again selectively “attend to”, seems like a kind of working memory, which will trigger the self-attention mechanism to take certain paths, and so on.
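The working-memory reading corresponds to the ordinary sampling loop: everything the model has emitted so far is appended to the context and attended to on the next pass. A toy sketch, with model and sample as hypothetical helpers:

    def generate(model, tokens, n_steps, sample):
        # tokens: a list of token ids; model(tokens) returns one logit vector per position
        for _ in range(n_steps):
            logits = model(tokens)                   # self-attention ranges over the whole
                                                     # context, including the model's own output
            tokens = tokens + [sample(logits[-1])]   # append the sampled token and repeat
        return tokens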
I also want to highlight oceainthemiddleofanisland’s comment in another thread: breaking complicated generation tasks into smaller chunks (getting GPT to output intermediate text from an initial input, which is then fed back to GPT to reprocess, finally enabling it to produce the desired output) sounds quite compatible with this view.
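A sketch of that chunked workflow, assuming a hypothetical complete(prompt) helper that returns a single GPT completion: each stage’s output is pasted back into the next prompt, so intermediate text the model produced becomes context it can reprocess.

    def solve_in_stages(initial_input, stage_instructions, complete):
        context = initial_input
        for instruction in stage_instructions:
            # Ask for an intermediate result, then carry it forward into the next
            # stage's prompt instead of demanding the final answer in one shot.
            intermediate = complete(instruction + "\n\n" + context)
            context = context + "\n\n" + intermediate
        return context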
(On this note, I am not sure what to think of the role of the human in the loop here, or, more generally, of the fact that it apparently takes non-trivial work to find a “working” prompt that gets GPT to produce the desired results for some particularly difficult tasks. Is it that there are useful, rich world models “in there somewhere” in GPT’s weights, but it is difficult to activate them? And are these difficulties because humans are bad at prompting GPT to generate text that accesses the good models, or because GPT’s overall model is not always so impressive, since it easily slides into building answers on gibberish models instead of the good ones, or maybe because GPT has a bad internal model of the humans attempting to use it? Gwern’s example concerning bear attacks was interesting here.)
This would be “implicit planning”. Is it “planning” enough? In any case, the discussion would be easier if we had a clearer definition of what would constitute planning and what would not.
Finally, a specific response to gwern’s comment.
During each forward pass, GPT-3 probably has plenty of slack computation going on as tokens will differ widely in their difficulty while GPT-3’s feedforward remains a fixed-size computation; just as GPT-3 is always asking itself what sort of writer wrote the current text, so it can better imitate the language, style, format, structure, knowledge limitations or preferences* and even typos, it can ask what the human author is planning, the better to predict the next token. That it may be operating on its own past completions and there is no actual human author is irrelevant—because pretending really well to be an author who is planning equals being an author who is planning! (Watching how far GPT-3 can push this ‘as if’ imitation process is why I’ve begun thinking about mesa-optimizers and what ‘sufficiently advanced imitation’ may mean in terms of malevolent sub-agents created by the meta-learning outer agent.)
Talking about how GPT-3 is “pretending” or “asking itself what a human author would do” can perhaps be justified as metaphor, but I think it is a bit fuzzy and may obscure the differences between what transformers do when we say they “plan” or “pretend” and what people would assume of beings who “plan” or “pretend”. For example, a word like “pretend” easily carries the implication that there is some true, hidden, unpretended thinking or personality going on underneath. That appears quite unlikely given a fixed model and a generation mechanism that starts anew from each seed prompt. I would rather say that GPT has a model (is a model?) that is surprisingly good at natural-language extrapolation, and also that it is surprising what can be achieved by extrapolation.
[1] http://jalammar.github.io/illustrated-gpt2/ , http://peterbloem.nl/blog/transformers , and https://amaarora.github.io/2020/02/18/annotatedGPT2.html , in addition to skimming the original OpenAI papers.