Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there’s some evidence it’s not doing lookahead—its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what’s the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn’t just memorize moves?)
Thinking about this more, I think that since planning depends on causal modeling, I’d expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I’ll edit accordingly. Thanks!
Yes! I was thinking about this yesterday, it occurred to me that GPT-3′s difficulty with rhyming consistently might not just be a byte-pair problem, any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you’re just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (here I’m calling them A, B and C): performance should correlate to (1) the allowable margin of error per token-group (coding syntax is harsh, solving math equations is harsh, trying to come up with a rhyme for ‘orange’ after you’ve written it is harsh), and (2) the extent to which each token-group depends on future token-groups. Human poets and writers always go through several iterations, but we’re asking it to do what we do in just one pass.
So in playing around with GPT-3 (AID), I’ve found two (three?) meta approaches for dealing with this issue. I’ll call them Strategies A, B and C.
A is the more general one. You just give it multiple drafting opportunities and/or break up the problem into multiple smaller steps. So far I’ve seen it work for:
(1) Boolean logic, algebraic equations, simple math equations works (guess-and-check). When I have time in a few days, I’m going to get it to mimic the human heuristic for calculating approximate square-roots over multiple iterations.
(2) Translating Chinese poems to English roughly and then touching them up in the second draft. Same with editing any kind of text.
(3) Tricky coding problems (specifically, transforming a string into Pig Latin). First, instead of asking it to “solve the problem”, you ask it to “come up with five possible strategies for solving the problem”, and then “select the most plausible one”. Then you say “you made several structural, syntactical, and interpretive mistakes”, allow it to come up with a long list of those possible mistakes, say, “now try again”, and do that as many times as the context window allows. The end result isn’t always functional, but it’s a lot better than asking it to solve something in one pass.
B is the moderately less general, and more obvious second approach, which synergises well with the first approach. B is forcing GPT-3 to plan explicitly.
(1) In writing an article, you get GPT-3 to start by writing a vague summary, then a more in-depth summary, then listing the key points and subpoints in order. By periodically forcing it to summarise its discussion up to a given point, you can exceed the window length while retaining coherency.
(2) In writing poetry from a prompt, you get GPT-3 to discuss and tease out the implications of the prompt and describe the process of planning the poetry first.
(3) In translating, you get it to list out the key potential translation errors that could be made, and the different choices a translator could make in translating each line.
(4) In writing code, you get GPT-3 to simulate several people discussing the problem requirements and arguing constructively with one another (simulating just one person means if that one person goes off track or misinterprets the problem, future continuations are poisoned with the error since they need to be consistent), then producing English pseudo-code that describes the process in abstract, and only then the actual code.
I decided to add ‘simulating multiple people’ as a Strategy C, but it’s kind of the same thing as Strategy A but in a way that allows more room for error. The issue is that in most single-author texts, people try to be consistent with what they’ve said before, but in GPT-3, this can cause minor errors (for instance, self-contradiction) to accumulate over time, which reduces generation quality. But we’ve seen that something as simple as adding dialogue between two people, allows GPT-3 to arrive at accurate and more complex solutions much more reliably. This works for a broad spectrum of media: articles, poetry, translation, and coding. All you need to do is create a ‘critic’ who interrupts after each line or paragraph, and then if you really need one, a critic who criticises the first critic. The key here is constructive rather than destructive criticism, since GPT-3 is perfectly capable of producing vacuous and petty critiques.
All three of these strategies together tend to vastly improve performance on tasks where (1) the allowable margin of error per token-group is quite small (for instance, solving 83x42), and (2) current token-groups depends on future token-groups. I have not tested this for rhyming, but it seems simple enough to check.
In other words, GPT-3 does better at solving problems when you get it to simulate the way humans solve problems: with multiple attempts, with explicit planning, and by collaborating with other humans.
Edit: my attempts at making GPT-3 rhyme failed. Here is what I tried, and what I figured out.
(1) It has a vague idea of rhyming—if you fill its context-window with groups of words that rhyme, about 40-60% of the words in its next generation will rhyme, and the rest will look like rhymes (as in, they end with the same couple of letters but are pronounced differently in English—e.g dough, cough, rough, etc.).
(1a) Most rhyming websites are query-based. From what I could tell, GPT-3 has not memorised the layout of the most common rhyming websites to the degree where it could reproduce the formatting consistently. This is not surprising given that Common Crawl abides by nofollow and robots.txt policies, and that OpenAI may have filtered these pages out when they were paring the dataset down to ‘high-quality’ documents.
(1b) GPT-3 knows how most Chinese words are pronounced, even if it gets the tone wrong sometimes. It rhymes more consistently in languages with uncommon diacritic markings, more with languages that don’t use Latin characters, and even more consistently in non-Latin-based languages with phonemic orthography, but not by much. With Russian, you hit the jackpot—BPE represents it as individual characters, it’s mostly phonemic, there’s a lot of Russian in GPT-3′s dataset, and a lot of rhyming poetry—but it still does poorly. This either suggests that an absence of looking forward + randomness introduced by sampling is the main issue here. Unfortunately the other most-well-represented languages in its dataset with non-Latin phonemic orthography (Japanese kana, Korean hangul, Arabic script) each have their own issues—rhyming the last syllable of each line in Korean is easy since it’s an SOV language and all you have to do is match the verb conjugation, so it doesn’t have much literary value. Most of the rhyming in the dataset would likely be modern rap, which sometimes uses multiple syllables. Arabic omits short vowels. Japanese I know less about, but iirc rhyming is much less common than other forms of constrained writing (e.g haiku) that emphasise rhythm, and mostly occurs in j-pop.
(2) Giving it multiple attempts failed. ‘Multiple generations for each line + selecting the ones that rhyme’ works, but we already know that.
(3) Supplying rhymes kind of worked. It would do well for a handful lines and then go off track. Giving it multiple possible choices was very bad. It would include the words randomly within lines, or near the end of lines, and sometimes at the very end. This might be rectified by more examples, since AID is limited to 1000 tokens/characters. But I do suspect the issue is a more fundamental one.
(4) Splitting words into syllables failed, but I didn’t try this one exhaustively. The only benefit of word-splitting occurs when the beginning of the word matters (e.g alliteration), because it allows for ‘denser’ computation per token (on the character/syllable level, not the word level). Plus, we’re talking about the English language. Even actual English speakers regularly have trouble with knowing how words are pronounced, orthography kind of hinders rather than helps in this case.
(5) ‘Reminding’ it of the end word between each line failed.
(6) Forcing it to generate in IPA first did not work. However, it does have a vague idea of how to transliterate English into IPA and a better idea of how to transliterate IPA into English.
(7) Future attempts: my prompting was very abstract, and we know that GPT-3 works better when there’s a familiar context surrounding the task / the prompt is within the training distribution. I will try the context of an English writing assignment.
That’s not a nitpick at all!
Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there’s some evidence it’s not doing lookahead—its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what’s the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn’t just memorize moves?)
Thinking about this more, I think that since planning depends on causal modeling, I’d expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I’ll edit accordingly. Thanks!
Yes! I was thinking about this yesterday, it occurred to me that GPT-3′s difficulty with rhyming consistently might not just be a byte-pair problem, any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you’re just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (here I’m calling them A, B and C): performance should correlate to (1) the allowable margin of error per token-group (coding syntax is harsh, solving math equations is harsh, trying to come up with a rhyme for ‘orange’ after you’ve written it is harsh), and (2) the extent to which each token-group depends on future token-groups. Human poets and writers always go through several iterations, but we’re asking it to do what we do in just one pass.
So in playing around with GPT-3 (AID), I’ve found two (three?) meta approaches for dealing with this issue. I’ll call them Strategies A, B and C.
A is the more general one. You just give it multiple drafting opportunities and/or break up the problem into multiple smaller steps. So far I’ve seen it work for:
(1) Boolean logic, algebraic equations, simple math equations works (guess-and-check). When I have time in a few days, I’m going to get it to mimic the human heuristic for calculating approximate square-roots over multiple iterations.
(2) Translating Chinese poems to English roughly and then touching them up in the second draft. Same with editing any kind of text.
(3) Tricky coding problems (specifically, transforming a string into Pig Latin). First, instead of asking it to “solve the problem”, you ask it to “come up with five possible strategies for solving the problem”, and then “select the most plausible one”. Then you say “you made several structural, syntactical, and interpretive mistakes”, allow it to come up with a long list of those possible mistakes, say, “now try again”, and do that as many times as the context window allows. The end result isn’t always functional, but it’s a lot better than asking it to solve something in one pass.
B is the moderately less general, and more obvious second approach, which synergises well with the first approach. B is forcing GPT-3 to plan explicitly.
(1) In writing an article, you get GPT-3 to start by writing a vague summary, then a more in-depth summary, then listing the key points and subpoints in order. By periodically forcing it to summarise its discussion up to a given point, you can exceed the window length while retaining coherency.
(2) In writing poetry from a prompt, you get GPT-3 to discuss and tease out the implications of the prompt and describe the process of planning the poetry first.
(3) In translating, you get it to list out the key potential translation errors that could be made, and the different choices a translator could make in translating each line.
(4) In writing code, you get GPT-3 to simulate several people discussing the problem requirements and arguing constructively with one another (simulating just one person means if that one person goes off track or misinterprets the problem, future continuations are poisoned with the error since they need to be consistent), then producing English pseudo-code that describes the process in abstract, and only then the actual code.
I decided to add ‘simulating multiple people’ as a Strategy C, but it’s kind of the same thing as Strategy A but in a way that allows more room for error. The issue is that in most single-author texts, people try to be consistent with what they’ve said before, but in GPT-3, this can cause minor errors (for instance, self-contradiction) to accumulate over time, which reduces generation quality. But we’ve seen that something as simple as adding dialogue between two people, allows GPT-3 to arrive at accurate and more complex solutions much more reliably. This works for a broad spectrum of media: articles, poetry, translation, and coding. All you need to do is create a ‘critic’ who interrupts after each line or paragraph, and then if you really need one, a critic who criticises the first critic. The key here is constructive rather than destructive criticism, since GPT-3 is perfectly capable of producing vacuous and petty critiques.
All three of these strategies together tend to vastly improve performance on tasks where (1) the allowable margin of error per token-group is quite small (for instance, solving 83x42), and (2) current token-groups depends on future token-groups. I have not tested this for rhyming, but it seems simple enough to check.
In other words, GPT-3 does better at solving problems when you get it to simulate the way humans solve problems: with multiple attempts, with explicit planning, and by collaborating with other humans.
Edit: my attempts at making GPT-3 rhyme failed. Here is what I tried, and what I figured out.
(1) It has a vague idea of rhyming—if you fill its context-window with groups of words that rhyme, about 40-60% of the words in its next generation will rhyme, and the rest will look like rhymes (as in, they end with the same couple of letters but are pronounced differently in English—e.g dough, cough, rough, etc.).
(1a) Most rhyming websites are query-based. From what I could tell, GPT-3 has not memorised the layout of the most common rhyming websites to the degree where it could reproduce the formatting consistently. This is not surprising given that Common Crawl abides by nofollow and robots.txt policies, and that OpenAI may have filtered these pages out when they were paring the dataset down to ‘high-quality’ documents.
(1b) GPT-3 knows how most Chinese words are pronounced, even if it gets the tone wrong sometimes. It rhymes more consistently in languages with uncommon diacritic markings, more with languages that don’t use Latin characters, and even more consistently in non-Latin-based languages with phonemic orthography, but not by much. With Russian, you hit the jackpot—BPE represents it as individual characters, it’s mostly phonemic, there’s a lot of Russian in GPT-3′s dataset, and a lot of rhyming poetry—but it still does poorly. This either suggests that an absence of looking forward + randomness introduced by sampling is the main issue here. Unfortunately the other most-well-represented languages in its dataset with non-Latin phonemic orthography (Japanese kana, Korean hangul, Arabic script) each have their own issues—rhyming the last syllable of each line in Korean is easy since it’s an SOV language and all you have to do is match the verb conjugation, so it doesn’t have much literary value. Most of the rhyming in the dataset would likely be modern rap, which sometimes uses multiple syllables. Arabic omits short vowels. Japanese I know less about, but iirc rhyming is much less common than other forms of constrained writing (e.g haiku) that emphasise rhythm, and mostly occurs in j-pop.
(2) Giving it multiple attempts failed. ‘Multiple generations for each line + selecting the ones that rhyme’ works, but we already know that.
(3) Supplying rhymes kind of worked. It would do well for a handful lines and then go off track. Giving it multiple possible choices was very bad. It would include the words randomly within lines, or near the end of lines, and sometimes at the very end. This might be rectified by more examples, since AID is limited to 1000 tokens/characters. But I do suspect the issue is a more fundamental one.
(4) Splitting words into syllables failed, but I didn’t try this one exhaustively. The only benefit of word-splitting occurs when the beginning of the word matters (e.g alliteration), because it allows for ‘denser’ computation per token (on the character/syllable level, not the word level). Plus, we’re talking about the English language. Even actual English speakers regularly have trouble with knowing how words are pronounced, orthography kind of hinders rather than helps in this case.
(5) ‘Reminding’ it of the end word between each line failed.
(6) Forcing it to generate in IPA first did not work. However, it does have a vague idea of how to transliterate English into IPA and a better idea of how to transliterate IPA into English.
(7) Future attempts: my prompting was very abstract, and we know that GPT-3 works better when there’s a familiar context surrounding the task / the prompt is within the training distribution. I will try the context of an English writing assignment.