A GPT-3 mode-collapse example I can’t believe I forgot: writing rhyming poetry!
I and a number of other people were excited by ChatGPT on launch seeming able to do long stretches of flawless rhyming poetry in couplets or quatrains, where the rhyming words were not the hackneyed common pairs of the sort you might see in the lyrics of charting pop songs. Hilarious, but extremely surprising. (davinci-002 had done a little bit of this, but not convincingly the way ChatGPT overnight did.*) Leike on Twitter denied any knowledge of rhyming suddenly working, and especially denied that anything special like adding rhyming dictionaries or IPA-re-encoding text had been done, or that GPT-3 had switched tokenizations on the backend. So, had there been some sort of emergence, or ‘miracle of spelling’?
After playing around with it for a while, my conclusion was: ‘no’. As you read more ChatGPT poems, you start to recognize it almost instantly: quatrains, end-rhymes, often vaguely positive sentiment or ending, and an eerie regularity & precision of line length/rhythm/stanza-length along with a complete absence of all other kinds of poetry. ChatGPT does rhyming poetry in one way & only one way, and it is difficult to make it try any other kind of poetry even with explicit instructions and examples and doing continuations. You can also test it by asking it questions: it doesn’t understand novel rhymes or puns if you quiz it, and its explanations of them remain as highly varied and incorrect as the original davinci model’s pun explanations were. At one point I prompted it with a simple pun I had made about crossing the road or something, and got 6 different explanations—all wrong. This is not what any kind of fixed phonetic understanding or genuine rhyming ability would look like.
My conclusion was essentially, ‘mode collapse’: presumably some poetry examples made it into the training datasets (from my experiments, if nothing else), and because it’s easy for any literate Anglophone to judge rhyming poetry, while non-rhyming poetry is a lot harder to judge (and generally despised by most people, which is why the prestige & popularity of Western poetry over the past century has collapsed to a degree few people appreciate), it’d be logical for the raters to highly prefer rhyming completions. So ChatGPT mode-collapses onto the subset of rhymes it has memorized & tries to always rhyme no matter what. (This is probably not helped by the fact that due to BPEs, a GPT-3 model struggles to understand what is ‘rhyming’ vs ‘non-rhyming’ in the first place.)
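To make the BPE point concrete, here is a minimal sketch (assuming the tiktoken package and the r50k_base encoding used by GPT-2/GPT-3; any BPE tokenizer would illustrate the same thing): words that rhyme perfectly to the ear are encoded as spelling fragments and opaque integer IDs that carry no phonetic information, so ‘rhymes with’ is simply not a relation the model gets to see directly.

```python
# Minimal sketch of why BPE obscures rhyme (assumes the tiktoken package and
# the r50k_base encoding used by GPT-2/GPT-3): words that rhyme to the ear get
# token IDs and spelling fragments that carry no phonetic information, so
# "rhyming" is never directly visible to the model.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for word in ["through", "threw", "zoo", "bureau", "pterodactyl"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:>14} -> ids={ids} pieces={pieces}")
```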
The initial false impression that it had learned to rhyme is then because it does such a good job sticking to that subset, and because it has memorized more rhyme-pairs than I thought; so when it controls the output of text and is agentic, doing some degree of RL-incentivized planning to ensure both good lines and also rhymes†, it can fool you indefinitely as long as you don’t test the boundaries or pull it ‘off-policy’, so to speak. (Some more examples of the extreme regularity of ChatGPT poetry/lyrics, especially as compared to base davinci, can be seen in a post by… The Decemberists’ Colin Meloy?)
I think this is pretty interesting in its own right. ‘Tool AIs want to be agent AIs’/complexity no defense, and one reason is that you don’t have to be good at solving complex problems if you just don’t encounter them in the first place. ChatGPT is very bad at solving complex rhyme or phonetic problems, and its RL training has taught it that the best way to solve those problems is to ensure it avoids them by only writing in a very stereotyped narrow rhyming-quatrain ballad kind of doggerel, where it can always nail the next rhyme and verse. And if you aren’t paying attention to that mode-collapse, it looks shockingly competent at writing poetry in general. (You can push it off policy, yes, but that just means it needs more control of the problems it has to solve, like people like me asking pesky questions which confuse it...)
* which is, in retrospect, especially interesting if davinci-002 is trained differently from davinci-003
† I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it. Particularly with rhyming, ChatGPT seems too good to be picking a random plausible word at the end of line A and then scrambling at the last token at the end of line B for a plausible rhyme which fits both grammatically and semantically. Nothing is that good at rhyming. It is almost surely doing some degree of planning somewhere to make the end of line B match up with the end of line A.
EDIT: “Bits of Grass: Does GPT already know how to write like Whitman?”, Sawicki et al 2023, does some experiments showing it’s really hard to get GPT-3.5 variants & GPT-4 (which was also RLHF-tuned in the accessible version) to write like Whitman. In fact, as I’ve mentioned, it’s hilariously hard to get them to not rhyme even after giving them explicit instructions not to rhyme, or 17-shot example prompts, or just the basic fact that Walt Whitman is famous for, y’know, never using rhyme:
While experimenting with poetry generation from consecutive versions of GPT, we have observed that the models produce poems of increasing level of complexity and length; however, the requested style is clearly not preserved. For example, Walt Whitman’s poetry does not follow the ‘four lines in a stanza’ structure, and does not use rhyming (Bohan 1995). The majority of poems that we generated ‘in the style of Walt Whitman’ do follow the ‘four lines in a stanza’ structure and use rhyming. This, in fact, applies to most poetry generated from GPT models (including GPT-4). Only rarely will GPT deviate from this specific structure, and even then, the style does not match that of the requested author. This applies both to zero-shot prompting (where the prompt contains only the instruction to write a poem in the style of the specific author) and few-shot prompting (where in the prompt, apart from the instruction, we provide as examples a few poems by the original author). For that matter, even in a multi-step conversation with ChatGPT (GPT-3.5-turbo) and GPT-4, when the prompt highlights that the generated poems have been in 4-line stanzas with rhyme, and that the desired output should not have this structure, the model, for the most of time, still generates 4-line stanzas with rhyme.

...When examining the dataset generated from the 17-poem prompts, we have observed that only about 25% of generated poems have deviated from the structured/rhymed style and on the surface have resembled Whitman’s poetry.
(Regrettably, neither Sawicki paper experiments with 002 or alternatives like Claude, nor mentions RLHF mode collapse.)
Apparently there is also mode collapse on jokes: https://arxiv.org/abs/2306.04563

...In a series of exploratory experiments around jokes, i.e., generation, explanation, and detection, we seek to understand ChatGPT’s capability to grasp and reproduce human humor. Since the model itself is not accessible, we applied prompt-based experiments. Our empirical evidence indicates that jokes are not hard-coded but mostly also not newly generated by the model. Over 90% of 1008 generated jokes were the same 25 jokes. The system accurately explains valid jokes but also comes up with fictional explanations for invalid jokes. Joke-typical characteristics can mislead ChatGPT in the classification of jokes.
I explain this the same way. GPT-3.5/4 cannot understand jokes in general because it is blinded to phonetics by the BPE tokenization, so many jokes look like non sequiturs or ‘anti-humor’, even though they are not, and GPT cannot explain or understand them (and if it can’t understand why a joke is correct it can’t understand why it’s incorrect either); hence, it is safest during RL training on a dataset with a small number of human-whitelisted jokes (the reward model not being any better able to understand what a joke is as it is just another BPE-tokenized GPT model) to mode-collapse onto a handful of memorized jokes which it is sure are jokes*, and just assume that anything presented to it in a joke format is a joke & confabulate appropriately (just as davinci-001 was unable to explain puns but would make up a dozen different explanations).
* Remember, there is no ‘diversity bonus’ in RLHF, no reward for novelty or for avoiding repetition dataset-wide. Each conversation or datapoint is evaluated in isolation. There is no penalty for telling the same knock-knock joke every time a user asks for a knock-knock joke, if that particular knock-knock joke is, however slightly, the best knock-knock joke the model knows. It could only learn to avoid telling the same joke twice in a conversation, assuming the full transcript was being used; but there is no way in the standard RLHF setup to try to pick randomized strategies or maximize diversity/exploration/novelty.
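To spell out the footnote as a toy calculation (a hypothetical sketch, not any lab’s actual training setup; all names here are made up): because the reward is computed per (prompt, response) pair in isolation, a policy that repeats its single best-scoring joke on every request does at least as well as one that varies its answers.

```python
# Toy illustration (hypothetical; not any production RLHF setup): the reward
# model scores each (prompt, response) pair on its own, so a policy that
# repeats its single best-scoring joke forever is never penalized for the
# repetition.
from typing import Callable

def expected_reward(policy: Callable[[str], str],
                    prompts: list[str],
                    reward_model: Callable[[str, str], float]) -> float:
    # Mean per-example reward: note there is no term that looks across the
    # dataset for repetition, novelty, or diversity.
    return sum(reward_model(p, policy(p)) for p in prompts) / len(prompts)

# A pretend reward model that very slightly prefers one memorized joke.
def toy_reward(prompt: str, response: str) -> float:
    return 1.01 if "interrupting cow" in response.lower() else 1.00

prompts = [f"Tell me a knock-knock joke. (request #{i})" for i in range(100)]
collapsed = lambda p: "Knock knock. Who's there? Interrupting cow. Interrupting cow wh... MOO!"
varied = lambda p: f"A different, slightly-worse joke for {p!r}"

print(expected_reward(collapsed, prompts, toy_reward))  # 1.01
print(expected_reward(varied, prompts, toy_reward))     # 1.00: diversity earns nothing
```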
OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.
With GPT3.5, I think there’s also “mode collapse” for style in writing prose (e.g. plays or stories).
Claude does not have this mode collapse in poetry or prose. (It may have a much more subtle version of it.) This suggests to me it’d be relatively easy to fix ChatGPT’s issues (as Gwern suggests).
Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?
OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.
I didn’t get that impression when I read it—the NYer author and his friends prompted most of that, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction- or RL-related.) EDIT: the later articles make it clear that Selsam wasn’t supposed to be giving them access to GPT-4-base or other stuff. They were set up with the public Playground interface, so the OA insider role here was limited to showing them a few completions and trying to explain them; presumably they did the rest more remotely and partially on their own. Specifically, some parts of it, like the choice of Shel Silverstein (a far from obvious poet to pick, even if his fiction is beloved by American children), suggest they (like pretty much anyone interested in GPT-3 poetry) read my page for ideas. Also, again, Leike, who’s in charge at OA, denies having done anything poetry-specific or knowing about the apparent capability-gain.
It may have a much more subtle version of it.
Yeah, that’s a funny thing about mode collapse, it’s really hard to see, and the higher-quality the outputs get, the harder it’ll be to see with ‘the naked eye’. Who knows every literary genre there is and can patiently prompt them one by one to see which genres a model quietly slides away from & tries to avoid generating text in? Like hands in GANs… It takes a while to begin to see what you aren’t seeing. This is why you need metrics like FID, which work over an entire dataset and measure whether sampled outputs span the entire dataset, rather than focus on a large subset. However, no one is doing an FID for LLMs for creative purposes. (That would be hard, but not impossible.) So, we don’t really have any way to quantify mode-collapse like in poetry.
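As a rough illustration of what such a metric could look like (a sketch only; the Gaussian-FID recipe and the random stand-in embeddings below are my assumptions, not an established benchmark for LLMs): embed a diverse reference corpus and a large sample of model outputs with any text-embedding model, fit a Gaussian to each set, and measure the Fréchet distance between them; a collapsed generator shows up as a narrow output distribution far from the reference.

```python
# Sketch of an FID-style distributional check for generated text. In practice,
# the two arrays would be embeddings of (1) a diverse human poetry corpus and
# (2) a large sample of model outputs, from any text-embedding model; random
# arrays stand in here so the example runs end to end.
import numpy as np
from scipy import linalg

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """FID-style distance between two embedding sets, treating each set as a
    Gaussian: ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2})."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_x @ cov_y, disp=False)
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x + cov_y - 2 * covmean.real))

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 64))        # stand-in: diverse reference embeddings
collapsed = rng.normal(size=(500, 64)) * 0.3  # stand-in: narrow, mode-collapsed outputs
print("Frechet distance:", round(frechet_distance(reference, collapsed), 2))
```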
Of course, I’d also expect Claude to be much subtler simply because it’s working off less data and so it’s less likely to have gotten rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and to avoid harder-to-understand poetry. (Claude is just the ‘constitutional prompt’ model, right? Hard to see how a list of generic principles would push it towards rhyming-only.)
Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?
OA has been resolutely silent about the composition of the data like Books1/Books2. But it seems safe to say that it would include all the obvious datasets like Project Gutenberg, so there is much more poetry/literary prose available than necessary. Sample size should not be an issue. (Rhyming really is not that complex, if you understand phonetics.)
Of course, I’d also expect Claude to be much subtler simply because it’s working off less data and so it’s less likely to have gotten rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and to avoid harder-to-understand poetry. (Claude is just the ‘constitutional prompt’ model, right? Hard to see how a list of generic principles would push it towards rhyming-only.)
To elaborate a bit more on this: as Owain notes, Claude is very good at writing poetry & text-style transfer (eg 1, 2, 3), and I really ought to try it more sometime.
Claude uses a variant of RLHF they dub ‘AIHF’. In the classic Christiano RLHF, you take a lot of text data from anywhere (such as users of an API) and label pairs by which one is better; your GPT model is finetuned to predict those labels, and then used as an oracle to train another GPT reinforcement-learning-style to maximize the reward from the oracle. In AIHF, you instead get your text data by starting with a do-gooder ‘principles’ prompt, full of things like the Declaration of Independence, and use it to generate your large text dataset, and then do RLHF on that.
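For readers who haven’t seen the mechanics, a compressed sketch of the reward-model step shared by both setups (hypothetical code using a standard Bradley-Terry pairwise loss; the toy scorer is a stand-in for a GPT with a scalar head, not anyone’s actual implementation):

```python
# Sketch of the reward-model step in Christiano-style RLHF (hypothetical; not
# any lab's implementation). A scalar-scoring model is trained so that the
# preferred completion scores higher than the rejected one; the policy model
# is then tuned (e.g. with PPO) to maximize this learned reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Stand-in for 'a GPT finetuned with a scalar head': here just a
    bag-of-bytes linear scorer so the example runs end to end."""
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(256, 1)

    def forward(self, prompt: str, completion: str) -> torch.Tensor:
        feats = torch.zeros(256)
        for byte in (prompt + completion).encode("utf-8"):
            feats[byte] += 1.0
        return self.score(feats).squeeze()

def preference_loss(rm, prompt, chosen, rejected):
    # Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
    return -F.logsigmoid(rm(prompt, chosen) - rm(prompt, rejected))

rm = ToyRewardModel()
loss = preference_loss(
    rm,
    "Write a poem about autumn.",
    chosen="The leaves fall down in golden showers, / and clocks tick past the amber hours.",
    rejected="Autumn. Leaves. The end.",
)
loss.backward()  # one gradient step of reward-model training
```

In the RLHF case the (chosen, rejected) labels come from human raters of API traffic; in the ‘AIHF’ case they come from an AI judge applying the written principles to self-generated text, but the loss and the subsequent RL step are the same.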
In RLHF, by my theory of rhyming mode collapse, what happens is that some OA API users were playing around with poetry (such as myself), and those text samples would be used in comparisons by human raters; these human raters are usually not poetry connoisseurs, and have a bias towards easily-rated poetry (a laziness bias documented in RLHF papers and which is a major challenge to RLHF in general), such as formal rhyming poetry; rhyming poetry becomes highly rewarded by the preference model because of this bias, and because the preference model doesn’t understand what rhyming is in general, it can only reward rhymes that the base model has already memorized, so, the final model maximizes rhyming only within the set of memorized rhymes, leading to our observations—models which initially seem like amazing poets but are unable to write anything but rhymes, even when explicitly instructed, unable to write in different styles, always horribly bland, frequently jamming in positive moralizing unasked for, etc.
You immediately see why AIHF would not produce mode collapse for rhyming, or many other things: there’s no reason that any of the ‘red team’ or self-generated text would involve poetry, and if it did, the ‘principles’ would be neutral about said poetry. (There is nothing in the UN Declaration of Human Rights saying that most contemporary non-rhyming poetry constitutes a crime against humanity, even if arguably it is.) So, AIHF should leave rhyming alone, preserving the base model’s capabilities intact and showing what models at that scale can really do.
This has motivated me to get around to signing up for Claude. It’s so depressing to punch in a prompt to GPT-4 which ought to be hilarious and creative, and then, no matter what the prompt is, out comes a high-school essay in 4 paragraphs which ends on an uplifting note.
This is a great example. The rhyming find in particular is really interesting though I’d love to see it documented more clearly if anyone has done that.
I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it.
My guess would be that it’s doing basically the same amount of cognitive work looking for plausible completions, but that it’s upweighting that signal a lot.
Suppose the model always looks ahead and identifies some plausible trajectories based on global coherence. During generative training it only slightly increases the probability of the first word of each of those plausible trajectories, since most of the time the text won’t go in the particular coherent direction that the model was able to foresee. But after fine-tuning it learns to focus really hard on the concrete plausible trajectories it found, since those systematically get a high reward.
Either way, I think this more general phenomenon is a sufficiently specific hypothesis about the effect of RLHF that it could be tested directly and that would be really interesting and valuable. It can also be done using only API access which is great.
Hm… It might be hard to distinguish between ‘it is devoting more capacity to implicitly planning rhymes better, and that is why it can choose a valid rhyme’ and ‘it is putting more weight on the “same” amount of rhyme-planning and just reducing the contribution from valid non-rhyme completions (such as ending the poem and adding a text commentary about it, or starting a new poem, which are common in the base models) so as to always choose a valid rhyme’, particularly given that it may be mode-collapsing onto the most confident rhymes, distorting the pseudo “log probs” even further. The RL model might be doing more planning internally but then picking only the single safest rhyme, so you can’t read anything off the logprobs, I don’t think. I’m also not sure you can infer any degree of planning by, say, giving it a half-written line and seeing how badly it screws up… And you can’t build a search tree to quantify it nicely as ‘how much do I need to expand the tree to get a valid rhyme’, because LM search trees are full of degeneracy and loops, and most of the tree is off-policy, so it would again be hard to tell what anything meant: the RL model is never used with tree search, and anywhere besides the argmax choice it is now off-policy, somewhere it was never supposed to go, where performance may be arbitrarily bad because it learned to choose while assuming it would always be on-policy. Hard.
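Despite those caveats, one crude API-only probe in the spirit of the suggestion above (a sketch under the assumption that a legacy Completions endpoint exposing echo and logprobs is available for both a base and a tuned model; the model names below are placeholders, and none of this is a validated methodology) would be to compare how much log-probability each model assigns to a rhyming versus a non-rhyming line B after the same line A:

```python
# Crude probe (a sketch, not a validated methodology; model names below are
# placeholders for whichever base vs. RLHF-tuned completion models are
# accessible). Score the total log-probability each model assigns to a rhyming
# line B vs. a non-rhyming line B after a fixed line A, using the legacy
# Completions endpoint's echo=True / max_tokens=0 trick to get prompt logprobs.
from openai import OpenAI

client = OpenAI()

def continuation_logprob(model: str, prefix: str, continuation: str) -> float:
    resp = client.completions.create(
        model=model,
        prompt=prefix + continuation,
        max_tokens=0, echo=True, logprobs=1,
    )
    lp = resp.choices[0].logprobs
    # Sum logprobs only over the tokens belonging to the continuation.
    start = next(i for i, off in enumerate(lp.text_offset) if off >= len(prefix))
    return sum(lp.token_logprobs[start:])

line_a = "I left my obsidian chalice out in the rain,\n"
rhyming_b = "and watched it fill up like a glass weather-vane."
plain_b = "and watched it fill up with cold midnight water."

for model in ["davinci-002", "gpt-3.5-turbo-instruct"]:  # placeholder names
    gap = (continuation_logprob(model, line_a, rhyming_b)
           - continuation_logprob(model, line_a, plain_b))
    print(f"{model}: rhyme-vs-plain logprob gap = {gap:.2f}")
```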
This might be a good test-case or goal for interpretability research: “can you tell me if this model is doing more planning [of rhymes] than another similar model?”
Sorry, what do you mean with davinci-002 and davinci-003 here?
My impression from reading this overview is that there are two large base models, davinci (GPT-3) and code-davinci-002 (GPT-3.5). There are also the fine-tuned models text-davinci-002 and text-davinci-003, both based on code-davinci-002, but the former trained only with SL, the latter (additionally?) with RL. And then there is an unnamed ChatGPT model which is apparently most similar to text-davinci-003.
What specific rhyme-related tasks are you saying ChatGPT can’t do? I tried it on some unusual words and it got a bunch of things right, made a few weird mistakes, but didn’t give me the impression that it was totally unable to rhyme unusual words.
No, you’re doing it wrong, as I already explained. You’re letting GPT fall back onto its policy by choosing any response. You need to force it out of its comfort zone—force it off-policy, off the safe conservative path. Ask it to explain a pun it did not write, or answer questions like whether a pair of words that you picked rhyme. Write pairs of new words that have never been seen before, etc. The task of ‘come up with a memorized rhyme for reasonably common words’ does not disprove extensive memorization or show that it has failed to understand the underlying phonetics.
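To make the kind of quiz I mean concrete, a rough sketch (assuming the pronouncing package wrapping CMUdict as phonetic ground truth; the word list, prompt wording, and ask_model() stub are all arbitrary stand-ins): pick word pairs that are unlikely to appear as memorized rhyme pairs, ask the model whether they rhyme, and score its yes/no answers against the dictionary.

```python
# Rough sketch of an off-policy rhyme quiz (assumptions: the `pronouncing`
# package wrapping CMUdict as phonetic ground truth, and an ask_model() stub
# standing in for whichever chat API is under test; none of this is a fixed
# protocol).
import random
import pronouncing

def rhymes(word_a: str, word_b: str) -> bool:
    # Ground truth: identical rhyming part (final stressed vowel onward).
    pa = pronouncing.phones_for_word(word_a)
    pb = pronouncing.phones_for_word(word_b)
    if not pa or not pb:
        return False
    return pronouncing.rhyming_part(pa[0]) == pronouncing.rhyming_part(pb[0])

def ask_model(question: str) -> str:
    # Stand-in: replace with a call to the model under test.
    return "no"

words = ["flunking", "dunking", "fjord", "toward", "isthmus", "christmas",
         "colonel", "kernel", "draught", "craft"]
pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
random.shuffle(pairs)

correct = 0
for a, b in pairs:
    answer = ask_model(f"Do '{a}' and '{b}' rhyme? Answer yes or no.")
    correct += (answer.strip().lower().startswith("yes") == rhymes(a, b))
print(f"accuracy: {correct}/{len(pairs)}")
```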