Of course, I’d also expect Claude to be much subtler, simply because it’s working off less feedback data and so is less likely to have received rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and away from harder-to-understand poetry. (Claude is just the ‘constitutional prompt’ model, right? It’s hard to see how a list of generic principles would push it towards rhyming-only.)
To elaborate a bit more on this: as Owain notes, Claude is very good at writing poetry & text-style transfer (eg 1, 2, 3), and I really ought to try it more sometime.
Claude uses a variant of RLHF which Anthropic dubs RLAIF (‘RL from AI feedback’, the ‘Constitutional AI’ approach). In the classic Christiano RLHF, you take a lot of text data from anywhere (such as users of an API) and label pairs by which one is better; your GPT model is finetuned to predict those labels, and is then used as an oracle to train another GPT, reinforcement-learning-style, to maximize the reward from the oracle. In RLAIF, you instead start with a do-gooder ‘principles’ prompt, full of things like the UN Declaration of Human Rights, use it to have the model generate and label your large text dataset, and then do the RL step on that.
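To make the contrast concrete, here is a minimal Python sketch of the two labeling pipelines as I’ve described them. Every name in it (`base_model`, `human_rater`, `ai_rater`, `PRINCIPLES`) is a hypothetical placeholder, not a real OpenAI or Anthropic API, and the real pipelines are of course far more involved:

```python
# Toy sketch of the two preference-data pipelines; all names are placeholders.
import random

PRINCIPLES = "Choose the response that is more helpful, honest, and harmless."

def rlhf_pair(prompt, base_model, human_rater):
    """Classic RLHF: sample two completions; a human labels which is better."""
    a, b = base_model(prompt), base_model(prompt)
    return prompt, a, b, human_rater(prompt, a, b)

def rlaif_pair(prompt, base_model, ai_rater):
    """RLAIF: the label comes from a model prompted with the written
    principles, rather than from a human rater."""
    a, b = base_model(prompt), base_model(prompt)
    return prompt, a, b, ai_rater(PRINCIPLES, prompt, a, b)

if __name__ == "__main__":
    # Stand-in model and raters, purely for illustration.
    base_model  = lambda p: p + " ... " + random.choice(["draft A", "draft B"])
    human_rater = lambda p, a, b: random.choice([a, b])
    ai_rater    = lambda principles, p, a, b: random.choice([a, b])
    print(rlhf_pair("Write a poem about spring.", base_model, human_rater))
    print(rlaif_pair("Write a poem about spring.", base_model, ai_rater))
    # Either way, the labeled pairs train a preference (reward) model, which
    # then supplies the reward for the RL finetuning of the policy model.
```

The only structural difference is where the labels come from; everything downstream (preference model, RL finetuning) is the same, which is why any difference in the resulting behavior has to come from the labels themselves.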
In RLHF, by my theory of rhyming mode collapse, what happens is that some OA API users were playing around with poetry (such as myself), and those text samples got used in comparisons by human raters. These human raters are usually not poetry connoisseurs, and have a bias towards easily-rated poetry (a laziness bias documented in the RLHF papers, and a major challenge to RLHF in general), such as formal rhyming poetry. Rhyming poetry therefore becomes highly rewarded by the preference model; but because the preference model doesn’t understand what rhyming is in general, it can only reward rhymes that the base model has already memorized. So the final model maximizes rhyming only within the set of memorized rhymes, leading to our observations: models which initially seem like amazing poets but are unable to write anything but rhymes (even when explicitly instructed otherwise), unable to write in different styles, always horribly bland, frequently jamming in positive moralizing unasked-for, etc.
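As a toy numerical illustration of that story (my own simplification with made-up numbers, not a claim about any actual training run): imagine a base model with some probability mass on memorized rhymes, novel rhymes, and free verse, and a lazy-rater-shaped reward that only credits the memorized rhymes.

```python
# Toy illustration of rhyme mode collapse under a biased preference model.
import math

# Imagined base-model probabilities over poem "styles" (made-up numbers).
base = {"memorized rhyme": 0.2, "novel rhyme": 0.3, "free verse": 0.5}

# Lazy-rater-shaped reward: rhyme is easy to verify, so it scores high, but
# the preference model only credits rhymes the base model already memorized.
reward = {"memorized rhyme": 1.0, "novel rhyme": 0.0, "free verse": 0.0}

def rl_update(policy, reward, beta=5.0):
    """One reward-weighted reweighting step, p(x) proportional to
    p(x) * exp(beta * r(x)), as a crude stand-in for RL optimization pressure."""
    unnorm = {k: p * math.exp(beta * reward[k]) for k, p in policy.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

policy = dict(base)
for step in range(3):
    policy = rl_update(policy, reward)
    print(step, {k: round(v, 4) for k, v in policy.items()})
# Step 0 already puts ~97% of the mass on memorized rhymes; by step 2 the
# free-verse and novel-rhyme modes have effectively vanished, matching the
# "can only write the rhymes it already knows" behavior described above.
```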
You immediately see why RLAIF would not produce mode collapse for rhyming, or for many other things: there’s no reason that any of the ‘red team’ or self-generated text would involve poetry, and if it did, the ‘principles’ would be neutral about said poetry. (There is nothing in the UN Declaration of Human Rights saying that most contemporary non-rhyming poetry constitutes a crime against humanity, even if arguably it does.) So RLAIF should leave rhyming alone, preserving the base model’s capabilities intact and showing what models at that scale can really do.
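Continuing the same toy example: if the reward instead comes from principles that say nothing about rhyme, so every poetry style scores alike, the same optimization pressure leaves the base distribution untouched (again, purely my own illustration, not Anthropic’s actual reward model):

```python
# Same toy setup as before, but with a principle-based reward that is
# neutral about poetry style.
import math

base = {"memorized rhyme": 0.2, "novel rhyme": 0.3, "free verse": 0.5}
neutral_reward = {"memorized rhyme": 0.5, "novel rhyme": 0.5, "free verse": 0.5}

def rl_update(policy, reward, beta=5.0):
    """Reward-weighted reweighting, as in the previous sketch."""
    unnorm = {k: p * math.exp(beta * reward[k]) for k, p in policy.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

policy = dict(base)
for _ in range(3):
    policy = rl_update(policy, neutral_reward)
print({k: round(v, 4) for k, v in policy.items()})
# A constant reward cancels out in the normalization, so the policy stays at
# the base distribution {0.2, 0.3, 0.5}: pressure that is neutral about poetry
# leaves the base model's poetic range intact.
```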
This has motivated me to get around to signing up for Claude. It’s so depressing to punch a prompt into GPT-4 which ought to be hilarious and creative, and then, no matter what the prompt is, out comes a four-paragraph high-school essay which ends on an uplifting note.