The examples here mostly seem to demonstrate what I observed way back in mid-2020, and have observed since, about the effects of RLHF (or other forms of preference learning/instruction-tuning): GPT-3-base is actually quite funny and good at satire, but due to BPEs, it is fundamentally unable to understand puns or anything related to phonetics, and because it is unable to understand what it doesn’t understand, when forced to make puns or ‘dad jokes’, it will either deploy a safely-memorized one or confabulate a random choice (because, unable to see the phonetic logic, these jokes all look like anti-jokes - ‘as far as I can see from the BPE-encoded text, the joke is that there is no joke, just a random unrelated punchline, and so I will predict a random punchline as well’). RLHF then badly damages its satire and other kinds of opinionated humor, while exacerbating the conservatism & memorization. (See also: rhyming poetry.)
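To make the BPE point concrete, here is a minimal sketch (assuming the tiktoken library and its GPT-2 encoding) that prints the token IDs for a few homophone/pun pairs: whatever token overlap shows up is an accident of spelling, and the pronunciation a pun turns on is never represented anywhere in what the model sees.

```python
import tiktoken

# GPT-2 BPE encoding; other BPE vocabularies behave the same way for this purpose.
enc = tiktoken.get_encoding("gpt2")

# Word pairs whose relationship is purely phonetic.
pairs = [
    ("knight", "night"),    # identical pronunciation, different spelling
    ("lettuce", "let us"),  # classic knock-knock pun pivot
    ("ate", "eight"),       # homophones with disjoint spellings
]

for a, b in pairs:
    ta, tb = enc.encode(a), enc.encode(b)
    print(f"{a!r:10} -> {ta}")
    print(f"{b!r:10} -> {tb}")
    # Any shared IDs reflect shared letters, not shared sound.
    print(f"  shared token IDs: {set(ta) & set(tb) or 'none'}\n")
```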
This mostly describes your sloths. ‘Sloths’ are a bad test case because there are so many jokes about them already (I would not expect that topic to elicit novel jokes from pretty much any LLM...), but you can see that where the responses do not look memorized, the best ones avoid phonetics. For example, the best one is probably:
Surreal: A sloth wakes up in the middle of the rainforest and screams, “Oh no! I’m late for my 9:02 nap!” He frantically scurries… well, shuffles… as fast as he can, finally arriving at his hammock at 9:03. Panting, he whispers, “Just made it.”
Which is almost good! And also completely avoids anything whatsoever to do with phonetics or puns.
If you want to pursue this topic any further, you should probably do a lot more reading about RLHF and the pathologies of GPT-3.5/4 models and how they differ from base models, and put a lot more effort into benchmarking base models instead of anything RLHF/instruction-tuned. Benchmarking GPT-3.5/4/Bard/LLaMA-70b, as you do, is a waste of everyone’s time, unless you are comparing them to proper base models (with decent prompts and best-of sampling), so that the comparison at least demonstrates the damage from the safety measures.
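For what ‘best-of sampling’ from a base model might look like in practice, here is a minimal sketch using Hugging Face transformers; the model name, prompt, sampling settings, and the mean-log-probability ranking are all illustrative assumptions (a crude automatic stand-in for picking the best of N by hand):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: swap in whichever *base* (non-RLHF) model you are benchmarking.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Here is an original, surreal one-liner about a sloth:\n"
inputs = tok(prompt, return_tensors="pt")

# Best-of-N: sample N candidate completions from the base model...
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
        max_new_tokens=60,
        num_return_sequences=8,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tok.eos_token_id,
    )

# ...then rank them by mean token log-probability, a rough quality proxy;
# at this scale, reading all N candidates and choosing by eye also works.
scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
mean_logprob = scores.mean(dim=1)
best = mean_logprob.argmax().item()

prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out.sequences[best][prompt_len:], skip_special_tokens=True))
```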
It is unfortunate that the study of creative writing with LLMs right now is actually ‘the study of how RLHF fails at creative writing’, but it is presented as something else entirely… The world definitely does not need more people running around saying they have ‘benchmarked large language models on X’ when what they have actually benchmarked is RLHF models and they just don’t realize it, because they grabbed the easiest off-the-shelf models and assumed that was the best possible performance (as opposed to a lower bound, and a rather loose one at that).
You could try Claude-2, but its creative writing seems to be degrading over time as they continue to train it, and the Chatbot Arena Elo ratings for the Claude models get worse over time, consistent with that. I haven’t had a chance to check it (there seems to be a waitlist), but Weaver claims their finetuned LLaMA-base “Weaver-Ultra” model is way better at creative writing than GPT-4.