...In a series of exploratory experiments around jokes—generation, explanation, and detection—we seek to understand ChatGPT’s capability to grasp and reproduce human humor. Since the model itself is not accessible, we applied prompt-based experiments. Our empirical evidence indicates that jokes are not hard-coded but mostly also not newly generated by the model: over 90% of 1,008 generated jokes were the same 25 jokes. The system accurately explains valid jokes but also comes up with fictional explanations for invalid jokes. Joke-typical characteristics can mislead ChatGPT in the classification of jokes.
I explain this the same way. GPT-3.5/4 cannot understand jokes in general because it is blinded to phonetics by BPE tokenization, so many jokes look like non sequiturs or ‘anti-humor’ even though they are not, and GPT cannot explain or understand them (and if it can’t understand why a joke is correct, it can’t understand why one is incorrect either). Hence, during RL training on a dataset with a small number of human-whitelisted jokes, it is safest to mode-collapse onto a handful of memorized jokes which it is sure are jokes* (the reward model being no better able to understand what a joke is, as it is just another BPE-tokenized GPT model), and to assume that anything presented to it in a joke format is a joke & confabulate appropriately (just as davinci-001 was unable to explain puns but would make up a dozen different explanations).
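To see concretely how subword tokenization can hide phonetics, here is a minimal sketch with a toy greedy longest-match tokenizer and a made-up vocabulary (not OpenAI’s actual BPE merges): two homophones end up sharing no tokens at all, so the sound-identity a pun depends on simply does not exist at the level the model sees.

```python
# Toy illustration (hypothetical vocabulary, NOT OpenAI's real BPE):
# a greedy longest-match subword tokenizer. The homophones "flour"
# and "flower" segment into disjoint token sequences, so their
# phonetic identity is invisible to a model reading only tokens.

TOY_VOCAB = {"flo", "ur", "flower", "f", "l", "o", "u", "r", "w", "e"}

def tokenize(word, vocab):
    """Greedy longest-match segmentation into subword tokens."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return tokens

print(tokenize("flour", TOY_VOCAB))   # ['flo', 'ur']
print(tokenize("flower", TOY_VOCAB))  # ['flower']
```

A pun trading on flour/flower looks, token-wise, like an arbitrary non sequitur: nothing in `['flo', 'ur']` overlaps with `['flower']`.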
* Remember, there is no ‘diversity bonus’ in RLHF: no reward for novelty or for avoiding repetition dataset-wide. Each conversation or datapoint is evaluated in isolation. There is no penalty for telling the same knock-knock joke every time a user asks for a knock-knock joke, if that particular knock-knock joke is, however slightly, the best knock-knock joke the model knows. It could at most learn to avoid telling the same joke twice within a conversation, assuming the full transcript was being used; but there is no way in the standard RLHF setup to pick randomized strategies or maximize diversity/exploration/novelty.
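The per-episode reward structure can be sketched in a few lines (toy scores, hypothetical reward model): because each response is scored in isolation, the reward-maximizing policy for “tell me a joke” is to return the single highest-scoring joke every time, and repetition across the dataset costs nothing.

```python
# Minimal sketch (toy numbers, hypothetical reward model) of why
# per-episode RLHF reward induces mode collapse: each response is
# scored alone, so the optimal deterministic policy repeats the one
# joke the reward model scores highest -- dataset-wide repetition
# is never penalized.

JOKE_REWARDS = {             # reward model's score per candidate joke
    "knock-knock A": 0.90,
    "knock-knock B": 0.89,   # nearly as good, but never chosen
    "pun C":         0.70,
}

def rlhf_policy(prompt, rewards):
    """Greedy policy: emit the response the reward model scores highest."""
    return max(rewards, key=rewards.get)

# Ten independent conversations all collapse onto the same joke:
responses = [rlhf_policy("tell me a joke", JOKE_REWARDS) for _ in range(10)]
print(set(responses))  # {'knock-knock A'}
```

Adding any dataset-wide diversity term would require scoring batches of conversations jointly, which the standard one-episode-at-a-time RLHF objective does not do.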
Apparently there is also mode collapse on jokes: https://arxiv.org/abs/2306.04563