but if trained well such models’ idea of aesthetic quality is at least pretty close to most human judgements
That does not follow. Preference learning involves almost no learning of preferences. A suit cut to fit all may wind up fitting none—particularly for high-dimensional things under heavy optimization, like, say, esthetics, where you want to apply a lot of selection pressure to get samples which are easily 1-in-10,000 or rarer, and so ‘the tails come apart’.
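To make ‘the tails come apart’ concrete, here is a minimal simulation (the 0.7 rater–consensus correlation and the sample counts are illustrative assumptions, not measured values): even when every rater’s taste correlates strongly with the averaged score, the sample that wins under heavy selection on the average is almost never any individual rater’s actual favorite.

```python
# 'The tails come apart': heavy selection on an averaged score picks a
# sample that is strong-but-not-optimal for each individual rater, even
# though every rater correlates well with the average.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_raters, r = 100_000, 20, 0.7   # ~1-in-100,000 selection pressure

shared = rng.standard_normal(n_samples)                  # latent 'consensus' quality
noise = rng.standard_normal((n_raters, n_samples))
raters = r * shared + np.sqrt(1 - r**2) * noise          # each rater correlates ~0.7 with it

avg = raters.mean(axis=0)
winner = avg.argmax()                                    # the sample heavy selection picks

top1pct = np.quantile(raters, 0.99, axis=1)              # each rater's personal top-1% cutoff
print("raters with the winner in their personal top 1%:",
      int((raters[:, winner] >= top1pct).sum()), "of", n_raters)
print("raters whose personal #1 is the winner:",
      int((raters.argmax(axis=1) == winner).sum()), "of", n_raters)
```

At mild selection pressure the averaged score and any individual’s score agree fine; it is exactly the 1-in-10,000+ regime where they diverge.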
(How much variance is explained by individual differences in preference-learning settings like comparing image generators? A great question! And you’ll find that hardly anyone has any idea. As it happens, I asked the developer of a major new image generator this exact question last night, and not only did he have no idea, it looked like it had never even occurred to him to wonder what the performance ceiling without personalization could be, or to what extent all of the expensive ratings they were paying for reflected individual rater preferences rather than some ‘objective’ quality, or whether they were even properly preserving such metadata rather than, as many tuning datasets seem to do, throwing it out as ‘unnecessary’.)
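Even the crude version of this number would take minutes to compute if the rater IDs were kept. A sketch, assuming a hypothetical ratings.csv with item_id/rater_id/rating columns (the additive decomposition is only approximately right, and only for reasonably balanced data):

```python
# Back-of-the-envelope variance decomposition for a preference dataset:
# how much rating variance is the item (shared 'quality'), how much is the
# rater, and how much is residual? Column names are hypothetical.
import pandas as pd

df = pd.read_csv("ratings.csv")  # hypothetical columns: item_id, rater_id, rating

grand = df["rating"].mean()
item_means = df.groupby("item_id")["rating"].transform("mean")
rater_means = df.groupby("rater_id")["rating"].transform("mean")
residual = df["rating"] - item_means - rater_means + grand

total = df["rating"].var()
print("item (shared 'quality'):   ", item_means.var() / total)
print("rater (individual effect): ", rater_means.var() / total)
print("residual (interaction+noise):", residual.var() / total)
```

Note that the rater main effect only captures calibration (harsh vs generous raters); genuinely divergent preferences hide in the rater×item interaction, which this crude split lumps into the residual, so it gives at best a lower bound on how much individuality matters.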
but if trained well such models’ idea of aesthetic quality is at least pretty close to most human judgements....Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
No. This is fundamentally wrong; it is what is already being done, and what I am criticizing. There is no single ‘taste’ or ‘quality’. Individual differences are real.{{citation needed}} People have different preferences.{{citation needed}} No change in the ‘cross-section’ changes that (unless you reduce the ‘people’ down to 1 person: the current user). All you are doing is again optimizing for the lowest common denominator. Changing the denominator population doesn’t change that.
Seriously, imagine applying this logic anywhere else, like food: poll a representative cross-section of diners with good taste, average their ratings, and serve everyone the single top-scoring dish; you get a bland crowd-pleaser, not anyone’s favorite meal!
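Here is the toy version of that food example (the two taste clusters are an illustrative assumption; real preference heterogeneity is messier, but the logic is the same):

```python
# 'Lowest common denominator': two taste clusters (spicy vs mild), and one
# output chosen to maximize the *average* rating. The averaged optimum lands
# between the clusters and is mediocre for everyone.
import numpy as np

rng = np.random.default_rng(0)
ideal = np.concatenate([rng.normal(-2, 0.3, 500),   # cluster A's ideal points
                        rng.normal(+2, 0.3, 500)])  # cluster B's ideal points

def rating(x, ideal):                # each person rates by distance from their ideal
    return -(x - ideal) ** 2

xs = np.linspace(-4, 4, 801)
avg_scores = np.array([rating(x, ideal).mean() for x in xs])
x_star = xs[avg_scores.argmax()]     # the single 'consensus-optimal' output

print("consensus-optimal output:", round(x_star, 2))   # ~0.0: between the clusters
print("mean rating of consensus output:", rating(x_star, ideal).mean().round(2))
print("mean rating if personalized:    ", rating(ideal, ideal).mean().round(2))
```

Widen or re-weight the population however you like: as long as the clusters disagree, the single averaged-optimal output sits in the valley between them.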
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth...So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number documents each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
Conditioning won’t change the mode collapse, except insofar as you are smuggling in individuals through the backdoor by developing an implicit model of individual reviewers’ preferences. (In which case, far better to just condition on all individuals...)
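A minimal sketch of what ‘condition on all individuals’ could look like: a reward model that takes a rater ID alongside the text features, so individual taste is a first-class input rather than noise to be averaged away (the architecture and dimensions are my assumptions for illustration, not anyone’s actual system):

```python
# A rater-conditioned reward model: predicts how *this rater* would score
# the text, instead of collapsing all raters into one scalar 'quality'.
import torch
import torch.nn as nn

class PersonalizedRewardModel(nn.Module):
    def __init__(self, text_dim=768, n_raters=10_000, rater_dim=32):
        super().__init__()
        self.rater_emb = nn.Embedding(n_raters, rater_dim)  # one taste vector per rater
        self.head = nn.Sequential(
            nn.Linear(text_dim + rater_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, text_features, rater_ids):
        # text_features: (batch, text_dim) from any frozen text encoder
        z = torch.cat([text_features, self.rater_emb(rater_ids)], dim=-1)
        return self.head(z).squeeze(-1)  # predicted rating by this particular rater

model = PersonalizedRewardModel()
scores = model(torch.randn(4, 768), torch.tensor([3, 3, 17, 17]))
print(scores.shape)  # torch.Size([4]): same texts, different raters, different scores
```

At deployment you condition on the current user’s embedding (fit from a handful of their ratings) instead of optimizing against one averaged score.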
and generally optimizing such things too hard leads to sameness... The RLHF approach only trains a single aesthetic, and probably shouldn’t be taken too far or optimized too hard
Well, yes, that’s the problem. It has been taken too far and optimized too hard for a single quality score, and that’s where we are now already. How do we provide better benchmarks where optimizing harder won’t just worsen the problem?