People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an “aesthetic quality” scoring model, and then training a generative image model to have “high aesthetic quality score” as a text label. Yes, doing things like this can produce effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness — but if trained well such models’ idea of aesthetic quality is at least pretty close to most human judgements. Presumably what can be done for images can also be done for prose, poetry, or fiction.
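For concreteness, the image-side recipe is commonly something like a small regression head on frozen image embeddings (e.g. CLIP), fit to averaged human ratings. A minimal sketch, with all names illustrative rather than taken from any particular production system:

```python
# Minimal sketch of an "aesthetic quality" scorer: a small regression head on
# frozen image embeddings, trained to predict the mean human rating.
# Illustrative only; not any specific production system.
import torch
import torch.nn as nn

class AestheticScorer(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # predicted mean rating
        )

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        return self.head(image_embeddings).squeeze(-1)

def train_step(model, optimizer, embeddings, mean_human_ratings):
    # Regressing against the *averaged* rating is exactly the step that
    # discards individual differences.
    loss = nn.functional.mse_loss(model(embeddings), mean_human_ratings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```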
There isn’t a direct equivalent of that approach for an LLM, but RLHF comes fairly close. So far people have primarily used RLHF for “how good is the answer to my question?” Adapting a similar approach to “how high quality is the poetry/prose/fiction produced by the model?” is obviously feasible. Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
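A rough sketch of what that reward model could look like, assuming the standard pairwise-preference (Bradley-Terry) setup on top of some text encoder; the encoder, shapes, and names are placeholders rather than a specific system:

```python
# Sketch of an RLHF-style reward model for prose quality: a scalar head on a
# text encoder, trained on pairwise human preferences (Bradley-Terry loss).
import torch
import torch.nn as nn

class ProseRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder            # any LM backbone returning pooled states
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.encoder(token_ids)  # assumed shape: (batch, hidden_dim)
        return self.score_head(pooled).squeeze(-1)

def preference_loss(model, preferred_ids, rejected_ids):
    # A rater judged `preferred` to be better prose than `rejected`; note that
    # nothing here records *which* rater, only the aggregate preference signal.
    margin = model(preferred_ids) - model(rejected_ids)
    return -nn.functional.logsigmoid(margin).mean()
```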
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth.
The RLHF approach only trains a single aesthetic, and probably shouldn’t be taken too far or optimized too hard: while there is some widespread agreement about what prose is good vs. dreadful, finer details of taste vary, and should do so. So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number of documents, each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
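A minimal sketch of how such conditioning documents might be assembled (field names and formatting are hypothetical):

```python
# Build one training document: a prompt-like header of reviews and objective
# stats, followed by the literary work itself. Field names are hypothetical.
def make_training_document(work_text: str, reviews: list[dict],
                           total_sales: int, awards: list[str]) -> str:
    header = [f"Review ({r['source']}): {r['text']}" for r in reviews]
    header.append(f"Total sales: {total_sales}")
    header.append(f"Awards: {', '.join(awards) if awards else 'none'}")
    return "\n".join(header) + "\n\n" + work_text

doc = make_training_document(
    work_text="It was the best of times, it was the worst of times...",
    reviews=[{"source": "newspaper critic", "text": "A sweeping, sentimental epic."}],
    total_sales=200_000_000,
    awards=[],
)
```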
These ideas have been phrased as model-post-training suggestions, but turning these into a benchmark is also feasible: the “aesthetic quality scoring model” from the RLHF approach is itself a benchmark, and the “prompt containing reviews and statistics → literary work” approach could also be inverted to instead train a reviewer model to review literary works from various different aesthetic viewpoints, and estimate their likely sales/critical reception.
but if trained well such models’ idea of aesthetic quality is at least pretty close to most human judgements
That does not follow. Preference learning involves almost no learning of preferences. A suit cut to fit all may wind up fitting none—particularly for high-dimensional things under heavy optimization, like, say, esthetics, where you want to apply a lot of selection pressure to get samples which are easily 1-in-10,000 or rarer, and so ‘the tails come apart’.
(How much variance is explained by individual differences in preference-learning settings like comparing image generators? A great question! And you’ll find that hardly anyone has any idea. As it happens, I asked the developer of a major new image generator this exact question last night, and not only did he have no idea, it looked like it had never even occurred to him to wonder what the performance ceiling without personalization could be, or to what extent all of the expensive ratings they were paying for reflected individual rater preferences rather than some ‘objective’ quality, or whether they were even properly preserving such metadata rather than, as it seems many tuning datasets do, throwing it out as ‘unnecessary’. EDIT: likewise for the higher-up of an LLM company I asked the next night.)
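As for what “variance explained by individual differences” would even mean here: a back-of-the-envelope estimate is possible whenever the per-rater metadata has actually been kept, by splitting rating variance into a shared per-item component (the most any single “quality” score could capture) and a residual of rater disagreement. A crude ANOVA-style sketch, not a proper mixed model:

```python
# Crude variance decomposition: how much of the rating variance is the item
# itself (shared "quality") vs. rater disagreement (individual taste + noise)?
import numpy as np

def shared_quality_fraction(ratings: np.ndarray) -> float:
    """ratings: (n_items, n_raters) array, every rater scores every item."""
    between_item = ratings.mean(axis=1).var()   # variance of per-item means
    within_item = ratings.var(axis=1).mean()    # average disagreement per item
    return between_item / (between_item + within_item)

# Toy illustration: a weak shared signal swamped by idiosyncratic taste.
rng = np.random.default_rng(0)
quality = rng.normal(size=(100, 1))             # shared per-item signal, variance 1
taste = rng.normal(scale=2.0, size=(100, 20))   # per-rater signal, variance 4
print(shared_quality_fraction(quality + taste)) # roughly 0.2: a low ceiling for
                                                # any one-size-fits-all scorer
```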
but if trained well such models’ idea of aesthetic quality is at least pretty close to most human judgements... Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
No. This is fundamentally wrong and what is already being done and what I am criticizing. There is no single ‘taste’ or ‘quality’. Individual differences are real.{{citation needed}} People like different things, and have different preferences.{{citation needed}} No change in the ‘cross-section’ changes that (unless you reduce the ‘people’ down to 1 person, the current user). All you are doing is again optimizing for the lowest common denominator. Changing the denominator population doesn’t change that.
Seriously, imagine applying this logic anywhere else, like food! “There is 1 objective measure of food quality. The ideal food is a McDonald’s Big Mac. You may not like it, but this is what peak food performance is. The Science Has Spoken.”
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth... So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number of documents, each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
Conditioning won’t change the mode collapse, except insofar as you are smuggling individuals in by the back door, e.g. by developing an implicit model of individual reviewers’ preferences.* (In which case, far better to just condition on all individuals...)
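A sketch of what “condition on all individuals” could look like at the reward-model level: feed a rater (or reader) identity in alongside the text, so the model predicts that person’s judgement rather than a population average. Architecture details are purely illustrative, and it assumes rater IDs were preserved in the preference data:

```python
# Rater-conditioned reward model: score(text, rater) instead of score(text).
import torch
import torch.nn as nn

class PersonalizedRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, n_raters: int):
        super().__init__()
        self.encoder = encoder                 # text backbone with pooled output
        self.rater_embed = nn.Embedding(n_raters, hidden_dim)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor, rater_ids: torch.Tensor) -> torch.Tensor:
        text = self.encoder(token_ids)         # assumed: (batch, hidden_dim)
        rater = self.rater_embed(rater_ids)    # (batch, hidden_dim)
        # Simple multiplicative interaction; the point is only that the score
        # depends on *whose* taste is being predicted.
        return self.score_head(text * rater).squeeze(-1)
```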
and generally optimizing such things too hard leads to sameness… The RLHF approach only trains a single aesthetic, and probably shouldn’t be taken too far or optimized too hard
Well, yes, that’s the problem. It has been taken too far and optimized too hard for a single quality score, and that’s where we are now already. How do we provide better benchmarks where optimizing harder won’t just worsen the problem?
* A funny implication is that individuals known by name to AI models, like myself, could actually be getting superior results because of it! The pretraining grants them a model of that individual’s preferences, through truesight if nothing else, and they simply generalize the post-training maximization (since we know they can surprisingly generalize unrelated training data). So, it could be that I get better results from LLMs (but not image generators) because there’s enough text to deduce that ‘gwern’ is asking, and tailor the results to what they’ve learned from all my writings on Gwern.net. You could also try to manufacture this deliberately: start labeling images/text explicitly on a webpage somewhere, prefixing “$NAME likes/dislikes this: $DATA”, and then including “$NAME” in prompts.
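A toy sketch of that deliberate version, using the suggested prefix with a made-up name; the point is just that the attribution survives into the training text and can then be invoked at inference time:

```python
# Publish explicitly attributed preference labels, then invoke the name later.
def label_line(name: str, likes: bool, data: str) -> str:
    verb = "likes" if likes else "dislikes"
    return f"{name} {verb} this: {data}"

# Posted somewhere crawlable (name and preferences are made up):
corpus = "\n".join([
    label_line("Alice Example", True, "short declarative sentences"),
    label_line("Alice Example", False, "adjective-heavy purple prose"),
])

# Later, at inference time, include the name in the prompt:
prompt = "Alice Example is asking: write the opening paragraph of a short story."
```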