I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block “improvements” with such side effects.
In case of Dall-E the history was something like this: Dall-E 1/2: No real style, generations did look presumably like an average prediction from the training sample, e.g. like a result from Google Images. Dall-E 2.5: Mostly good aesthetics, e.g. portraits tend to have dramatic lighting and contrast. Dall-E 3: Very tacky aesthetics, probably not intentional but a side effect from something else.
So I guess the bad style would be in general a mostly solvable problem (or one which could be weighed against other metrics), if the responsible people are even aware there is a problem. Which they might not be, given that they probably have a background in CS rather than art.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block “improvements” with such side effects.
Also probably a lot of it is just mode collapse from simple preference learning optimization. Each of your comparisons shows a daring, risky choice which a rater might not prefer, vs a very bland, neutral, obvious, colorful output. A lot of the image generations gains are illusory, and caused simply by a mode-collapse down onto a few well-rated points:
Our experiments suggest that realism and consistency can both be improved simultaneously; however, there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing the representation diversity.
Same problem as tuning LLMs. It’s a sugar-rush, like spending Mickey Mouse bucks at Disney World: it gives you the illusion of progress and feels like it’s free, but in reality you’ve paid for every ‘gain’.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block “improvements” with such side effects.
In case of Dall-E the history was something like this: Dall-E 1/2: No real style, generations did look presumably like an average prediction from the training sample, e.g. like a result from Google Images. Dall-E 2.5: Mostly good aesthetics, e.g. portraits tend to have dramatic lighting and contrast. Dall-E 3: Very tacky aesthetics, probably not intentional but a side effect from something else.
So I guess the bad style would be in general a mostly solvable problem (or one which could be weighed against other metrics), if the responsible people are even aware there is a problem. Which they might not be, given that they probably have a background in CS rather than art.
Also probably a lot of it is just mode collapse from simple preference learning optimization. Each of your comparisons shows a daring, risky choice which a rater might not prefer, vs a very bland, neutral, obvious, colorful output. A lot of the image generations gains are illusory, and caused simply by a mode-collapse down onto a few well-rated points:
Same problem as tuning LLMs. It’s a sugar-rush, like spending Mickey Mouse bucks at Disney World: it gives you the illusion of progress and feels like it’s free, but in reality you’ve paid for every ‘gain’.