Yeah. Regarding Dall-E 3 specifically: Few people know that there was an unnamed model between Dall-E 2 and 3: “Dall-E 2.5”, as I like to call it. It was initially used in Microsoft’s Bing Image Creator. It often produced surprisingly good aesthetics, especially with very short or unspecific prompts.
Then it got replaced with Dall-E 3, which produces substantially fewer visual errors (e.g. too many limbs) and is much better at following complex prompts, but its style is also way worse than 2.5. Like apparently most other text-to-image models, Dall-E 3 largely produces tacky, aesthetically worthless kitsch. RIP Dall-E 2.5, which is now completely unavailable. :(
A few comparisons using old images I did a while ago:
Somewhat relatedly: when I started this post I planned to argue you should use Midjourney instead of Dall-E, but then when I went to test it I found that Midjourney had become more generic in some way that was hard to place. I think it’s still better than default Dall-E, but not a slam dunk.
I decided to try out cubefox’s prompts on current Midjourney to give a sense of it:
It actually did pretty well. I think my previous “hrm, it doesn’t seem as good” was when I was specifically trying to get it to generate images for LessWrong posts, where it seemed to be defaulting to very generic landscapes (or generic women’s faces). I’ll try a round of those on recent curated posts.
I tried again with a recent curated post, just putting in the title. (Historically, in practice, I usually start with the post title and see if that outputs anything interesting, and then start exploring other ideas based on my understanding of the post and what visuals I think would be good. But this is pretty time consuming, so seeing what the “one shot” result is is useful.)
Here’s “Liability regimes for AI, watercolor painting”, from the latest Midjourney (v6):
For contrast, here’s what Midjourney v4 resulted in:
(I kinda like the Lady Justice option)
Midjourney v3
v2
Midjourney v3 feels like it hit some kind of sweet spot of “relatively high quality” but also “a bit more weird/quirky/dreamlike.”
I went and tried to do Midjourney v3, this time putting in more “quality” words since it isn’t automatically trained on aesthetics as hard:
“Liability regimes for AI”, beautiful aquarelle painting by Thomas Schaller, high res:
Okay, now I guess I’ll just wander back up the chain of versions, this time using my Metis on what kinds of prompts will get better results rather than seeing how it handles the dumbest case:
“Liability regimes for AI”, beautiful aquarelle painting by Thomas Schaller, high res (Midjourney v4):
“Liability regimes for AI”, beautiful aquarelle painting by Thomas Schaller, high res (Midjourney v6):
I preferred your v4 outputs for both prompts. They seem substantially more evocative of the subject matter than v6 while looking substantially better than v3, IMO.
(This was a pretty abstract prompt, though, which I imagine poses difficulty?)
Oh, so current Midjourney is actually far better than Dall-E 3 in terms of aesthetics. One thing I still liked about the old Dall-E 2.5 was its relatively strong tendency towards photorealism. Because arguably “a cat reading a book” describes an image of an actual cat reading a book, rather than an image of an illustration of a cat reading a book, or some mixture of photo and Pixar caricature, as in the case of Dall-E 3. Though this could probably be adjusted by adding “a photo of”.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block “improvements” with such side effects.
In the case of Dall-E, the history was something like this:
Dall-E 1/2: No real style; generations presumably looked like an average prediction from the training sample, e.g. like a result from Google Images.
Dall-E 2.5: Mostly good aesthetics, e.g. portraits tend to have dramatic lighting and contrast.
Dall-E 3: Very tacky aesthetics, probably not intentional but a side effect of something else.
So I guess the bad style would be in general a mostly solvable problem (or one which could be weighed against other metrics), if the responsible people are even aware there is a problem. Which they might not be, given that they probably have a background in CS rather than art.
“I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don’t have a person who knows/cares about art enough to block ‘improvements’ with such side effects.”
Also, probably a lot of it is just mode collapse from simple preference-learning optimization. Each of your comparisons shows a daring, risky choice which a rater might not prefer, vs a very bland, neutral, obvious, colorful output. A lot of the image-generation gains are illusory, and caused simply by a mode collapse down onto a few well-rated points:
“Our experiments suggest that realism and consistency can both be improved simultaneously; however, there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing the representation diversity.”
Same problem as tuning LLMs. It’s a sugar-rush, like spending Mickey Mouse bucks at Disney World: it gives you the illusion of progress and feels like it’s free, but in reality you’ve paid for every ‘gain’.
“I found that Midjourney had become more generic in some way that was hard to place.”
What you can try doing is enabling the personalization (or use mine), to drag it away from the generic MJ look, and then experimenting with the chaos sampling option to find something more interesting you can then work with & vary.
I’m told (by the ‘simple parameters’ section of this guide, which I have not had the opportunity to test but which to my layperson’s eye seems promisingly mechanistic in approach) that adjusting the stylize parameter to numbers lower than its default 100 turns down the Midjourney house style effect (at the cost of sometimes tending to make things more collage-y and incoherent as values get lower), and that increasing the weird parameter above its default 0 will effectively push things to be unlike the default style (more or less).
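For anyone who wants to try the parameter suggestions above systematically, here is a minimal sketch that just builds prompt strings to paste into Midjourney. The helper names (build_prompt, stylize_sweep) are made up for illustration, and nothing here talks to Midjourney itself. It assumes the usual flag spellings (--v, --stylize, --weird, --chaos); accepted value ranges and which versions support which flag are not checked, so treat the numbers as placeholders.

```python
def build_prompt(text, *, version=None, stylize=None, weird=None, chaos=None):
    """Return a prompt string with any requested parameter flags appended.

    Hypothetical helper: it only formats text and does no validation of
    value ranges or of which Midjourney version supports which flag.
    """
    parts = [text.strip()]
    flags = (
        ("--v", version),
        ("--stylize", stylize),
        ("--weird", weird),
        ("--chaos", chaos),
    )
    for flag, value in flags:
        if value is not None:  # leave unset parameters at Midjourney's defaults
            parts.append(f"{flag} {value}")
    return " ".join(parts)


def stylize_sweep(text, values=(0, 50, 100), **fixed):
    """Build one prompt per stylize value, holding the other flags fixed."""
    return [build_prompt(text, stylize=v, **fixed) for v in values]


if __name__ == "__main__":
    base = "Liability regimes for AI, watercolor painting"
    # Lower stylize (below the default 100) plus some weird (above the default 0).
    print(build_prompt(base, stylize=50, weird=250))
    # Same prompt at several stylize values, to compare outputs side by side.
    for prompt in stylize_sweep(base, version=6):
        print(prompt)
```

Running the same base prompt through a small sweep like this makes it easier to see how much of the "generic" look is coming from the house style versus the prompt itself.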
I am not an artist.
It’s sad that open source models like Flux have a lot of potential for customized workflows and finetuning but few people use them.
We’ve talked (a little) about integrating Flux more into LW, to make it easier to make good images (maybe with a soft nudge towards using “LessWrong watercolor style” by default if you don’t specify something else).
Although something habryka brought up is that a lot of people’s images seem to be coming from Substack, which has its own (bad) version of it.
Beautiful, thanks for sharing