alexlyzhov comments on What DALL-E 2 can and cannot do

alexlyzhov 4 Jul 2022 20:25 UTC
1 point
I wonder what happens when you ask it to generate
> “in the style of a popular modern artist <unknown name>”
or
> “in the style of <random word stem>ism”.
You could generate both types of prompts with GPT-3 if you wanted, so it would be a complete pipeline.
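A minimal sketch of how the two prompt types could be assembled, assuming Python. Everything named here is hypothetical: `make_style_prompts`, `random_stem`, the subject string, and the artist name are made up for illustration, and `random_stem` is only a local placeholder for where a GPT-3 call would supply the invented name or word stem.

```python
import random


def make_style_prompts(subject, artist_name, word_stem):
    """Build both prompt variants for one image subject."""
    return [
        f"{subject} in the style of a popular modern artist {artist_name}",
        f"{subject} in the style of {word_stem}ism",
    ]


def random_stem(rng, syllables=("vor", "ta", "lum", "ex", "ki", "ro"), n=2):
    # Placeholder for a language-model call: joins n random syllables.
    return "".join(rng.choice(syllables) for _ in range(n))


rng = random.Random(0)
# "Mira Tolven" is an invented name, standing in for <unknown name>.
prompts = make_style_prompts("a city street", "Mira Tolven", random_stem(rng))
for p in prompts:
    print(p)
```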
“Generate conditioned on the new style description” may be ready to be used even if “generate conditioned on an instruction to generate something new” is not. This is why a decomposition into new style description + image conditioned on it seems useful.
If this is successful, then more of the high-level idea generation can be shifted onto a language model by letting it output a style description. Leave blanks in a template like the one below and run the model once per blank, while ensuring the generations form a coherent story.
> “<new style name>, sometimes referred to as <shortened version>, is a style of design, visual arts, <another area>, <another area> that first appeared in <country> after <event>. It influenced the design of <objects>, <objects>, <more objects>. <new style name> combined <combinatorial style characteristic> and <another style characteristic>. During its heyday, it represented <area of human life>, <emotion>, <emotion> and <attitude> towards <event>.”
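The blank-by-blank filling could be sketched roughly as follows, assuming Python. The function `fill_template`, its `complete` callback, and the `reuse` parameter are hypothetical names for illustration; `complete` stands in for a language-model call and receives the text filled in so far, which is one simple way to keep the generations coherent. Blanks listed in `reuse` (such as the style name) keep their first value, while other repeated blanks (such as `<emotion>`) get a fresh completion each time.

```python
import re
from typing import Callable

# A blank is any <...> span with no nested angle brackets.
BLANK = re.compile(r"<([^<>]+)>")


def fill_template(
    template: str,
    complete: Callable[[str, str], str],
    reuse: frozenset = frozenset({"new style name"}),
) -> str:
    """Fill blanks left to right, one call to `complete` per blank.

    `complete(context, label)` is an assumed stand-in for a language model:
    `context` is the text generated so far, `label` is the blank's name.
    """
    filled = {}
    out = []
    pos = 0
    for m in BLANK.finditer(template):
        out.append(template[pos:m.start()])
        label = m.group(1)
        if label in reuse and label in filled:
            value = filled[label]  # keep e.g. the style name consistent
        else:
            value = complete("".join(out), label)
            if label in reuse:
                filled[label] = value
        out.append(value)
        pos = m.end()
    out.append(template[pos:])
    return "".join(out)
```

With a real model behind `complete`, the returned string would be the style description to pass on as the image prompt.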
DALL-E can already model the distribution of possible contexts (image backgrounds, other objects, states of the object) + possible prompt meanings. It can go from the description 1) to high-level concepts, 2) to ideas for implementing these concepts (relative placement of objects, ideas for how to merge concepts), 3) to low-level details. All within 1 forward pass, for all prompts! This is what astonished me most about DALL-E 1.
Importantly, placing, implementing, and combining concepts in a picture is done in a novel way without a provided specification. For style generation, it would need to model a distribution over all possible styles and use each style, all without a style specification. This doesn’t seem much harder to me and could probably be achieved with slightly different training. The procedure I described is just supposed to introduce helpful stochasticity in the prompt and use an established generation conduit.