Can DALL·E Create New Styles?
Most DALL·E questions can be answered by just reading its paper or those of its competitors, or are dumb. This is probably the most interesting question that can’t be, and also one of the most common: can DALL·E (which we’ll use just as a generic representative of image generative models, since no one argues, AFAIK, that one architecture or model can and the others cannot) invent a new style? DALL·E is, like GPT-3 in text, admittedly an incredible mimic of many styles, and appears to have gone well beyond any mere ‘memorization’ of the images depicting styles, because it can so seamlessly insert random objects into arbitrary styles (hence all the “Kermit Through The Ages” or “Mughal space rocket” variants); but simply being a gifted epigone of most existing styles is no guarantee that you can create a new one.
If we asked a Martian what ‘style’ was, it would probably conclude that “‘style’ is what you call it when some especially mentally-ill humans output the same mistakes for so long that other humans wearing nooses try to hide the defective output by throwing small pieces of green paper at the outputs, and a third group of humans wearing dresses try to exchange large white pieces of paper with black marks on them for the smaller green papers”.
Not the best definition, but it does provide one answer: since DALL·E is just a blob of binary which gets run on a GPU, it is incapable of inventing a style because it can’t take credit for it or get paid for it or ally with gallerists and journalists to create a new fashion, so the pragmatic answer is just ‘no’, no more than your visual cortex could. So, no. This is unsatisfactory, however, because it just punts to, ‘could humans create a new style with DALL·E?’ and then the answer to that is simply, ‘yes, why not? Art has no rules these days: if you can get someone to pay millions for a rotting shark or half a mill for a blurry DCGAN portrait, we sure as heck can’t rule out someone taking some DALL·E output and getting paid for it.’ After all, DALL·E won’t complain (again, no more than your visual cortex would). Also unsatisfactory, but it is at least testable: has anyone gotten paid yet? (Of course artists will usually try to minimize or lie about it to protect their trade secrets, but at some point someone will ’fess up or it becomes obvious.) So, yes.
Let’s take ‘style’ to be some principled, real, systematic visual system of esthetics. Regular use of DALL·E, of course, would not produce a new style: what would be the name of this style in the prompt? “Unnamed new style”? Obviously, if you prompt DALL·E for “a night full of stars, Impressionism”, you will get what you ask for. What are the Internet-scraped image/text caption pairs which would correspond to the creation of a new style, exactly? “A dazzling image of an unnamed new style being born | Artstation | digital painting”? There may well be actual image captions out there which do say something like that, but surely far too few to induce some sort of zero-shot new-style creation ability. Humans too would struggle with such an instruction. (Although it’s fun to imagine trying to commission that from a human artist on Fiverr for $50, say: “an image of a cute cat, in a totally brand-new never before seen style.” “A what?” “A new style.” “I’m really best at anime-style illustrations, you know.” “I know. Still, I’d like ‘a brand new style’. Also, I’d like to commission a second one after that too, same prompt.” “...would you like a refund?”)
Still, perhaps DALL·E might invent a new style anyway just as part of normal random sampling? Surely if you generated enough images it’d eventually output something novel? However, DALL·E isn’t trying to do so; it is ‘trying’ to do something closer to generating the single most plausible image for a given text input, or, to some minor degree, sampling from the posterior distribution of the Internet images + commercial licensed image dataset it was trained on. To the extent that a new style is possible, it ought to be extremely rare, because it is not, in fact, in the training data distribution (by definition, it’s novel); and even when DALL·E 2 does ‘mistakenly’ produce one, that newborn style will show up only rarely, because it is so unpopular compared to all the established styles: 1 in millions or billions of samples.
Let’s say it defied the odds and did anyway, since OA has generated millions of DALL·E 2 samples already according to their PR. ‘Style’ is something of a unicorn: if DALL·E could (or had already) invented a new style… how would we know? If Impressionism had never existed and Van Gogh’s Starry Night flashed up on the screen of a DALL·E 2 user late one night, they would probably go ‘huh, weird blobby effect, not sure I like it’ and then generate new completions—rather than herald it as the ultimate exemplar of a major style and destined to be one of the most popular (to the point of kitsch).
Finally, if someone did seize on a sample from a style-less prompt because it looked new to them and wanted to generate more, they would be out of luck: DALL·E 2 can generate variations on an image, yes, but this unavoidably is a mashup of all of the content and style and details in an image. There is not really any native way to say ‘take the cool new style of this image and apply it to another’. You are stuck with hacks: you can try shrinking the image to uncrop, or halve it and paste in a target image to infill, or you can go outside DALL·E 2 entirely and use it in a standard style-transfer NN as the original style image… But there is no way to extract the ‘style’ as an easily reused keyword or tool the way you can apply ‘Impressionism’ to any prompt.
This is a bad situation. You can’t ask for a new style by name because it has none; you can’t describe it without a name, because that’s not how new real-world styles get talked about either: people coin a name for them; and if you don’t talk about it, a new style has vanishingly low odds of being generated, and you wouldn’t recognize it, nor could you make any good use of it if you did. So, no.
DALL·E might be perfectly capable of creating a new style in some sense, but the interface renders this utterly opaque, hidden dark knowledge. We can be pretty sure that DALL·E knows styles as styles rather than some mashup of physical objects/colors/shapes: just like large language models imitate or can be prompted to be more or less rude, more or less accurate, more or less calibrated, generate more or less buggy or insecure code, etc., large image models disentangle and learn pretty cool generic capabilities: not just individual styles, but ‘award-winning’ or ‘trending on Artstation’ or ‘drawn by an amateur’. Further, we can point to things like style transfer: you can use a VGG CNN trained solely on ImageNet, with near-zero artwork in it (and definitely not a lot of Impressionist paintings), to fairly convincingly stylize images in the style of “Starry Night”—VGG has never seen “Starry Night”, and may never have seen a painting, period, so how does it do this?
Where DALL·E knows about styles is in its latent space (or VGG’s Gram matrix embedding): the latent space is an incredibly powerful way to boil down images, and manipulation of the latent space can go beyond ordinary samples to make, say, a face StyleGAN generate cars or cats instead—there’s a latent for that. Even things which seem to require ‘extrapolation’ are still ‘in’ the capacious latent space somewhere, and probably not even that far away: in very high dimensional spaces, everything is ‘interpolation’ because everything is an ‘outlier’; why should a ‘new style’ be all that far away from the latent points corresponding to well-known styles?
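As a concrete illustration of the ‘Gram matrix embedding’ just mentioned, here is a minimal PyTorch sketch of the classic Gatys-style representation: a ‘style’ is summarized by the Gram matrices of VGG feature maps, which is why a network that never saw “Starry Night” can still be used to transfer its look. The layer indices and the example filename are arbitrary choices for illustration, not anything canonical.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def gram_matrix(feat):
    # feat: (1, C, H, W) feature map -> (C, C) matrix of channel co-activations
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t() / (c * h * w)

# VGG19 trained only on ImageNet object classification; no art required.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
preprocess = T.Compose([T.Resize(512), T.CenterCrop(512), T.ToTensor()])

img = preprocess(Image.open("starry_night.jpg").convert("RGB")).unsqueeze(0)  # placeholder file

style_layers = {1, 6, 11, 20, 29}   # ReLUs after conv{1..5}_1, the usual Gatys choice
grams, x = [], img
with torch.no_grad():
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in style_layers:
            grams.append(gram_matrix(x))
        if i == max(style_layers):
            break
# `grams` is the 'style embedding'; style transfer optimizes a target image so its
# Gram matrices match these while its deeper features stay close to a content image.
```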
All text prompts and variations are just hamfisted ways of manipulating the latent space. The text prompt is just there to be encoded by CLIP into a latent space. The latent space is what encodes the knowledge of the model, and if we can manipulate the latent space, we can unlock all sorts of capabilities like in face GANs, where you can find latent variables which correspond to, say, wearing eyeglasses or smiling vs frowning—no need to mess around with trying to use CLIP to guide a ‘smile’ prompt if you can just tweak the knob directly.
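For the face-GAN ‘knob’ example, a minimal sketch of how such a latent direction is usually found, assuming you already have a generator and some attribute labels; the generator `g` and the `smile_classifier` in the usage comments are hypothetical stand-ins, not real APIs.

```python
import numpy as np

def attribute_direction(latents, labels):
    """Difference of class means: a crude 'smile'/'eyeglasses' direction."""
    latents = np.asarray(latents)
    labels = np.asarray(labels, dtype=bool)
    d = latents[labels].mean(axis=0) - latents[~labels].mean(axis=0)
    return d / np.linalg.norm(d)

# Usage with a hypothetical generator g and attribute classifier (not real APIs):
#   z = np.random.randn(10_000, 512)          # sample latents
#   labels = smile_classifier(g(z))           # label the generated faces
#   smile_dir = attribute_direction(z, labels)
#   edited = g(z[0] + 3.0 * smile_dir)        # 'turn the smile knob'
# The same trick on style-labeled latents would give a reusable 'style knob'
# instead of re-prompting through CLIP every time.
```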
Unless, of course, you can’t tweak the knob directly, because it’s behind an API and you have no way of getting or setting the embedding, much less doing gradient ascent. Yeah, then you’re boned. So the answer here becomes, ‘no, for now: DALL·E 2 can’t in practice because you can’t use it in the necessary way, but when some equivalent model gets released, then it becomes possible (growth mindset!).’
Let’s say we have that model, because it surely won’t be too long before one gets released publicly, maybe a year or two at the most. And public models like DALL·E Mini might be good enough already. How would we go about it concretely?
‘Copying style embedding’ features alone would be a big boost: if you could at least cut out and save the style part of an embedding and use it for future prompts/editing, then when you found something you liked, you could keep it.
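As a rough sketch of what ‘cut out and save the style part of an embedding’ could look like with tools available today, using OpenAI’s CLIP as a stand-in for DALL·E 2’s inaccessible internals: the decomposition below (style as the component of the image embedding orthogonal to a plain content caption) is only an assumption for illustration, not an established method, and the filenames are placeholders.

```python
import torch, clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def style_residual(image_path, content_caption):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        e_img = model.encode_image(img).float()
        e_txt = model.encode_text(clip.tokenize([content_caption]).to(device)).float()
    e_img, e_txt = e_img / e_img.norm(), e_txt / e_txt.norm()
    # subtract the 'content' direction, keep what's left as a crude 'style' vector
    return e_img - (e_img @ e_txt.T) * e_txt

style_vec = style_residual("lucky_sample.png", "a cute cat")  # hypothetical files
torch.save(style_vec, "saved_style.pt")   # keep it for future prompts/edits
# A model conditioned on CLIP image embeddings (unCLIP-style) could then mix this
# saved vector back into new generations.
```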
‘Novelty search’ has a long history in evolutionary computation, and offers a lot of different approaches. Defining ‘fitness’ or ‘novelty’ is a big problem here, but the models themselves can be used for that: novelty as compared against the data embeddings, optimizing the score of a large ensemble of randomly-initialized NNs (see also my recent essay on constrained optimization as esthetics) or NNs trained on subsets (such as specific art movements, to see what ‘hyper Impressionism’ looks like) or...
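For concreteness, the standard novelty-search score (Lehman & Stanley style: mean distance to the k nearest neighbors among data embeddings plus an archive of past generations) is only a few lines; the embeddings here are abstract vectors, e.g. CLIP image embeddings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def novelty_scores(candidates, reference, k=15):
    """candidates: (n, d) embeddings to score; reference: (m, d) data + archive."""
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dists, _ = nn.kneighbors(candidates)
    return dists.mean(axis=1)          # higher = farther from anything known

# In an evolutionary loop you would generate a batch of latents, decode them, embed
# the results, keep the highest-scoring ones, and add them to the archive so the
# search keeps being pushed toward genuinely unexplored regions.
```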
Preference-learning reinforcement learning is a standard approach: try to train novelty generation directly. DRL is always hard though.
One approach worth looking at is “CAN: Creative Adversarial Networks, Generating ‘Art’ by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017. It’s a bit AI-GA in that it takes an inverted U-curve theory of novelty/art: a good new style is essentially any new style which you don’t like but your kids will in 15 years, because it’s a lot like, but not too much like, an existing style. CAN can probably be adapted to this setting.
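A hedged sketch of the CAN generator objective as I understand Elgammal et al 2017 (details simplified): be accepted as ‘art’ by the discriminator while leaving its style classifier maximally confused across the K known styles, i.e. new-but-not-too-new.

```python
import torch
import torch.nn.functional as F

def can_generator_loss(d_real_fake_logit, d_style_logits):
    """d_real_fake_logit: (B,) 'is this art?' logits for generated images.
    d_style_logits: (B, K) style-classification logits for the same images."""
    # standard non-saturating GAN term: be judged as art
    adv = F.binary_cross_entropy_with_logits(
        d_real_fake_logit, torch.ones_like(d_real_fake_logit))
    # style-ambiguity term: cross-entropy against the uniform distribution over styles,
    # i.e. belong equally to every known style and hence to none of them
    log_p = F.log_softmax(d_style_logits, dim=1)
    ambiguity = -log_p.mean()
    return adv + ambiguity
```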
CAN is a multi-agent approach in trying to create novelty, but I think you can probably do something much simpler by directly targeting this idea of new-but-not-too-new, by exploiting embeddings of real data.
If you embed & cluster your training data using the style-specific latents (which you’ve found by one of many existing approaches, like embedding the names of stylistic movements to see what latents they average out to controlling, or by training a classifier, or just rating manually by eye), styles will form island-chains of works in each style, surrounded by darkness. One can look for suspicious holes: areas of darkness which get a high likelihood from the model, but are anomalously underrepresented in terms of how many embedded datapoints are nearby; these are ‘missing’ styles. The missing styles around a popular style are valuable directions to explore, something like alternative futures: ‘Impressionism wound up going thattaway, but it could also have gone off this other way’. These could seed CAN approaches, or they could be used to bias regular generation: what if, when a user prompts ‘Impressionist’ and gets back a dozen samples, each one is deliberately diversified to sample from a different missing style immediately adjacent to the ‘Impressionist’ point?
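A minimal sketch of that hole-finding idea, using a coarse Gaussian-mixture density as a crude stand-in for ‘the model assigns this region high likelihood’ and a k-nearest-neighbor distance for ‘hardly any real works live here’; the component counts and thresholds are arbitrary illustration values.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

def find_style_holes(style_embeddings, n_candidates=5000, k=10, top=20, seed=0):
    # coarse density model: stand-in for 'the image model finds this region plausible'
    coarse = GaussianMixture(n_components=8, random_state=seed).fit(style_embeddings)
    candidates, _ = coarse.sample(n_candidates)
    plausibility = coarse.score_samples(candidates)      # high = on the art manifold
    # local occupancy: mean distance to the k nearest real works (high = empty neighborhood)
    occupancy = NearestNeighbors(n_neighbors=k).fit(style_embeddings) \
                    .kneighbors(candidates)[0].mean(axis=1)
    # rank by 'plausible but empty': standardize and add the two signals
    z = lambda v: (v - v.mean()) / (v.std() + 1e-9)
    score = z(plausibility) + z(occupancy)
    return candidates[np.argsort(-score)[:top]]          # candidate 'missing styles'

# Decoding these points back into images (in a model that exposes its latents) would
# show what the never-named neighbors of, say, Impressionism look like.
```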
So, maybe.
An interesting example of what might be a ‘name-less style’ in a generative image model, Stable Diffusion in this case (DALL-E 2 doesn’t give you the necessary access so users can’t experiment with this sort of thing): what the discoverer calls the “Loab” (mirror) image (for lack of a better name—what text prompt, if any, this image corresponds to is unknown, as it’s found by negation of a text prompt & search).
‘Loab’ is an image of a creepy old desaturated woman with ruddy cheeks in a wide face, which when hybridized with other images, reliably induces more images of her, or recognizably in the ‘Loab style’ (extreme levels of horror, gore, and old women). This is a little reminiscent of the discovered ‘Crungus’ monster, but ‘Loab style’ can happen, they say, even several generations of image breeding later when any obvious part of Loab is gone—which suggests to me there may be some subtle global property of descendant images which pulls them back to Loab-space and makes it ‘viral’, if you will. (Some sort of high-frequency non-robust or adversarial or steganographic phenomenon?) Very SCP.
Apropos of my other comments on weird self-fulfilling prophecies and QAnon and stand-alone-complexes, it’s also worth noting that, since Loab is going viral right now, Loab may be a name-less style today, but in future image-generator models feeding on the updated corpus, because of all the discussion & sharing, it (like Crungus) may come to have a name: ‘Loab’.
I wonder what happens when you ask it to generate
> “in the style of a popular modern artist <unknown name>”
or
> “in the style of <random word stem>ism”.
You could generate both types of prompts with GPT-3 if you wanted, so it would be a complete pipeline.
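A toy sketch of that pipeline: invent never-seen ‘-ism’ names locally and feed them as prompts. The syllable list and phrasing are arbitrary, and the local stem generator could just as well be replaced by a GPT-3 call for the “popular modern artist <unknown name>” variant.

```python
import random

SYLLABLES = ["vor", "tex", "lum", "chro", "sta", "quin", "bral", "ne", "ox", "fir"]

def fake_ism(rng=random):
    stem = "".join(rng.choice(SYLLABLES) for _ in range(rng.randint(2, 3)))
    return stem.capitalize() + "ism"

def style_prompts(subject, n=5):
    return [f"{subject}, in the style of {fake_ism()}" for _ in range(n)]

print(style_prompts("a cute cat"))
# e.g. ['a cute cat, in the style of Quinoxism', ...]
# The same loop could instead ask GPT-3/4 for "the style of a popular modern
# artist <invented name>" to cover the first type of prompt.
```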
“Generate conditioned on the new style description” may be ready to be used even if “generate conditioned on an instruction to generate something new” is not. This is why a decomposition into new style description + image conditioned on it seems useful.
If this is successful, then more of the high-level idea generation involved can be shifted onto a language model by letting it output a style description. Leave blanks in it and run it for each blank, while ensuring generations form a coherent story.
> “<new style name>, sometimes referred to as <shortened version>, is a style of design, visual arts, <another area>, <another area> that first appeared in <country> after <event>. It influenced the design of <objects>, <objects>, <more objects>. <new style name> combined <combinatorial style characteristic> and <another style characteristic>. During its heyday, it represented <area of human life>, <emotion>, <emotion> and <attitude> towards <event>.”
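A small sketch of the blank-filling procedure on an abridged version of that template: fill each <blank> left-to-right so later fills can condition on earlier ones, then hand the finished description to the image model as a prompt. `complete` stands in for whatever language model call you prefer; the toy filler exists only so the example runs.

```python
import re

TEMPLATE = ("<new style name>, sometimes referred to as <shortened version>, "
            "is a style of design, visual arts, <another area> that first "
            "appeared in <country> after <event>.")

def fill_template(template, complete):
    filled = template
    while (m := re.search(r"<([^>]+)>", filled)):
        # condition the model on everything filled so far plus the blank's label
        value = complete(filled[:m.start()], blank=m.group(1))
        filled = filled[:m.start()] + value + filled[m.end():]
    return filled

# Toy stand-in 'language model' so the sketch runs; swap in a real LM call.
def toy_complete(context, blank):
    return {"new style name": "Veltrachism", "shortened version": "Veltra",
            "another area": "ceramics", "country": "Iceland",
            "event": "the 2008 financial crisis"}.get(blank, blank)

print(fill_template(TEMPLATE, toy_complete))
```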
DALL-E can already model the distribution of possible contexts (image backgrounds, other objects, states of the object) + possible prompt meanings, and go from the description 1) to high-level concepts, 2) to ideas for implementing these concepts (relative placement of objects, ideas for how to merge concepts), 3) to low-level details. All within 1 forward pass, for all prompts! This is what astonished me most about DALL-E 1.
Importantly, placing, implementing, and combining concepts in a picture is done in a novel way without a provided specification. For style generation, it would need to model a distribution over all possible styles and use each style, all without a style specification. This doesn’t seem much harder to me and could probably be achieved with slightly different training. The procedure I described is just supposed to introduce helpful stochasticity in the prompt and use an established generation conduit.