Wow, this is going to explode picture books and book covers.
Hiring an illustrator for a picture book costs a lot, as it should given it’s bespoke art.
Now publishers will have an editor type in page descriptions, curate the best, and off they go. I can easily imagine a model improvement that remembers the boy it already drew, or the steampunk bear, etc.
Book cover designers are in trouble too. A wizard with lightning in his hands while a mountain explodes behind him—this can generate multiple options.
It’s going to get really wild when A/B split testing is involved. As you mention regarding ads, you’d give the system the power to make whatever images it wanted and then split test. Letting it write headlines would work too.
Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins—animated, eight seconds. And so on. At that point it’s like Pixar in a box. We’ll see an explosion of directors who work alone, typing descriptions, testing camera angles, altering scenes on the fly. Do that again but more violent. Do that again but with more blood splatter.
Animation in the style of Family Guy seems a natural first step there. Solid colours, less variation, no messing with light rippling, etc.
There’s a service authors use for illustrated chapter breaks: a black-and-white dragon snoozing, roses around knives, that sort of thing. No need to hire an illustrator now.
Conversion of all fiction novels to graphic novel format. At first it’ll be laborious, typing in scene descriptions, but graphic novel art is really expensive now. I can see a publisher hiring a freelancer to produce fifty graphic novels from existing titles.
With a bit of memory, so that once I choose the image I want for each character it sticks, this is an amazing game changer for publishing.
Storyboarding requires no drawing skill now. Couple sprinting down dark alley chased by robots.
Game companies can use it to rapidly prototype looks and styles. They can do all that background art by typing descriptions and saving the best.
We’re going to end up with famous illustrators who can’t draw but who have created amazing styles using this and then made books.
Thanks so much for this post. This is wild, astonishing stuff. As an author who is about to throw large sums of money at cover design, it’s incredible to think a commercial version of this could do it for a fraction of the price.
edit: just going to add some more
App design that requires art. For example, many multiple-choice story apps are costly to make due to art costs.
Split-tested cover designs for pretty much anything—books, music albums, posters. Generate, run the ad campaign, test clicks. An ad business will be able to throw up 1,000 completely different variations in a day.
All catalogs/brochures that currently use stock art. While choosing stock art to make things works, it also sucks and is annoying given the limited range. I’m imagining a stock art company could radically expand their selection to keep people buying from them. All those searches that people have typed in are now prompts.
Illustrating Wikipedia. Many articles need images to demonstrate a point and rely on contributors making them. This could improve both the volume and the quality of images.
Graphic novels/comic books—writers who don’t need artists essentially. To start it will be describing single panels and manually adding speech text but that’s still faster and cheaper than hiring an artist. For publishers—why pick and choose what becomes a graphic novel when you can just make every title into a graphic novel.
YouTube/video interstitial art. No more stock photos.
Licensed characters (think Paw Patrol, Disney, DreamWorks): creation of endless poses and scenes. No more waiting for DreamWorks to produce 64 pieces of black-and-white line art when it may be able to take the movie frames and create images from them.
Adaptations—the 24-page storybook of Finding Nemo. The 24-page storybook of Pinocchio. The picture book of The Fast and the Furious.
Looking further ahead we might even see a drop-down option for existing comics and graphic novels in a different art style. Reading the same Spider-Man story but illustrated by someone else.
Character design—for games, licensing, children’s animation. This radically expands the volume of characters that can be designed, selected, and then reused in future scenes.
With some sort of “keep this style”, “save that character” method, it really would be possible to generate a 24-page picture book in an incredibly short amount of time.
Quite frankly, knowing how it works, I’d write a picture book of a kid going through different art styles in their adventure. Chasing their puppy through the art museum and the dog runs into a painting. First Van Gogh, then Da Vinci and so on. The kid changes appearance due to the model but that works for the story.
As a commercial product, this system would be incredible. I expect we’ll see an explosion in the number of picture books, graphic novels, posters, art designs, Etsy prints, downloadable files and so on. Publishers with huge backlists would be prime customers.
Video is on the horizon (see the video generation bibliography, e.g. FDM), in the 1-3 year range. I would say that video is solved conceptually, in the sense that if you had 100x the compute budget, you could do DALL-E-2-but-for-video right now already. After all, if you can generate a single image which is sensible and logical, then a video is simply doing that repeatedly. Nor is there any shortage of video footage to work with. The problem is that a video is a lot of images: at least 24 images per second, so the same budget buys you 192 different samples or one 8-second clip. Most people will prefer the former: decorating, say, a hundred blog posts with illustrations is more useful than a single OK short video clip of someone dancing.
So video’s game is mostly about whether you can come up with an approach which somehow economizes on that, like clever tricks that reuse frames and update each one only a little while evolving a latent vector, as a way to take a shortcut to the point in the future where compute is so plentiful that the obvious Transformer & diffusion models run on reasonable budgets and video ‘just works’.
And either way, it may be the revolution that robotics requires (video is a great way to plan).
Following up on your logic here, the one thing that DALL-E 2 hasn’t done, to my knowledge, is generate entirely new styles of art, the way that Art Deco or pointillism were truly different from their predecessors.
Perhaps that’ll be the new role of human illustrators? Artists, instead of producing their own works to sell, would instead create their own styles, generating libraries of content for future DALL-Es to be trained against. They would then make a percentage on whatever DALL-E makes from image sales if the style used was their own.
Can DALL·E Create New Styles?
Most DALL·E questions can be answered by just reading its paper or its competitors’, or are dumb. This is probably the most interesting question that can’t be, and also one of the most common: can DALL·E (which we’ll use just as a generic representative of image generative models, since no one argues that one architecture or model can and the others cannot, AFAIK) invent a new style? DALL·E is, like GPT-3 in text, admittedly an incredible mimic of many styles, and appears to have gone well beyond any mere ‘memorization’ of the images depicting styles, because it can so seamlessly insert random objects into arbitrary styles (hence all the “Kermit Through The Ages” or “Mughal space rocket” variants); but simply being a gifted epigone of most existing styles is no guarantee that you can create a new one.
If we asked a Martian what ‘style’ was, it would probably conclude that “‘style’ is what you call it when some especially mentally-ill humans output the same mistakes for so long that other humans wearing nooses try to hide the defective output by throwing small pieces of green paper at the outputs, and a third group of humans wearing dresses try to exchange large white pieces of paper with black marks on them for the smaller green papers”.
Not the best definition, but it does provide one answer: since DALL·E is just a blob of binary which gets run on a GPU, it is incapable of inventing a style because it can’t take credit for it or get paid for it or ally with gallerists and journalists to create a new fashion, so the pragmatic answer is just ‘no’, no more than your visual cortex could. So, no. This is unsatisfactory, however, because it just punts to, ‘could humans create a new style with DALL·E?’ and then the answer to that is simply, ‘yes, why not? Art has no rules these days: if you can get someone to pay millions for a rotting shark or half a mill for a blurry DCGAN portrait, we sure as heck can’t rule out someone taking some DALL·E output and getting paid for it.’ After all, DALL·E won’t complain (again, no more than your visual cortex would). Also unsatisfactory, but it is at least testable: has anyone gotten paid yet? (Of course artists will usually try to minimize or lie about it to protect their trade secrets, but at some point someone will ’fess up or it becomes obvious.) So, yes.
Let’s take ‘style’ to be some principled, real, systematic visual system of esthetics. Regular use of DALL·E, of course, would not produce a new style: what would be the name of this style in the prompt? “Unnamed new style”? Obviously, if you prompt DALL·E for “a night full of stars, Impressionism”, you will get what you ask for. What are the Internet-scraped image/text caption pairs which would correspond to the creation of a new style, exactly? “A dazzling image of an unnamed new style being born | Artstation | digital painting”? There may well be actual image captions out there which do say something like that, but surely far too few to induce some sort of zero-shot new-style creation ability. Humans too would struggle with such an instruction. (Although it’s fun to imagine trying to commission that from a human artist on Fiverr for $50, say: “an image of a cute cat, in a totally brand-new never before seen style.” “A what?” “A new style.” “I’m really best at anime-style illustrations, you know.” “I know. Still, I’d like ‘a brand new style’. Also, I’d like to commission a second one after that too, same prompt.” ”...would you like a refund?”)
Still, perhaps DALL·E might invent a new style anyway just as part of normal random sampling? Surely if you generated enough images it’d eventually output something novel? However, DALL·E isn’t trying to do so; it is ‘trying’ to do something closer to generating the single most plausible image for a given text input, or, to some minor degree, sampling from the posterior distribution of the Internet images + commercial licensed image dataset it was trained on. To the extent that a new style is possible, it ought to be extremely rare, because it is not, in fact, in the training data distribution (by definition, it’s novel); and even if DALL·E 2 did ‘mistakenly’ produce one, this newborn style would show up only extremely rarely, because it is so unpopular compared to all the popular styles: 1 in millions or billions.
Let’s say it defied the odds and did anyway, since OA has generated millions of DALL·E 2 samples already according to their PR. ‘Style’ is something of a unicorn: if DALL·E could (or had already) invented a new style… how would we know? If Impressionism had never existed and Van Gogh’s Starry Night flashed up on the screen of a DALL·E 2 user late one night, they would probably go ‘huh, weird blobby effect, not sure I like it’ and then generate new completions—rather than herald it as the ultimate exemplar of a major style and destined to be one of the most popular (to the point of kitsch).
Finally, if someone did seize on a sample from a style-less prompt because it looked new to them and wanted to generate more, they would be out of luck: DALL·E 2 can generate variations on an image, yes, but this unavoidably is a mashup of all of the content and style and details in an image. There is not really any native way to say ‘take the cool new style of this image and apply it to another’. You are stuck with hacks: you can try shrinking the image to uncrop, or halve it and paste in a target image to infill, or you can go outside DALL·E 2 entirely and use it in a standard style-transfer NN as the original style image… But there is no way to extract the ‘style’ as an easily reused keyword or tool the way you can apply ‘Impressionism’ to any prompt.
This is a bad situation. You can’t ask for a new style by name because it has none; you can’t talk about it without naming it, because no one does that for new real-world styles either: they name them; and if you don’t talk about it, a new style has vanishingly low odds of being generated, and you wouldn’t recognize it, nor could you make any good use of it if you did. So, no.
DALL·E might be perfectly capable of creating a new style in some sense, but the interface renders this utterly opaque, hidden dark knowledge. We can be pretty sure that DALL·E knows styles as styles rather than some mashup of physical objects/colors/shapes: just like large language models imitate or can be prompted to be more or less rude, more or less accurate, more or less calibrated, generate more or less buggy or insecure code, etc., large image models disentangle and learn pretty cool generic capabilities: not just individual styles, but ‘award-winning’ or ‘trending on Artstation’ or ‘drawn by an amateur’. Further, we can point to things like style transfer: you can use a VGG CNN trained solely on ImageNet, with near-zero artwork in it (and definitely not a lot of Impressionist paintings), to fairly convincingly stylize images in the style of “Starry Night”—VGG has never seen “Starry Night”, and may never have seen a painting, period, so how does it do this?
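The trick VGG pulls off is worth spelling out, since it shows how a ‘style’ can be captured as a statistic with the content averaged away. Here is a minimal sketch of the standard Gatys-style Gram-matrix features, assuming a recent torchvision with a pretrained VGG-19; this is generic style-transfer machinery, not anything DALL·E 2 exposes:

```python
# Sketch of the Gram-matrix 'style' representation from classic neural style
# transfer (Gatys et al 2015), using torchvision's pretrained VGG-19 features.
# Spatial layout (the 'content') is summed away, leaving texture/style statistics.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def gram_matrices(img_path, layers=(1, 6, 11, 20, 29)):
    """Per-layer Gram matrices: channel-by-channel feature correlations."""
    x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
    grams = []
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layers:                    # relu1_1 ... relu5_1 in VGG-19
                b, c, h, w = x.shape
                f = x.view(c, h * w)           # flatten the spatial dimensions
                grams.append(f @ f.t() / (c * h * w))
    return grams

def style_distance(path_a, path_b):
    """'Starry Night' and a photo stylized from it score close, despite
    having completely different content."""
    return sum(torch.norm(ga - gb)
               for ga, gb in zip(gram_matrices(path_a), gram_matrices(path_b)))
```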
Where DALL·E knows about styles is in its latent space (or VGG’s Gram matrix embedding): the latent space is an incredibly powerful way to boil down images, and manipulation of the latent space can go beyond ordinary samples to make, say, a face StyleGAN generate cars or cats instead—there’s a latent for that. Even things which seem to require ‘extrapolation’ are still ‘in’ the capacious latent space somewhere, and probably not even that far away: in very high dimensional spaces, everything is ‘interpolation’ because everything is an ‘outlier’; why should a ‘new style’ be all that far away from the latent points corresponding to well-known styles?
All text prompts and variations are just hamfisted ways of manipulating the latent space. The text prompt is just there to be encoded by CLIP into a latent space. The latent space is what encodes the knowledge of the model, and if we can manipulate the latent space, we can unlock all sorts of capabilities like in face GANs, where you can find latent variables which correspond to, say, wearing eyeglasses or smiling vs frowning—no need to mess around with trying to use CLIP to guide a ‘smile’ prompt if you can just tweak the knob directly.
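For concreteness, a minimal sketch of that ‘tweak the knob directly’ idea as it is done with face GANs: estimate an attribute direction from labeled latents and add it to any sample. The `generator`, latents, and labels below are placeholders for what a face-GAN setup would give you, not anything DALL·E 2 provides.

```python
import numpy as np

# Latent-space 'knob tweaking', face-GAN style. Assumes you already have
# latent vectors for a batch of generated images plus binary attribute labels
# (e.g. smiling / not smiling), and some `generator` mapping latent -> image.
# All of these are hypothetical inputs for illustration.

def find_direction(latents, labels):
    """Difference of class means: a crude but effective attribute axis."""
    latents = np.asarray(latents)
    labels = np.asarray(labels, dtype=bool)
    return latents[labels].mean(axis=0) - latents[~labels].mean(axis=0)

def edit(latent, direction, strength=1.5):
    """Move a sample along the attribute axis instead of re-prompting."""
    return latent + strength * direction

# Usage (hypothetical):
#   smile_axis = find_direction(face_latents, is_smiling)
#   new_image  = generator(edit(z, smile_axis, strength=2.0))
# The same trick applied to a *style* axis is exactly the 'copy this style'
# feature the prompt interface lacks.
```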
Unless, of course, you can’t tweak the knob directly, because it’s behind an API and you have no way of getting or setting the embedding, much less doing gradient ascent. Yeah, then you’re boned. So the answer here becomes, ‘no, for now: DALL·E 2 can’t in practice because you can’t use it in the necessary way, but when some equivalent model gets released, then it becomes possible (growth mindset!).’
Let’s say we have that model, because it surely won’t be too long before one gets released publicly, maybe a year or two at the most. And public models like DALL·E Mini might be good enough already. How would we go about it concretely?
‘Copying style embedding’ features alone would be a big boost: if you could at least cut out and save the style part of an embedding and use it for future prompts/editing, then when you found something you liked, you could keep it.
‘Novelty search’ has a long history in evolutionary computation, and offers a lot of different approaches. Defining ‘fitness’ or ‘novelty’ is a big problem here, but the models themselves can be used for that: novelty as compared against the data embeddings, optimizing the score of a large ensemble of randomly-initialized NNs (see also my recent essay on constrained optimization as esthetics) or NNs trained on subsets (such as specific art movements, to see what ‘hyper Impressionism’ looks like) or...
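As a sketch of the simplest version, ‘novelty as compared against the data embeddings’ can be just the average distance from a candidate’s embedding to its nearest neighbors in the training set; the CLIP-style embeddings here are assumed inputs:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Simplest embedding-based novelty score: average distance from a candidate's
# embedding to its k nearest neighbors among the training-data embeddings.
# `data_embeddings` (N x D) and the candidate embedding are assumed to come
# from something like CLIP -- placeholder inputs for illustration.

def novelty_scorer(data_embeddings, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(data_embeddings)
    def score(candidate_embedding):
        dists, _ = nn.kneighbors(np.asarray(candidate_embedding).reshape(1, -1))
        return dists.mean()   # large = far from everything seen = 'novel'
    return score

# In a novelty-search loop you would generate candidates, keep the highest
# scorers, and (per the inverted-U idea below) maybe cap the score so you
# don't just select unrecognizable noise.
```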
Preference-learning reinforcement learning is a standard approach: try to train novelty generation directly. DRL is always hard though.
One approach worth looking at is “CAN: Creative Adversarial Networks, Generating ‘Art’ by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017. It’s a bit AI-GA in that it takes an inverted U-curve theory of novelty/art: a good new style is essentially any new style which you don’t like but your kids will in 15 years, because it’s a lot like, but not too much like, an existing style. CAN can probably be adapted to this setting.
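The core of CAN is easy to sketch: the discriminator gets a second head that classifies images into the K known style movements, and the generator is rewarded for images that look like art while leaving that style classifier maximally uncertain. A rough PyTorch sketch of the generator’s objective; the networks themselves are omitted, and the two logits are assumed outputs of such a two-headed discriminator:

```python
import torch
import torch.nn.functional as F

def generator_loss(d_real_logit, style_logits):
    """CAN-style objective: 'like art, but not like any known style'."""
    # Head 1: the image should be judged as art (real) by the discriminator.
    adv = F.binary_cross_entropy_with_logits(
        d_real_logit, torch.ones_like(d_real_logit))
    # Head 2: the K-way style classifier should be maximally uncertain,
    # i.e. its predicted distribution is pushed toward uniform over styles.
    k = style_logits.shape[-1]
    log_probs = F.log_softmax(style_logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / k)
    ambiguity = -(uniform * log_probs).sum(dim=-1).mean()  # cross-entropy to uniform
    return adv + ambiguity
```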
CAN is a multi-agent approach to creating novelty, but I think you can probably do something much simpler by directly targeting this idea of new-but-not-too-new, exploiting embeddings of real data.
If you embed & cluster your training data using the style-specific latents (which you’ve found by one of many existing approaches like embedding the names of stylistic movements to see what latents they average out to controlling, or by training a classifier, or just rating manually by eye), styles will form island-chains of works in each style, surrounded by darkness. One can look for suspicious holes, areas of darkness which get a high likelihood from the model, but are anomalously underrepresented in terms of how many embedded datapoints are nearby; these are ‘missing’ styles. The missing styles around a popular style are valuable directions to explore, something like alternative futures: ‘Impressionism wound up going thattaway but it could also have gone off this other way’. These could seed CAN approaches, or they could be used to bias regular generation: what if when a user prompts ‘Impressionist’ and gets back a dozen samples, each one is deliberately diversified to sample from a different missing style immediately adjacent to the ‘Impressionist’ point?
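A rough sketch of that hole-finding idea, assuming you already have style embeddings for the training set: fit a coarse density model as a stand-in for the generative model’s likelihood, sample candidate points it still considers plausible, and keep the ones with suspiciously few real datapoints nearby.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

# 'Missing style' mining: regions of the style-embedding space that a density
# model considers plausible but that contain few real works. `style_embeddings`
# (N x D) is an assumed input derived from the style-specific latents above;
# the GMM is a crude stand-in for the generative model's own likelihood.

def find_missing_styles(style_embeddings, n_candidates=10000,
                        n_components=50, k=25, seed=0):
    gmm = GaussianMixture(n_components=n_components,
                          random_state=seed).fit(style_embeddings)
    candidates, _ = gmm.sample(n_candidates)        # points the density model allows
    log_density = gmm.score_samples(candidates)     # plausibility under the model

    nn = NearestNeighbors(n_neighbors=k).fit(style_embeddings)
    dists, _ = nn.kneighbors(candidates)
    sparsity = dists.mean(axis=1)                   # few real works nearby = large

    # 'Suspicious holes': high model likelihood, low local datapoint count.
    hole_score = log_density + np.log(sparsity + 1e-9)
    return candidates[np.argsort(-hole_score)[:100]]

# These candidate points could seed a CAN-style generator, or be mixed into
# the samples returned for a prompt like 'Impressionist' to diversify them.
```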
So, maybe.
An interesting example of what might be a ‘name-less style’ in a generative image model, Stable Diffusion in this case (DALL-E 2 doesn’t give you the necessary access so users can’t experiment with this sort of thing): what the discoverer calls the “Loab” (mirror) image (for lack of a better name—what text prompt, if any, this image corresponds to is unknown, as it’s found by negation of a text prompt & search).
‘Loab’ is an image of a creepy old desaturated woman with ruddy cheeks in a wide face, which when hybridized with other images, reliably induces more images of her, or recognizably in the ‘Loab style’ (extreme levels of horror, gore, and old women). This is a little reminiscent of the discovered ‘Crungus’ monster, but ‘Loab style’ can happen, they say, even several generations of image breeding later when any obvious part of Loab is gone—which suggests to me there may be some subtle global property of descendant images which pulls them back to Loab-space and makes it ‘viral’, if you will. (Some sort of high-frequency non-robust or adversarial or steganographic phenomenon?) Very SCP.
Apropos of my other comments on weird self-fulfilling prophecies and QAnon and stand-alone-complexes, it’s also worth noting that since Loab is going viral right now, Loab may be a name-less style now, but in future image generator models feeding on the updating corpus, because of all the discussion & sharing, it (like Crungus) may come to have a name - ‘Loab’.
I wonder what happens when you ask it to generate
> “in the style of a popular modern artist <unknown name>”
or
> “in the style of <random word stem>ism”.
You could generate both types of prompts with GPT-3 if you wanted, so it would be a complete pipeline.
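A toy version of that pipeline, skipping the GPT-3 call and just gluing random stems onto ‘-ism’ or inventing artist names; the stems and names below are made up for illustration:

```python
import random

# Toy version of the proposed pipeline, minus the language-model call: glue
# random word stems onto '-ism' or invent an artist name, then hand the
# prompts to whatever image model you have access to.

STEMS = ["vapor", "fractal", "murk", "glimmer", "rust", "hollow", "chord"]
FIRST = ["Mira", "Oton", "Kazu", "Lenna", "Brice"]
LAST = ["Vexley", "Armando", "Skovgaard", "Petrov", "Okafor"]

def random_ism_prompt(subject="a woman riding a horse"):
    return f"{subject}, in the style of {random.choice(STEMS)}ism"

def random_artist_prompt(subject="a woman riding a horse"):
    name = f"{random.choice(FIRST)} {random.choice(LAST)}"
    return f"{subject}, in the style of the popular modern artist {name}"

prompts = ([random_ism_prompt() for _ in range(5)]
           + [random_artist_prompt() for _ in range(5)])
for p in prompts:
    print(p)   # feed each to the image model; look for styles that replicate
```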
“Generate conditioned on the new style description” may be ready to be used even if “generate conditioned on an instruction to generate something new” is not. This is why a decomposition into new style description + image conditioned on it seems useful.
If this is successful, then more of the high-level idea generation involved can be shifted onto a language model by letting it output a style description. Leave blanks in it and run it for each blank, while ensuring generations form a coherent story.
>”<new style name>, sometimes referred to as <shortened version>, is a style of design, visual arts, <another area>, <another area> that first appeared in <country> after <event>. It influenced the design of <objects>, <objects>, <more objects>. <new style name> combined <combinatorial style characteristic> and <another style characteristic>. During its heyday, it represented <area of human life>, <emotion>, <emotion> and <attitude> towards <event>.”
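A sketch of that decomposition: fill each blank in turn with a language model, conditioning on what has been filled so far, then use the completed description to condition image generation. `lm_complete` is a hypothetical stand-in for whatever text model you would call (GPT-3, say), and the template is abridged from the one quoted above.

```python
import re

# Blank-filling decomposition: a text model fills each <blank> in the
# style-description template, then the completed description becomes part of
# the image prompt. The template below is abridged from the quoted one.

TEMPLATE = ("<new style name>, sometimes referred to as <shortened version>, "
            "is a style of design and visual arts that first appeared in "
            "<country> after <event>. It combined <style characteristic> "
            "and <another style characteristic>.")

def fill_blanks(template, lm_complete):
    """`lm_complete` is a hypothetical callable: prompt string -> completion."""
    filled = template
    for blank in re.findall(r"<[^>]+>", template):
        # Condition each fill on everything filled so far, so the description
        # stays internally coherent.
        answer = lm_complete(
            "Fill in the blank " + blank + " in this encyclopedia entry "
            "about an art style with a short, plausible phrase:\n" + filled)
        filled = filled.replace(blank, answer.strip(), 1)
    return filled

# Usage (hypothetical):
#   description = fill_blanks(TEMPLATE, lm_complete=call_gpt3)
#   image_prompt = "A woman riding a horse. " + description
```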
DALL-E can already model the distribution of possible contexts (image backgrounds, other objects, states of the object) + possible prompt meanings. And go from the description 1) to high-level concepts, 2) to ideas for implementing these concepts (relative placement of objects, ideas for how to merge concepts), 3) to low-level details. All within one forward pass, for all prompts! This is what astonished me most about DALL-E 1.
Importantly, placing, implementing, and combining concepts in a picture is done in a novel way without a provided specification. For style generation, it would need to model a distribution over all possible styles and use each style, all without a style specification. This doesn’t seem much harder to me and could probably be achieved with slightly different training. The procedure I described is just supposed to introduce helpful stochasticity in the prompt and use an established generation conduit.
...Hmm, now I’m wondering if feeding DALL-E an “in the style of [ ]” request with random keywords in the blank might cause it to produce replicable weird styles, or if it would just get confused and do something different every time.
I’d love to see it tried. Maybe even ask for “in the style of DALLE-2”?
“A woman riding a horse, in the style of DALLE-2”
I have no idea how to interpret this. Any ideas?
It seems like we got a variety of different styles, with red, blue, black, and white as the dominant colors.
Can we say that DALLE-2 has a style of its own?