Playing with DALL·E 2
I got access to DALL·E 2 yesterday. Here are some pretty pictures!
My goal was to try to understand what things DE2 could do well, and what things it had trouble understanding or generating. My general hypothesis is that it would do a better job with things that are easy to find on the internet (cute animals, digital scifi things, famous art) and less well with more abstract or more unusual things.
Here’s how it works: you put in a description of a picture, and it thinks for ~20 seconds and then produces 10 photos that are variations on that description. The diversity varies quite a bit depending on the prompt.
Let’s see some puppies!
One thing to be aware of when you see amazing pictures that DE2 generated: there is some cherry-picking going on. It often takes a few prompts to find something awesome, and each prompt produces 10 images, so whoever posted the picture might have looked at dozens of images or more.
Still, this is pretty great! Those are recognizably goldendoodle puppies, mostly in something approximating play position.
You can see that the proportions in the generated images are not quite right, and some of the detail is off if you look closely. For instance, the front legs are too long here, the face isn’t quite right, and the ears are a bit weird.
Still, it’s pretty amazing given that it generated this from scratch. Check out how realistic the grass looks. I also like that the background is blurred, though not quite in the way that a camera would do it—the transition is too abrupt.
OK, but the point of this isn’t that they have a great image generation model, though it’s clearly that. The key thing is its magical ability to actually follow instructions or descriptions of images. Particularly interesting is compositionality: can it combine concepts to generate something it’s never seen before? Answer: yes!
The concept of “kitten” is pretty simple, though note that a kitten can be rendered in a ton of ways, from line drawings to cute art to photorealistic. Pop art is more complicated: it’s a celebration of everyday images, and one of the most commonly known versions is Warhol’s collection of repeated images in a grid with neon colors that vary per cell. And it mostly gets those things right.
What about weird things? You can put in any input and it’ll do something.
None of those are Twitter-worthy, but with some trial and error you can get things that are interesting.
“Digital style” is one of the suggestions for getting better images.
X in Y style is fun, and it accounts for a lot of the images you see out in the world. Weirdly, it’s pretty sensitive to the exact order you put things in.
Back to puppies: you get pretty different results depending on the placement of “surrealistic”, even though the rephrasings seem semantically identical or at least very similar.
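If you want to poke at that order sensitivity a bit more systematically, one option is to generate the rephrasings programmatically and paste each one into DE2 by hand. Here is a minimal sketch of that idea. The `placements` function and the example words are made up for illustration, and nothing here talks to DE2 itself.

```python
def placements(words, modifier):
    """Return every prompt formed by inserting `modifier` at each position in `words`."""
    prompts = []
    for i in range(len(words) + 1):
        # Put the modifier before word i (or at the very end when i == len(words)).
        variant = words[:i] + [modifier] + words[i:]
        prompts.append(" ".join(variant))
    return prompts


if __name__ == "__main__":
    # Hypothetical base prompt; swap in whatever you are testing.
    for prompt in placements(["goldendoodle", "puppy", "playing"], "surrealistic"):
        print(prompt)
```

Some of the generated orderings will read awkwardly, so treat the list as a starting point and keep only the phrasings a person might actually type.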
One place where DE2 clearly falls down is in generating people. I generated an image for [four people playing poker in a dark room, with the table brightly lit by an ornate chandelier], and people didn’t look human—more like the typical GAN-style images where you can see the concept but the details are all wrong.
Update: image removed because the guidelines specifically call out not sharing realistic human faces.
Anything involving people, small defined objects, and so on, looks much more like the previous systems in this area. You can tell that it has all the concepts, but can’t translate them into something realistic.
This could be deliberate, for safety reasons—realistic images of people are much more open to abuse than other things. Porn, deep fakes, violence, and so on are much more worrisome with people. They also mentioned that they scrubbed out lots of bad stuff from the training data; possibly one way they did that was removing most images with people.
Things look much better with animals, and better again with an artistic style.
The cards aren’t right. Dice seem to be a lot easier.
People can also be pretty good if you don’t see faces, though the hands are definitely not right.
Stlalm Anit is my new slogan.
In general all writing I’ve seen is bad. I think this is less likely to be about safety, and more that it’s hard to learn language by looking at a lot of images. However, since DE2 is trained on text, it clearly knows a lot about language at some level—I would expect there’s plenty of data to put out coherent text. Instead it outputs nonsense, focusing on getting the fonts and the background right.
I definitely see serifs! I do not see sense.
Overall this is more powerful, flexible, and accurate than the previous best systems. It is still easy to find holes in it, but with some patience and willingness to iterate, you can make some amazing images.
In conclusion, generating a lot of images from a new state-of-the-art image generation system is fun. Thanks for reading! If there’s interest, I can also explore in-painting. Here are a few more gratuitous pics!
Reader requests:
Is that more or less cool than the actual statue they built in Miami?
The concept of beauty, according to DE2, is mostly women putting on makeup, which I can’t post due to restrictions on posting faces. These are really realistic, capturing ethnicity and expressing emotion, totally unlike the poker players from earlier. But there’s this one pastoral scene, which is nice.
For this last one, I edited out some floating writing on the left and asked it to generate [a girl in a beautiful serene forest]. This one was also nice:
Seems kind of like generic anime and not so much Finnegans Wake.
What are those penguins on the bottom left doing?!?
This series suggests that DE2 handles reflections pretty well, but either it doesn’t understand what it means for the reflection to show something else, or its prior that a reflection shows the thing looking in the mirror is too strong for it to override.
Here’s one where I edited out the cat in the mirror and changed the prompt to be about a dog, and it did something sensible.
It got it right twice out of 10 tries. That’s good, right?
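One rough way to think about that hit rate: suppose each image independently matches the prompt with probability 2/10. That is a big assumption, since it comes from a single batch of 10 and images from the same prompt are probably not independent. Under that assumption, the chance that a batch of 10 contains at least one good image can be sketched as below. The function name is hypothetical.

```python
def chance_of_at_least_one_hit(per_image_rate, batch_size):
    """Probability that at least one of `batch_size` independent images is a hit."""
    # P(at least one hit) = 1 - P(every image misses)
    return 1 - (1 - per_image_rate) ** batch_size


if __name__ == "__main__":
    rate = 2 / 10  # observed: 2 good images out of 10 (a tiny sample)
    print(round(chance_of_at_least_one_hit(rate, 10), 3))
```

With those numbers the result is about 0.89, which fits the cherry-picking point from earlier: even a fairly low per-image hit rate usually gives you something usable in a batch.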
I tried to ask for DALL·E by name, but that was a content policy violation.
It managed to get most of those elements in. Ultimately, though, none of them is really satisfying.
The good ones here had faces in them so I can’t post them. I like how random this one is.
...is surprisingly calm and beautiful.
Boo!
A pen and some gibberish… is actually a pretty good metaphor for intellectual progress?
“A spaceship made of legos” is just more of the same.
It got the marching part. I guess DE2 hasn’t ever played DnD.