DALL-E 2 is bad at prepositional phrases (above, inside) and negation. It can understand some sentence structure, but not reliably.
Goalpost moving. DALL-E 2 can generate samples matching lots of complex descriptions which are not ‘noun phrases’, and GLIDE is even better at it (also covered in the paper). You said it can’t. It can. Even narrowly, your claim is poorly supported, and in the context of the broader discussion, it is misleading. You also have not provided any sources or general reasons for this sweeping assertion to be true, or for the broader implications you claimed it supports.
In the first example, none of those are paragraphs longer than a single sentence.
What happened to ‘noun phrases’?
In the first example, the images are not stylistically coherent! The bees are illustrated inconsistently from picture to picture. They look like they were drawn by different people working off of similar prompts and with similar materials.
Those images are stylistically coherent in being clearly in a pastel style and matching the text input. That meets the demand, and this is only a quick throwaway project establishing a lower bound on what DALL-E 2 can do. “Attacks only get better.”
That they are not, in addition to this, perfectly consistent with each other is too bad, but increased similarity is well within the scope of a DALL-E architecture through, just off the top of my head, the variations functionality, direct optimization by backprop, or CLIP rejection sampling.
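To make the last of those concrete, here is a minimal sketch of what rejection sampling for consistency could look like. It is not DALL-E 2's or CLIP's actual API: the 2-d vectors are toy stand-ins for image embeddings (in practice one would presumably use CLIP image embeddings), and `cosine` and `rejection_sample` are hypothetical helper names. The idea is just: generate many candidates, score each against a reference image, keep the closest few.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rejection_sample(candidates, reference, keep):
    """Return the names of the `keep` candidates most similar to `reference`.

    `candidates` is a list of (name, embedding) pairs. The embeddings are
    assumed to come from some image encoder; here they are toy values.
    """
    ranked = sorted(candidates, key=lambda c: cosine(c[1], reference), reverse=True)
    return [name for name, _ in ranked[:keep]]

# Toy data: two samples resemble the reference, one does not.
candidates = [
    ("bee_a", [1.0, 0.0]),
    ("bee_b", [0.9, 0.1]),
    ("bee_c", [0.0, 1.0]),
]
reference = [1.0, 0.0]
kept = rejection_sample(candidates, reference, keep=2)
```

With more samples per prompt and a stricter `keep`, the surviving images get more alike, at the cost of more generation.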
You also have not provided any sources or general reasons for this sweeping assertion to be true.
The variational feature is not what I’m talking about.
I don’t know why you look at that and say it’s not.
You also have not provided any sources or general reasons for this sweeping assertion to be true.