Thank you for sharing all of these DALL-E tests!
I wonder whether it can reproduce three objects that reliably appear together in images. How about one of these prompts:
A bronze statue of three wise monkeys.
See no evil, hear no evil, speak no evil, statue of monkeys.
“A bronze statue of three wise monkeys.” Pretty solid!
“See no evil, hear no evil, speak no evil, statue of monkeys.”
Interesting. It seems to understand that the pattern should be “Three monkeys with hands on their heads somehow”, but it doesn’t seem to get that each monkey should have hands in a different position.
I wonder if that means gwern is wrong when he says DALL-E 2's problem is that the text model compresses information, and that the underlying “representation” model instead genuinely struggles with composition and with constraints of the type “there must be three X with only a single Y among them”.
Thank you so much for this! It did do quite well.
I have been trying to think of another set of three items that are reliably found together, but this is all I could come up with. Pairs of items are much easier to find.
This is so good.