Interesting. It seems to understand that the pattern should be “Three monkeys with hands on their heads somehow”, but it doesn’t seem to get that each monkey should have hands in a different position.
I wonder if that means gwern is wrong when he says DALL-E 2’s problem is that the text model compresses information, and that the underlying “representation” model instead genuinely struggles with composition and with constraints of the type “there must be three X with only a single Y among them”.