Very interesting that it can’t manage to count to five. That to me is strong evidence that DALL-E’s not “constructing” the scenes it depicts. I guess it has more of a sense of the relationships among scene elements? Like, “coffee shop” means there’s a window-like element, and if there’s a window element, then there’s some sort of scene through the window, and that’s probably some sort of rectangular building shape. Plausible guesses all the way down to the texture and color of skin or fur. Filling in the blanks on steroids, but with a complete lack of design or forethought.
Yeah, this matches my sense. It has really extensive knowledge of the expected relationships between elements, extending over a huge number of kinds of objects, so when it’s working in one of the areas that are easy for it, it can fill in the blanks in a way that looks very believable. But the extent to which it has a gears-y model of the scene seems very minimal. I think this also explains its difficulty with non-stereotypical scenes that don’t have a single focal element – if it’s filling in the blanks for both “pirate ship scene” and “dogs in Roman uniforms scene” at once, it gets more confused.