Challenging prompt ideas to try:
A row of five squares, in which the rightmost four squares each have twice the area of the square to their immediate left.
Screenshots from a novel game comparable in complexity to tic-tac-toe sufficient to demonstrate the rules of the game.
Elon Musk signing his own name in ASL.
The hands of a pianist as they play the first chord from Chopin’s Polonaise in Ab major, Op. 53.
Pages from a flip book of a water glass spilling.
First one: …yeah, no. DALL-E 2 can’t count to five, and it definitely doesn’t have the abstract reasoning to double areas. The image below is literally just “a horizontal row of five squares”.
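For reference, the prompt only constrains the geometry up to scale: if each square has twice the area of its left neighbour, the side lengths grow by a factor of √2. A minimal matplotlib sketch of what a correct output would have to contain (the sizes, spacing, and output filename are arbitrary choices, not anything DALL-E produced):

```python
# Reference rendering of the "doubling areas" prompt: five squares in a row,
# each with twice the area of the one to its immediate left, so sides scale by sqrt(2).
import math
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots(figsize=(10, 3))
side = 1.0   # side length of the leftmost square (arbitrary units)
x = 0.0
for _ in range(5):
    ax.add_patch(Rectangle((x, 0), side, side, fill=False, linewidth=2))
    x += side + 0.3          # small gap between squares
    side *= math.sqrt(2)     # doubling the area multiplies the side by sqrt(2)

ax.set_xlim(-0.5, x + 0.5)
ax.set_ylim(-0.5, 4.5)
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("five_squares_reference.png", bbox_inches="tight")
```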
Very interesting that it can’t manage to count to five. That to me is strong evidence that DALL-E’s not “constructing” the scenes it depicts. I guess it has more of a sense of relationships among scene elements? Like, “coffee shop” means there’s a window-like element, and if there’s a window element, then there’s some sort of scene through the window, and that’s probably some sort of rectangular building shape. Plausible guesses all the way down to the texture and color of skin or fur. Filling in the blanks on steroids, but with a complete lack of design or forethought.
Yeah, this matches with my sense. It has a really extensive knowledge of the expected relationships between elements, extending over a huge number of kinds of objects, and so it can (in one of the areas that are easy for it) successfully fill in the blanks in a way that looks very believable, but the extent to which it has a gears-y model of the scene seems very minimal. I think this also explains its difficulty with non-stereotypical scenes that don’t have a single focal element – if it’s filling in the blanks for both “pirate ship scene” and “dogs in Roman uniforms scene” it gets more confused.
You’re making my dreams come true. I really want to see the Elon Musk one :)
Edit: or the water glass spilling. That’s the one whose performance I’m most uncertain about.
The Elon Musk one has realistic faces so I can’t share it; I have, however, confirmed that DALL-E does not speak ASL with “The ASL word for ‘thank you’”:
We’ve got some funky fingers here: hands with six fingers, a sort of double-tipped finger, an extra joint on the index finger in picture (1, 4). Fascinating.
It seems to be mostly trying to go for the “I love you” sign, perhaps because that’s one of the most commonly represented ones.
I’m curious why this prompt resulted in overwhelmingly black-looking hands, especially considering that all the other prompts I’ve seen result in white subjects being represented. Any theories?
It’s unnatural, yes: ASL users are predominantly white, and the people involved in ASL are even more so (I went to NTID and the national convention, so can speak first-hand, but you can also check Google Images for that query and it’ll look like what you expect, which is amusing because ‘Deaf’ culture is so university & liberal-centric). So it’s not that ASL diagrams or photographs in the wild really do look like that; they don’t.
Overrepresentation of DEI material in the supersekrit licensed databases would be my guess. Stock photography sources are rapidly updated for fashions, particularly recent ones, and you can see this occasionally surfacing in weird queries. (An example going around Twitter which you can check for yourself: “happy white woman” in Google will turn up a lot of strange photos for what seems like a very straightforward query.) Which parts are causing it is a better question: I wouldn’t expect there to be much Deaf stock photo material which had been updated, or much ASL material at all, so maybe there’s bleedthrough from all of the hand-centric (e.g. ‘Black Power salute’, upraised Marxist fists, protests) iconography? There being so much of the latter and so little of the former that the latter becomes the default kind of hand imagery.
It must be something like that, but it still feels like there’s a hole there. The query is for “ASL”, not “Hands”, and these images don’t look like something from a protest. The top left might be vaguely similar to some kind of street gesture.
I’m curious what the role of the query writer is. Can you ask DALL-E for “this scene, but with black skin colour”? I got a sense that updating areas was possible but inconsistent. Could DALL-E learn to return more of X to a given person by receiving feedback? I really don’t know how complicated the process gets.
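For what it’s worth, the public OpenAI image API does now expose exactly this kind of region editing: you pass the original image plus a mask whose transparent pixels mark the area to regenerate under a new prompt. A rough sketch, assuming the current `openai` Python client and made-up filenames; the DALL-E 2 preview the commenters were using may have worked differently:

```python
# Hedged sketch of region editing ("updating areas") via the OpenAI images API.
# Filenames are placeholders; the mask's transparent pixels mark the editable region.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("asl_scene.png", "rb"),   # an earlier generation
    mask=open("hands_mask.png", "rb"),   # transparent where the hands are
    prompt="this scene, but with black skin colour",
    n=4,
    size="1024x1024",
)
print([item.url for item in result.data])
```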
ASL will always be depicted by a model like DALL-E as hands; I am sure that there are non-pictorial ways to write down ASL but I can’t recall them, and I actually took ASL classes. So that query should always produce hands in it. Then, because actual ASL diagrams will be rare and overwhelmed by leakage from more popular classes (keep in mind that deafness is well under 1% of the US population, even including people like me who are otherwise completely uninvolved and invisible, and basically any political fad whatsoever will rapidly produce vastly more material than even core deaf topics), plus maybe some more unCLIP looseness, you end up with whatever the default kind of hand imagery happens to be...
OA announced its new ‘reducing bias’ DALL-E 2 today. Interestingly, it appears to do so by secretly editing your prompt to inject words like ‘black’ or ‘female’.
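The exact mechanism isn’t documented, but the observed behaviour is consistent with something as crude as appending a demographic descriptor to a fraction of prompts that appear to depict people. A purely illustrative sketch of that kind of rewrite (the trigger words, descriptor list, and injection rate below are invented here, not OpenAI’s actual rule):

```python
# Illustrative sketch of "reducing bias" by silently rewriting prompts.
# The word lists and injection rate are invented for illustration only.
import random

PERSON_WORDS = {"person", "man", "woman", "doctor", "nurse", "ceo", "builder"}
DESCRIPTORS = ["black", "asian", "hispanic", "white", "female", "male"]

def maybe_inject(prompt: str, rate: float = 0.5) -> str:
    """Append a random demographic descriptor if the prompt seems to depict a person."""
    if set(prompt.lower().split()) & PERSON_WORDS and random.random() < rate:
        return f"{prompt}, {random.choice(DESCRIPTORS)}"
    return prompt

print(maybe_inject("a portrait of a doctor"))  # e.g. "a portrait of a doctor, female"
```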
“Pages from a flip book of a water glass spilling”: I... think DALL-E 2 does not know what a flip book is.
I...think it just does not understand the physics of water spilling, period.
Relatedly, DALL-E is a little confused about how Olympic swimming is supposed to work.
This is interesting, because you’d think it would at least understand that the cup should be tipping over. Makes me think it is treating the cup and the water as two distinct objects, and doesn’t really understand that the cup tipping over is what would cause the water to spill. It does seem to understand that the water should be located “inside” the cup, but probably purely in an “it looks like the water is inside the cup” sense. I don’t think DALL-E understands the idea of “inside” as an actual location.
I wonder if its understanding of the world is just 2D or semi-3D. Perhaps training it on photogrammetry datasets (photos of the same objects but from multiple points of view) would improve that?
Slightly reworded to “a game as complex as tic-tac-toe, screenshots showing the rules of the game”. I am pretty sure DALL-E is not able to generate and model consistent game rules, though.
At least it seems to have figured out we wanted a game that was not tic-tac-toe.
Depends on if it generates stuff like this if you ask it for tic-tac-toe :P
What about the combo: “a tic-tac-toe board position”, “a tic-tac-toe board position with X winning”, and “a tic-tac-toe board position with O winning”? Would it give realistic positions matching the descriptions?
I really doubt it, but I’ll give it a try once I’m caught up on all the requested prompts here!
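One way to score the results of the tic-tac-toe prompts above would be to check whether each depicted board is even a reachable position with the claimed winner. A small checker sketch, assuming the board is transcribed as a 9-character string of ‘X’, ‘O’, and ‘.’:

```python
# Check whether a tic-tac-toe board (9 chars of 'X', 'O', '.') is a reachable position
# and whether the claimed winner actually has three in a row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def has_line(board: str, p: str) -> bool:
    return any(board[a] == board[b] == board[c] == p for a, b, c in LINES)

def is_legal(board: str, claimed_winner: str | None = None) -> bool:
    x, o = board.count("X"), board.count("O")
    x_wins, o_wins = has_line(board, "X"), has_line(board, "O")
    if not (0 <= x - o <= 1):      # X moves first, so the counts differ by at most one
        return False
    if x_wins and o_wins:          # both players cannot have completed a line
        return False
    if x_wins and x != o + 1:      # X's winning move must have been the last move
        return False
    if o_wins and x != o:          # O's winning move must have been the last move
        return False
    if claimed_winner in ("X", "O"):
        return has_line(board, claimed_winner)
    return True

print(is_legal("XXXOO....", "X"))  # True: a plausible "X winning" position
print(is_legal("XXX......", "X"))  # False: X cannot have moved three times in a row
```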