This is interesting, because you’d think it would at least understand that the cup should be tipping over. It makes me think the model treats the cup and the water as two distinct objects and doesn’t really grasp that the cup tipping over is what causes the water to spill. It does understand that the water should be located “inside” the cup, but probably purely in a “it looks like the water is inside the cup” sense. DALL-E doesn’t seem to understand “inside” as an actual spatial relationship.
I wonder if its understanding of the world is just 2D or semi-3D. Perhaps training it on photogrammetry datasets (photos of the same objects but from multiple points of view) would improve that?