gwern comments on What DALL-E 2 can and cannot do

gwern 7 Jul 2022 1:22 UTC
4 points
I agree. What sort of images would it even be trained on in the first place which would allow that? It can’t train on a big montage or landscape shot because the dimensions are wrong and the core model is trained on very small samples to boot, with upscalers handling most of the pixel generation. I would check Google & Yandex image search to see if there are any photographs online with the two cabins in the same photograph which could hypothetically enable that. I would also try using the closest street addresses to see if one can prompt it directly, since that is likely what would be in the text caption of hypothetical images. Prompting for a photograph rather than a watercolor is another obvious change to test. A more stringent test would be to do inpainting/uncropping of photographs of both: if it really does ‘know’, it should be highly likely to fill in the other cabin in the right location and surroundings when you ‘pan left’ or whatever. Otherwise, the simpler explanation is that ‘cabins’ are a fairly stereotypical kind of architecture and it just got lucky. OA says DALL-E 2 is well into the low millions of images generated and climbing as fast as overloaded GPUs can spit them out (<=50 completions per day per >30k invited people thus far...), so we’re not even appealing that hard to chance here.
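To make the arithmetic behind that last point concrete, here is a rough sketch of the daily ceiling implied by the two figures quoted above. It treats both as exact (50 completions per user per day, 30,000 invited users) even though the comment gives them only as bounds, and the function name is made up for illustration.

```python
def daily_completion_ceiling(max_per_user_per_day: int, invited_users: int) -> int:
    """Upper bound on completions per day if every invited user hit the cap.

    Both inputs are taken from the comment's rough figures; real usage is
    presumably lower, since not every user exhausts the daily limit.
    """
    return max_per_user_per_day * invited_users


ceiling = daily_completion_ceiling(50, 30_000)
print(f"{ceiling:,} completions/day at most")  # prints "1,500,000 completions/day at most"
```

At that rate even a few days of use lands in the low millions, so a handful of coincidentally accurate-looking outputs is not surprising.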