I always thought it was weird that AI struggles with text, just as in my dreams. Every time I open a book in a dream, the text is jumbled and nonsensical, and I can immediately tell I’m dreaming.
That is probably some relatively uninteresting quirk of DALL-E 2 rather than a deep insight about generative models & dreaming. DL training is counterintuitive: did ProGAN/StyleGAN screw up meme captions in generated images, producing what looked like ‘moon runes’ or ‘Cyrillic writing’, because of some failure in GAN dynamics? Nah, it was just that Nvidia turned on horizontal flipping as a data augmentation, so every piece of text was seen both normally and through a mirror, and of course the GAN couldn’t figure out which was the ‘real’ appearance of writing. Disable that, as in TADNE, and writing works a lot better: in TADNE, the Japanese writing looks Japanese (although Japanese speakers confirm it’s complete nonsense). People have been pointing out that while DALL-E 2 does not do writing well, some of the smaller, inferior competing models generate writing almost too readily, even though they weren’t designed to generate text inside images either. That’s just how they came out.
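To make the flipping point concrete, here is a minimal sketch, written as a generic PyTorch/torchvision pipeline rather than NVIDIA’s actual StyleGAN training code, of how a horizontal-flip augmentation puts mirrored copies of every caption into the training distribution; with a 50% flip probability, the model sees writing both ways equally often and has no way to learn its ‘real’ orientation:

```python
# Minimal sketch of the augmentation issue; not NVIDIA's actual StyleGAN code.
import torch
from torchvision import transforms
from PIL import Image

with_mirror = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.RandomHorizontalFlip(p=0.5),  # ~half of all text is seen mirror-reversed
    transforms.ToTensor(),
])

no_mirror = transforms.Compose([  # drop the flip and writing keeps a single orientation
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

# Hypothetical input file; any meme with a caption will do.
img = Image.open("meme_with_caption.png").convert("RGB")
batch = torch.stack([with_mirror(img) for _ in range(8)])  # roughly half come out flipped
```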
Yeah, I see what you mean. But even if it gets the correct letter shapes, it’s still nonsense, right? It’s writing as a visual feel, not anything actually written. Maybe writing is in a sense more difficult for image generators (a piece of paper with written text is much more information-dense than, say, a patch of grass), especially if it’s integrated into a scene.
Low-res images (256px, say) are a pretty hard source to try to learn language from! Which is not to say that some mad scientist won’t try it anyway, but you certainly can’t expect GPT-3 fluency from an image model which sees language mostly as strip-mall signage or the occasional axis label, all downscaled to near-unreadability. (Note that the ‘captions’ of images often, or usually, do not transcribe the text visible in the image.) You might think they would start from a pretrained small GPT-3 or something (perhaps run an OCR NN to transcribe all text in the images and append that to the caption), but they don’t seem to (at least, checking the DALL-E 2 & GLIDE papers indicates no such use, and the text conditioning appears to be trained from scratch alongside the diffusion models, so it couldn’t be a pretrained GPT-3?). Oh well. You can’t do everything, you know.
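For what it’s worth, the hypothetical OCR-then-append preprocessing mentioned above would only be a few lines; here is an illustrative sketch using the off-the-shelf pytesseract wrapper, which is emphatically not something the DALL-E 2 or GLIDE papers describe doing:

```python
# Illustrative sketch only: OCR-augmented captions. The DALL-E 2 & GLIDE papers
# do not describe this step. Assumes the Tesseract binary is installed locally.
from PIL import Image
import pytesseract

def augment_caption(image_path: str, caption: str) -> str:
    """Append any legible text found in the image to its caption."""
    visible_text = pytesseract.image_to_string(Image.open(image_path)).strip()
    if visible_text:
        caption = f"{caption} | visible text: {visible_text}"
    return caption

# Hypothetical usage:
# augment_caption("storefront.jpg", "a photo of a strip mall at dusk")
```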