In general, all the writing I’ve seen from it is bad. I think this is less likely to be about safety, and more about the fact that it’s hard to learn language by looking at a lot of images. However, since DALL-E 2 is trained on text as well as images, it clearly knows a lot about language at some level; I would expect there’s plenty of data for it to put out coherent text. Instead it outputs nonsense, while focusing on getting the fonts and the background right.
It’s definitely possible to get a diffusion model to write the text from a prompt into an image. I made a model that does this late last year. (blogpost / example outputs*)
The text-conditioning mechanism (cross-attention) I use is a little different from the ones in GLIDE and DALL-E 2, but I doubt this makes a huge difference.
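For anyone who hasn’t seen the mechanism before, here’s a minimal sketch of what “cross-attention text conditioning” means here: the diffusion U-Net’s image feature tokens act as queries attending over the outputs of a text encoder. All names and dimensions below are illustrative assumptions, not the actual architecture of my model, GLIDE, or DALL-E 2.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """One cross-attention block: flattened image feature tokens (queries)
    attend over text-encoder outputs (keys/values). Dimensions and names
    are illustrative, not taken from any specific model."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(img_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=img_dim,   # query width (U-Net feature channels)
            kdim=txt_dim,        # key width (text encoder hidden size)
            vdim=txt_dim,        # value width
            num_heads=n_heads,
            batch_first=True,
        )

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (batch, H*W, img_dim) -- a flattened U-Net feature map
        # txt_tokens: (batch, seq_len, txt_dim) -- text encoder outputs
        attended, _ = self.attn(self.norm(img_tokens), txt_tokens, txt_tokens)
        return img_tokens + attended  # residual connection back into the U-Net


# Quick shape check
if __name__ == "__main__":
    block = TextCrossAttention()
    img = torch.randn(2, 16 * 16, 512)
    txt = torch.randn(2, 77, 768)
    print(block(img, txt).shape)  # torch.Size([2, 256, 512])
```

A block like this typically sits at several resolutions inside the U-Net, so the denoiser can consult the prompt at every step; the variations between models are mostly in where these blocks go and how the text encoder is trained, which is why I doubt the mechanism itself is the bottleneck.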
I’m actually a little surprised that the OpenAI models don’t learn to write coherent text, since they’re bigger than mine and were trained for longer on more data.
But then, I’m much more focused on this one specific capability, so I make it easy for the model: a full ~50% of my training images have text in them, and the “prompt” in my setup always contains an automatic transcript of the text in the image (if any), and never a description: not a description that happens to quote the transcript, not one that merely summarizes it, and so on.
The OpenAI models have to solve a more abstract version of the problem, and the problem is relevant to (I would imagine) a much smaller fraction of their training examples.
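To make the data-prep convention above concrete, here’s a hedged sketch. `ocr_fn`, `TrainingExample`, and `build_example` are hypothetical names I’m introducing for illustration; the real pipeline is more involved, but the key design choice is the same: the prompt *is* the transcript, verbatim, never a caption.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingExample:
    image_path: str
    prompt: str   # verbatim transcript of any text in the image; "" if none

def build_example(image_path: str, ocr_fn: Callable[[str], str]) -> TrainingExample:
    """ocr_fn is a placeholder for whatever OCR pipeline you use
    (e.g. pytesseract.image_to_string). The prompt is the raw OCR
    transcript, never a description or a summary of the image."""
    transcript = ocr_fn(image_path) or ""
    return TrainingExample(image_path=image_path, prompt=transcript.strip())
```

With this convention, the mapping from prompt tokens to rendered glyphs is as direct as it can be; a caption-trained model only sees that mapping in the rare cases where a caption happens to quote the in-image text exactly.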
*check the alt text if you want to know what text the model is attempting to write