Using CLIP is a pretty weird way to go. It’s like using a CNN classifier to generate images: it can be done, but, like a dog walking on its hind legs, the surprise is that it works at all.
If you think about how a contrastive loss works, it’s perhaps less surprising why CLIP-guided images look the way they do, and do things like repeat an object many times: if you have a prompt like “Mickey Mouse”, what could be even more Mickey-Mouse-y than Mickey Mouse tiled a dozen times? That surely maximizes the embedding’s encoding of ‘Mickey Mouse’, and its distance from non-Disney-related image embeddings like, say, “dog” or “Empire State Building”! Whether you can really induce sharp coherent images from any scaled-up CLIP like ALIGN is unclear: contrastive losses just might not learn these things, and the generative model needs to do all the work. No matter how exquisitely accurately a contrastive model learns every detail about Mickey Mouse, it seems like it’d still be the case that a dozen Mickey Mouses tiled together is ‘more Mickey-Mouse-y’ than a single beautifully coherent Mickey Mouse.
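(If you want to check that intuition yourself, here is a minimal sketch using the open-source CLIP package: it scores an image against the prompt, then scores the same image tiled 3×3, so you can see whether tiling raises the cosine similarity. The file name `mickey.png` is just a placeholder.)

```python
# Minimal sketch: compare CLIP similarity for an image vs. the same image tiled.
# Assumes the open-source CLIP package (github.com/openai/CLIP) and PyTorch;
# "mickey.png" is a placeholder for whatever image you want to test.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(pil_image, prompt):
    """Cosine similarity between CLIP's image embedding and text embedding."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

single = Image.open("mickey.png").convert("RGB")
w, h = single.size
tiled = Image.new("RGB", (w * 3, h * 3))
for i in range(3):
    for j in range(3):
        tiled.paste(single, (i * w, j * h))

print("single image:", clip_score(single, "Mickey Mouse"))
print("3x3 tiled:   ", clip_score(tiled, "Mickey Mouse"))
```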
(CLIP would still remain useful as an extremely handy way to take any image generative model and quickly turn it into a text-editable model, or possibly a text->image model. One could probably do better with an explicit approach like DALL-E, but CLIP is available now and DALL-E is not.)
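(For concreteness, the generic recipe is: take any differentiable generator and do gradient ascent on its latent to maximize CLIP similarity to the prompt. A rough sketch, where `generator` is a hypothetical stand-in for whatever model you plug in—a VQGAN decoder, BigGAN, etc.—assumed to output 224×224 images with values in [0, 1]:)

```python
# Rough sketch of CLIP-guided generation: optimize a generator's latent so the
# CLIP embedding of the generated image matches the prompt's text embedding.
# `generator` is a hypothetical stand-in for any differentiable image model
# (VQGAN decoder, BigGAN, ...) assumed to output (1, 3, 224, 224) in [0, 1].
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device, jit=False)

# CLIP's input normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_guided_sample(generator, latent_shape, prompt, steps=300, lr=0.05):
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        txt_emb = clip_model.encode_text(text)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    z = torch.randn(latent_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        image = (generator(z) - CLIP_MEAN) / CLIP_STD    # normalize for CLIP
        img_emb = clip_model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = -(img_emb * txt_emb).sum()                # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return generator(z)
```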
If you are interested in SOTA image generation quality rather than weird CLIP hacks, you should look at:
“VQGAN: Taming Transformers for High-Resolution Image Synthesis”, Esser et al 2020
DALL-E/CogView
“SR3: Image Super-Resolution via Iterative Refinement”, Saharia et al 2021 / “Diffusion Models Beat GANs on Image Synthesis”, Dhariwal & Nichol 2021
DenseFlow
“Alias-Free GAN (Generative Adversarial Networks)”, Karras et al 2021