These are very impressive! It looks like it gets the concepts, but lacks global coherency.
Could anyone comment on how far we are from results of similar quality as the training set? Can we expect better results just by scaling up the generator or CLIP?
Using CLIP is a pretty weird way to go. It’s like using a CNN classifier to generate images: it can be done, but, like a dog walking on its hind legs, the surprise is that it works at all, not that it works well.
If you think about how a contrastive loss works, it’s perhaps less surprising why CLIP-guided images look the way they do, and do things like try to repeat an object many times: if you have a prompt like “Mickey Mouse”, what could be even more Mickey-Mouse-y than Mickey Mouse tiled a dozen times? That surely maximizes how strongly its embedding encodes ‘Mickey Mouse’, and its distance from non-Disney-related image embeddings like, say, “dog” or “Empire State Building”! Whether you can really induce sharp coherent images from any scaled-up CLIP like ALIGN is unclear: contrastive losses just might not learn these things, so the generative model has to do all the work. No matter how exquisitely accurately a contrastive model learns every detail about Mickey Mouse, it seems like it’d still be the case that a dozen Mickey Mouses tiled together is ‘more Mickey-Mouse-y’ than a single beautifully coherent Mickey Mouse.
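To make the failure mode concrete, here is a minimal sketch (assuming OpenAI’s released `clip` package; the prompt and model choice are just examples) of the score that CLIP-guided generation climbs: a single cosine similarity per image. Nothing in it measures global coherence, so an image tiled with copies of the subject can score at least as well as one well-composed instance.

```python
# Minimal sketch of the CLIP score that guided generation maximizes.
# Assumes OpenAI's `clip` package (https://github.com/openai/CLIP) and PyTorch.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # contrastively-trained image/text encoders

with torch.no_grad():
    tokens = clip.tokenize(["Mickey Mouse"]).to(device)
    text_emb = model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def clip_score(images: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each image and the prompt embedding.

    `images` is assumed to already be preprocessed to (N, 3, 224, 224).
    The score is one scalar per image: tiling a dozen Mickey Mouses can push
    it up just as well as a single coherent Mickey Mouse, because the
    objective never looks at composition, only at embedding similarity.
    """
    img_emb = model.encode_image(images)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (img_emb * text_emb).sum(dim=-1)
```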
(CLIP would still remain useful as an extremely handy way to take any image generative model and quickly turn it into a text-editable model, or possibly a text->image model. One could probably do better with an explicit approach like DALL-E, but CLIP is available now and DALL-E is not.)
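As a sketch of that “bolt CLIP onto any generator” recipe (building on the `clip_score` snippet above; the differentiable generator `g` and its latent size are hypothetical stand-ins, not any particular model’s API): freeze the generator and simply ascend the CLIP score with respect to its latent.

```python
# Hypothetical sketch: turn a pretrained, differentiable generator `g` into a
# rough text->image model by optimizing its latent against clip_score() above.
import torch.nn.functional as F

z = torch.randn(1, 512, device=device, requires_grad=True)  # latent size is a placeholder
opt = torch.optim.Adam([z], lr=0.05)

for step in range(300):
    img = g(z)                                   # assumed to return (1, 3, H, W) RGB in [0, 1]
    img = F.interpolate(img, size=224, mode="bilinear", align_corners=False)
    # (Proper CLIP preprocessing/normalization and any image regularizers are omitted.)
    loss = -clip_score(img).mean()               # maximize similarity to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the published CLIP-guidance notebooks add augmentations (random crops and the like) to make the gradient signal less brittle, but the core recipe is about this small, which is why it spread so quickly despite the artifacts described above.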
If you are interested in SOTA image generation quality rather than weird CLIP hacks, you should look at:
“VQGAN: Taming Transformers for High-Resolution Image Synthesis”, Esser et al 2020
DALL-E/CogView
“SR3: Image Super-Resolution via Iterative Refinement”, Saharia et al 2021 / “Diffusion Models Beat GANs on Image Synthesis”, Dhariwal & Nichol 2021
DenseFlow
“Alias-Free GAN (Generative Adversarial Networks)”, Karras et al 2021