The idea that conditioning on unCLIP-produced image vectors instead of text vectors would improve diversity seems very bewildering. And I really have a hard time swallowing the explanation “maybe this happens because for a given CLIP image vector v, there’s a large equivalence class of images that all approximately encode to v.” After all, this explanation doesn’t actually have anything to do with conditioning on image vs. text vectors: whether we condition on image or text vectors, the final generated image still belongs to a large equivalence class of images that encode to approximately the same image vector.
(Unless the relevant thing is that the image vector on which the image generator conditions has a large equivalence class of images that encode to the same vector? But it’s not clear to me how that should be relevant.)
Spitballing other possible explanations, is it possible that GLIDE’s lack of diversity comes from conditioning on a variable-length tokenization of the caption (which is potentially able to pack in a diversity-killing amount of information, relative to a fixed-length vector)? I’m not really happy with this explanation either. For instance, it seems like maybe the model that produced the roses on the left-hand side of the diversity-fidelity figure was also given a variable-length encoding of the caption? I’m having a hard time telling from what’s written in the paper,[1] but if so, that would sink this explanation.
The apparent ambiguity about whether the image generator in this model was conditioned on a variable-length tokenization of the caption (in addition to the image vector produced by unCLIP) is probably a reading comprehension issue on my part. I’d appreciate the help of anyone who can resolve this one way or the other.
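To make the capacity contrast in that guess concrete, here’s a minimal shape-level sketch of the two kinds of conditioning signal. The dimensions are just illustrative CLIP-style values I’ve assumed, not the paper’s actual ones:

```python
import torch

# Hypothetical, illustrative dimensions -- not the paper's exact values.
d_model = 768          # width of the conditioning embeddings
max_caption_len = 77   # CLIP-style context length for the tokenized caption

# GLIDE-style conditioning signal: one embedding per caption token, so the
# amount of conditioning information grows with caption length (up to the
# maximum context length).
caption_tokens = torch.randn(max_caption_len, d_model)   # (77, 768)

# unCLIP-style conditioning signal: a single fixed-length CLIP vector,
# regardless of how long or detailed the caption is.
clip_image_vector = torch.randn(d_model)                  # (768,)

print(caption_tokens.numel(), "numbers vs.", clip_image_vector.numel(), "numbers")
# The sequence can carry far more raw numbers to condition on, which is the
# sense in which it might "pack in a diversity-killing amount of information"
# relative to the fixed-length vector.
```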
I completely agree that the effects of using unCLIP are mysterious; in fact, they’re the opposite of what I would have predicted.
I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it. It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice. If OpenAI predicted them, then they know something I don’t.
For instance, it seems like maybe the model that produced the roses on the left-hand side of the diversity-fidelity figure was also given a variable-length encoding of the caption? I’m having a hard time telling from what’s written in the paper
Yes, that model did get to see a variable-length encoding of the caption. As far as I can tell, the paper never tries a model that only has a CLIP vector available, with no sequential pathway.
Again, it’s very mysterious that (GLIDE’s pathway + unCLIP pathway) would increase diversity over GLIDE, since these models are given strictly more information to condition on!
(Low-confidence guess follows. The generator views the sequential representation in its attention layers, and in the new model, these layers are also given a version of the CLIP vector, as four “tokens,” each a different projection of the vector. [The same vector is also, separately, added to the model’s more global “embedding” stream.] In attention, there is competitive inhibition between looking at one position and looking at another. So, it’s conceivable that the CLIP “tokens” are so information-rich that the attention fixates on them, ignoring the text-sequence tokens. If so, it would ignore some information that GLIDE does not ignore.)
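Here’s a rough sketch of the conditioning pathway I’m describing, just to make the “competitive inhibition” point concrete. The class, parameter names, and sizes are my own placeholders, not the paper’s actual code:

```python
import torch
import torch.nn as nn

class CLIPConditioningSketch(nn.Module):
    """Toy sketch of the conditioning pathway described above.
    Names and sizes are assumptions for illustration, not the paper's code."""

    def __init__(self, clip_dim=768, d_model=1024, n_clip_tokens=4):
        super().__init__()
        # One linear map, sliced into n_clip_tokens pieces, so each extra
        # "token" is a different projection of the same CLIP vector.
        self.to_clip_tokens = nn.Linear(clip_dim, n_clip_tokens * d_model)
        # Separate projection added to the global timestep/embedding stream.
        self.to_global = nn.Linear(clip_dim, d_model)
        self.n_clip_tokens = n_clip_tokens
        self.d_model = d_model

    def forward(self, text_tokens, clip_vec, timestep_emb):
        # text_tokens:  (batch, seq_len, d_model) -- GLIDE-style caption pathway
        # clip_vec:     (batch, clip_dim)         -- unCLIP-produced image vector
        # timestep_emb: (batch, d_model)          -- global "embedding" stream
        b = clip_vec.shape[0]
        clip_tokens = self.to_clip_tokens(clip_vec).view(b, self.n_clip_tokens, self.d_model)
        # The attention layers see both the text tokens and the CLIP "tokens".
        # Because attention weights over this context are softmax-normalized,
        # attending more to the CLIP tokens means attending less to the text
        # tokens -- the "competitive inhibition" in the guess above.
        attn_context = torch.cat([text_tokens, clip_tokens], dim=1)
        global_emb = timestep_emb + self.to_global(clip_vec)
        return attn_context, global_emb

# e.g. cond = CLIPConditioningSketch()
#      ctx, emb = cond(torch.randn(2, 77, 1024), torch.randn(2, 768), torch.randn(2, 1024))
```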
It’s also noteworthy that they mention the (much more obvious) idea of conditioning solely on CLIP text vectors, citing Katherine Crowson’s work:
Building on this observation, another approach would be to train the decoder to condition on CLIP text embeddings [9] instead of CLIP image embeddings
...but they never actually try this out in a head-to-head comparison. For all we know, a model conditioned on CLIP text vectors, trained with GLIDE’s scale and data, would do better than GLIDE and unCLIP. Certainly nothing in the paper rules out this possibility.
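Part of what makes this ablation feel so natural is that CLIP text and image embeddings live in the same space with the same dimensionality, so swapping one for the other doesn’t change the decoder’s interface at all. A toy shape-level sketch (dimension assumed, decoder stubbed out):

```python
import torch

clip_dim = 768  # assumed CLIP embedding width; illustrative only

def decode(conditioning_vec):
    # Stand-in for the diffusion decoder: it only ever sees a clip_dim vector,
    # and has no way to know whether it came from the text or image encoder.
    assert conditioning_vec.shape == (clip_dim,)

text_embedding = torch.randn(clip_dim)    # Crowson-style: condition on the CLIP text embedding
image_embedding = torch.randn(clip_dim)   # unCLIP: condition on the prior's CLIP image embedding

decode(text_embedding)
decode(image_embedding)
```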