I completely agree that the effects of using unCLIP are mysterious, in fact the opposite of what I’d predict them to be.
I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it. It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice. If OpenAI predicted them, then they know something I don’t.
For instance, it seems like maybe the model that produced the roses on the left-hand side of the diversity-fidelity figure was also given a variable-length encoding of the caption? I’m having a hard time telling from what’s written in the paper.
Yes, that model did get to see a variable-length encoding of the caption. As far as I can tell, the paper never tries a model that only has a CLIP vector available, with no sequential pathway.
Again, it’s very mysterious that (GLIDE’s pathway + unCLIP pathway) would increase diversity over GLIDE, since these models are given strictly more information to condition on!
(Low-confidence guess follows. The generator views the sequential representation in its attention layers, and in the new model, these layers are also given a version of the CLIP vector, as four “tokens,” each a different projection of the vector. [The same vector is also, separately, added to the model’s more global “embedding” stream.] In attention, there is competitive inhibition between looking at one position and looking at another. So, it’s conceivable that the CLIP “tokens” are so information-rich that the attention fixates on them, ignoring the text-sequence tokens. If so, it would ignore some information that GLIDE does not ignore.)
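To make that guess concrete, here’s a minimal PyTorch-style sketch of the two conditioning pathways as I understand them from the paper’s description. All the names, dimensions, and single-linear-layer projections here are my own assumptions for illustration, not OpenAI’s actual code.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- not taken from the paper.
CLIP_DIM = 768      # dimensionality of the CLIP image embedding
MODEL_DIM = 512     # width of the diffusion model's attention layers
N_EXTRA_TOKENS = 4  # the paper projects the CLIP vector into four "tokens"

class CLIPConditioning(nn.Module):
    """Sketch of the two pathways described above:
    (1) project the CLIP image embedding into four extra "tokens" appended to
        the text-sequence encoding that the attention layers see, and
    (2) add a projection of the same vector to the global embedding stream.
    """
    def __init__(self):
        super().__init__()
        # (1) one linear map producing all four "tokens" at once (an assumption)
        self.to_extra_tokens = nn.Linear(CLIP_DIM, N_EXTRA_TOKENS * MODEL_DIM)
        # (2) projection into the global embedding stream
        self.to_global_emb = nn.Linear(CLIP_DIM, MODEL_DIM)

    def forward(self, clip_emb, text_tokens, global_emb):
        # clip_emb:    (batch, CLIP_DIM)       -- CLIP image embedding
        # text_tokens: (batch, seq, MODEL_DIM) -- GLIDE-style caption encoding
        # global_emb:  (batch, MODEL_DIM)      -- e.g. the timestep embedding
        b = clip_emb.shape[0]
        extra = self.to_extra_tokens(clip_emb).view(b, N_EXTRA_TOKENS, MODEL_DIM)
        # Attention now attends over text tokens *and* CLIP "tokens"; this is
        # where the two sources compete for attention mass, per the guess above.
        attn_context = torch.cat([text_tokens, extra], dim=1)
        # The same vector is also, separately, folded into the global stream.
        global_emb = global_emb + self.to_global_emb(clip_emb)
        return attn_context, global_emb
```

If the guess is right, the interesting quantity would be how much attention mass lands on those last four positions of `attn_context` versus the text tokens.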
It’s also noteworthy that they mention the (much more obvious) idea of conditioning solely on CLIP text vectors, citing Katherine Crowson’s work:
Building on this observation, another approach would be to train the decoder to condition on CLIP text embeddings [9] instead of CLIP image embeddings
...but they never actually try this out in a head-to-head comparison. For all we know, a model conditioned on CLIP text vectors, trained with GLIDE’s scale and data, would do better than GLIDE and unCLIP. Certainly nothing in the paper rules out this possibility.