Apparently, CLIP text vectors are very distinguishable from CLIP image vectors. I don’t think this should be surprising. Text vectors aren’t actually expressing images, after all; they’re expressing probability distributions over images.
They are more closely analogous to the outputs of a GPT model’s final layer than they are to individual tokens from its vocab. The output of GPT’s final layer doesn’t “look like” the embedding of a single token, nor should it. Often the model wants to spread its probability mass across a range of alternatives.
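To spell out the analogy, here’s a minimal sketch in plain PyTorch (random stand-in tensors, not any particular GPT codebase): the final hidden state gets read out against the whole embedding matrix and softmaxed, so it parameterizes a distribution over the vocab rather than resembling any single token’s embedding.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 50257, 768
token_embeddings = torch.randn(vocab_size, d_model)  # stand-in for the (tied) embedding matrix
final_hidden = torch.randn(d_model)                   # stand-in for GPT's last-layer output at one position

logits = token_embeddings @ final_hidden              # one logit per vocab item
probs = F.softmax(logits, dim=0)                      # probability mass spread over many alternatives
```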
Except even that analogy isn’t quite right, because CLIP’s image vectors aren’t “images,” either—they’re probability distributions over captions. It’s not obvious that distributions-over-captions would be a better type of input for your image generator than distributions-over-images.
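To make that concrete, here’s a rough sketch of the symmetry (random stand-in embeddings, not CLIP’s real ones, and with the logit scale omitted — see below): the same similarity matrix turns each text vector into a distribution over images and each image vector into a distribution over captions.

```python
import torch
import torch.nn.functional as F

image_embeds = F.normalize(torch.randn(8, 512), dim=-1)  # 8 candidate images (unit-normalized)
text_embeds  = F.normalize(torch.randn(8, 512), dim=-1)  # 8 candidate captions (unit-normalized)

sims = text_embeds @ image_embeds.T                # (8, 8) cosine similarities

p_image_given_text = F.softmax(sims, dim=1)        # row i: distribution over images for caption i
p_text_given_image = F.softmax(sims.T, dim=1)      # row j: distribution over captions for image j
```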
Also note that CLIP has a trainable parameter, the “logit scale,” which multiplies the cosine similarities before the softmax. So the overall scale of the cosine similarities is arbitrary (as is their maximum value). CLIP doesn’t “need” the similarities to span any particular range. A similarity value like 0.4 doesn’t mean anything on its own about how close CLIP thinks the match is. That’s determined by (similarity * logit scale).
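Here’s a toy illustration of that last point. The scale values below are hypothetical (in the released OpenAI CLIP, the parameter is stored in log space and applied as `logit_scale.exp()`, which ends up around 100 after training); the takeaway is just that the same raw similarities give very different distributions once the scale is applied.

```python
import torch
import torch.nn.functional as F

sims = torch.tensor([0.40, 0.35, 0.30])   # one caption's cosine similarities to three images

for scale in (1.0, 20.0, 100.0):          # hypothetical logit-scale values
    print(scale, F.softmax(scale * sims, dim=0))

# scale 1.0   -> roughly uniform (~0.35, 0.33, 0.32): the 0.4 barely stands out
# scale 100.0 -> ~0.99 on the first image: the same 0.4 reads as a confident match
```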
This chart showing 10,000 CLIP text embeddings and 10,000 CLIP image embeddings might give insights about the utility of the “prior” neural networks: https://twitter.com/metasemantic/status/1356406256802607112.
Hmm… what moral are you drawing from that result?