CLIP’s task is to invent a notation system that can express the essence of (1) any possible picture, and (2) any possible description of a picture, in only a brief list of maybe 512 or 1024 floating-point numbers.
How many bits is this? 2 KiB / 16 Kibit? Other?
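For reference, the raw sizes work out as follows, assuming the floats are stored at 32-bit precision (an assumption; embeddings are often kept at lower precision in practice):

```python
# Raw size of a CLIP-style embedding, assuming 32-bit floats (an assumption;
# in practice embeddings are often stored at lower precision).
for dims in (512, 1024):
    bits = dims * 32
    print(f"{dims} floats: {bits} bits = {bits // 8} bytes "
          f"= {bits / 8 / 1024:.0f} KiB = {bits / 1024:.0f} Kibit")
# 512 floats:  16384 bits = 2048 bytes = 2 KiB = 16 Kibit
# 1024 floats: 32768 bits = 4096 bytes = 4 KiB = 32 Kibit
```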
Has there been any work in using this or something similar as the basis of a high-compression compression scheme? Compression and decompression speed would be horrendous, but still.
Hm. I wonder what would happen if you trained a version on internet images plus its own reconstruction errors (the residuals between the originals and its outputs). I suspect it might not converge, but it’s still interesting to think about.
Assuming it did converge, take the final trained version and do the following (a rough code sketch follows the steps):
Encode the image.
Take the difference between the image and the decoded output. Encode that.
Take the difference between the image and (the decoded output + the decoded delta). Encode that.
Repeat until you get to the desired bitrate or error rate.
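A rough sketch of that loop in Python, with `encode` and `decode` as stand-ins for the hypothetical trained model (those names are placeholders, not any real API):

```python
import numpy as np

def residual_encode(image, encode, decode, max_codes=8, target_error=1e-3):
    """Iteratively encode an image as a sum of decoded residuals.

    `encode` and `decode` are placeholders for the hypothetical model's
    encoder/decoder; this is a sketch of the scheme described above,
    not a real implementation.
    """
    codes = []
    reconstruction = np.zeros_like(image, dtype=np.float32)
    for _ in range(max_codes):
        residual = image - reconstruction               # what the current output misses
        code = encode(residual)                         # encode the residual itself
        codes.append(code)
        reconstruction = reconstruction + decode(code)  # add the decoded delta back in
        error = float(np.mean((image - reconstruction) ** 2))
        if error <= target_error:                       # stop at the desired error rate
            break
    return codes, reconstruction
```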
Ah, I now realize that I was kind of misleading in the sentence you quoted. (Sorry about that.)
I made it sound like CLIP was doing image compression. And there are ML models that are trained, directly and literally, to do image compression in a more familiar sense: trying to get the pixel values as close to the original as possible. These are the image autoencoders.
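A minimal sketch of what such an autoencoder looks like, in PyTorch; the architecture here is an arbitrary toy choice for illustration, not any particular model’s:

```python
import torch
from torch import nn

class TinyImageAutoencoder(nn.Module):
    """Toy pixel-reconstruction autoencoder: compress to a small code, decode back."""
    def __init__(self, code_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(            # 3 x 256 x 256 -> code_channels x 32 x 32
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, code_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(            # mirror: code back up to 3 x 256 x 256
            nn.ConvTranspose2d(code_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training objective: make the reconstructed pixels match the originals.
model = TinyImageAutoencoder()
images = torch.rand(4, 3, 256, 256)              # stand-in batch
loss = nn.functional.mse_loss(model(images), images)
```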
DALLE-2 doesn’t use an autoencoder, but many other popular image generators do, such as VQGAN and the original DALLE.
So for example, the original DALLE has an autoencoder component which can compress and decompress 256x256 images. Its compressed representation is a 32x32 array, where each cell takes a discrete value from 8192 possible values. That’s 13 bits per cell (if you don’t do any further compression like RLE on it), so you end up with 13 Kibit, or about 1.6 KiB, per image. And then DALLE “writes” in this code the same way GPT writes text.
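Just to spell out the arithmetic implied by those numbers:

```python
import math

cells = 32 * 32                      # 32x32 grid of discrete codes
bits_per_cell = math.log2(8192)      # 8192 = 2**13 possible values -> 13 bits
total_bits = cells * bits_per_cell
print(bits_per_cell, total_bits, total_bits / 8, total_bits / 8 / 1024)
# 13.0  13312.0  1664.0  1.625   -> 13 Kibit, i.e. about 1.6 KiB per image
```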
CLIP, though, is not an autoencoder, because it never has to decompress its representation back into an image. (That’s what unCLIP does, but the CLIP encoding was not made “with the knowledge” that unCLIP would later come along and try to do this; CLIP was never encouraged to make its code especially suitable for this purpose.)
Instead, CLIP is trying to capture . . . “everything about an image that could be relevant to matching it with a caption.”
In some sense this is just image compression, because in principle the caption could mention literally any property of the image. But lossy compression always has to choose something to sacrifice, and CLIP’s priorities are very different from those of the compressors we’re more familiar with. Those compressors care about preserving pixel values, so they care a lot about details. CLIP cares about matching with short (<= ~70 word) captions, so it cares almost entirely about high-level semantic features.
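To make “matching it with a caption” concrete, here is a minimal sketch of CLIP’s contrastive objective, using random stand-in embeddings rather than a real model:

```python
import torch
import torch.nn.functional as F

# Sketch of CLIP's contrastive objective: given a batch of N matched
# (image, caption) pairs, embed both sides into the same space and push
# each image's embedding toward its own caption's embedding and away from
# the other N-1 captions. The embeddings here are random stand-ins.
N, d = 8, 512
image_embeds = F.normalize(torch.randn(N, d), dim=-1)   # unit-norm image vectors
text_embeds = F.normalize(torch.randn(N, d), dim=-1)    # unit-norm caption vectors

logit_scale = torch.tensor(100.0)                        # learned temperature in real CLIP
logits = logit_scale * image_embeds @ text_embeds.T      # N x N cosine-similarity matrix
labels = torch.arange(N)                                 # the i-th caption matches the i-th image
loss = (F.cross_entropy(logits, labels) +                # image -> caption direction
        F.cross_entropy(logits.T, labels)) / 2           # caption -> image direction
```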