How was Dall-E based on self-supervised learning? Weren’t the datasets of images labeled by humans? If not, how does it get from text to image?
The text-to-image part of Dall-E builds on another model called CLIP, which learned to connect images with text (image-to-text). That could be thought of as supervised learning, but the caveat is that the pairs weren’t labeled by humans (in the ML sense); they were extracted from web data. And this is just one part of the Dall-E model. Another is the diffusion process, which is based on recovering an image from noise; that part is unsupervised, since we can simply add noise to images and ask the model to recover the originals.
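To make that last point concrete, here is a minimal sketch of the "add noise, recover it" objective, assuming a toy CNN and a made-up noise schedule (neither is what any DALL-E version actually uses); the point is just that the training signal comes from the images themselves.

```python
import torch
import torch.nn as nn

# Toy denoiser: corrupt an image with Gaussian noise, train the net to predict that noise.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, noisy_images):
        return self.net(noisy_images)  # predicted noise

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.rand(8, 3, 64, 64)        # stand-in for a batch of unlabeled images
noise = torch.randn_like(images)
noise_level = torch.rand(8, 1, 1, 1)     # how much noise to mix in (made-up schedule)
noisy = (1 - noise_level) * images + noise_level * noise

loss = nn.functional.mse_loss(model(noisy), noise)  # no human labels anywhere
loss.backward()
opt.step()
```

Real diffusion models also feed the noise level (timestep) into the network and use a carefully chosen schedule, but the self-supervised structure is the same: the "label" is just the noise you added yourself.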
The ‘labels’ aren’t labels in the sense of being deliberately constructed in a controlled vocabulary to encode a consistent set of concepts/semantics or even be in the same language. In fact, in quite a few of the image-text pairs, the text ‘labels’ will have nothing whatsoever to do with the image—they are a meaningless ID or spammer text or mojibake or any of the infinite varieties of garbage on the Internet, and the model just has to deal with that and learn to ignore those text tokens and try to predict the image tokens purely based on available image tokens. (Note that you don’t need text ‘label’ inputs at all: you could simply train the GPT model to predict solely image tokens based on previous image tokens, in the same way GPT-2 famously predicts text tokens using previous text tokens.)

So they aren’t ‘labels’ in any traditional sense. They’re just more data. You can train in the other direction to create a captioner model if you prefer, or you can drop them entirely to create a unimodal unconditional generative model. Nothing special about them the way labels are special in supervised learning.

DALL-E 1 also relies critically on a VAE (the VAE is what takes the sequence of tokens predicted by GPT, and actually turns them into pixels, and which creates the sequence of real tokens which GPT was trained to predict), which was trained separately in the first phase: the VAE just trains to reconstruct images, pixels through bottleneck back to pixels, no label in sight.
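As a rough illustration of "they're just more data", here is a sketch of the DALL-E 1-style setup: text tokens and dVAE image codes are flattened into one sequence, and a GPT is trained with ordinary next-token prediction over the whole thing. The vocabulary sizes, dimensions, and the tiny transformer below are made-up stand-ins, not OpenAI's actual values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 1000, 512   # toy sizes; the real vocabularies are much larger
VOCAB = TEXT_VOCAB + IMAGE_VOCAB      # shared vocabulary: image codes shifted past the text range

class TinyGPT(nn.Module):
    """Stand-in for the autoregressive transformer; any causal LM works here."""
    def __init__(self, vocab, dim=64, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        n = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(n))
        causal = nn.Transformer.generate_square_subsequent_mask(n)
        return self.head(self.blocks(x, mask=causal))

# A (possibly garbage) caption and the image's dVAE codes, flattened into one stream.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
image_tokens = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB
sequence = torch.cat([text_tokens, image_tokens], dim=1)

model = TinyGPT(VOCAB)
logits = model(sequence[:, :-1])      # predict token t from tokens before t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1))

# Dropping the text prefix entirely turns this into an unconditional image-token model.
```

Nothing in the loss treats the text span any differently from the image codes; swapping the order of the two spans would train a captioner instead.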
Not sure on DALL-E specifically, but I think many image generators use an image classifier as part of their process. The classifier is trained with labels, but the image model itself never sees those labels directly.
I think you take a classifier like CLIP and run it on an image, and it tells you the image is likely “car” and “red”. Then you add noise to the image and provide the noisy image plus those classifications to the image AI, so it tries to find “red” and “car” and add more of them to the details. Then the resulting image is run through CLIP again and its classifications are compared to the original ones to define the loss function.
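For what it's worth, here is a very rough sketch of the loop described in this comment, with clip_model and generator as hypothetical placeholders (the method names follow OpenAI's CLIP package, but nothing here is DALL-E's actual training code; the earlier reply describes DALL-E 1's real setup). It just writes out the "compare the CLIP scores before and after" idea.

```python
import torch
import torch.nn.functional as F

def clip_scores(clip_model, images, label_texts):
    """Cosine similarity between image embeddings and label-text embeddings."""
    img = F.normalize(clip_model.encode_image(images), dim=-1)
    txt = F.normalize(clip_model.encode_text(label_texts), dim=-1)
    return img @ txt.T          # higher = image matches that label better

def guided_reconstruction_loss(clip_model, generator, image, label_texts, noise_level=0.5):
    target = clip_scores(clip_model, image, label_texts)    # e.g. how "red" and "car" the original is
    noisy = image + noise_level * torch.randn_like(image)   # corrupt the image
    recon = generator(noisy, target)                        # denoise, conditioned on those scores
    result = clip_scores(clip_model, recon, label_texts)
    return F.mse_loss(result, target)                       # penalize drifting away from the labels
```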
Just like language models are trained using masked language modelling and next-token prediction, Dall-E was trained on image inpainting (predicting cropped-out parts of an image). This doesn’t require explicit labels, hence it’s self-supervised learning. Note that this is only one part of the training procedure: that part is self-supervised, but it isn’t the whole training process.
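As a sketch of why inpainting needs no explicit labels: hide a patch of each image and train a model to fill it back in, so the "answer" comes from the image itself. The tiny CNN and the single random square mask below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def mask_random_patch(images, patch=16):
    """Zero out one random square patch per batch; return masked input and the mask."""
    masked = images.clone()
    mask = torch.zeros_like(images)
    _, _, h, w = images.shape
    y = torch.randint(0, h - patch, (1,)).item()
    x = torch.randint(0, w - patch, (1,)).item()
    masked[:, :, y:y+patch, x:x+patch] = 0.0
    mask[:, :, y:y+patch, x:x+patch] = 1.0
    return masked, mask

inpainter = nn.Sequential(               # toy stand-in for a real inpainting model
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

images = torch.rand(4, 3, 64, 64)        # unlabeled images are all you need
masked, mask = mask_random_patch(images)
pred = inpainter(masked)
loss = ((pred - images) * mask).pow(2).mean()  # loss only on the hidden region
loss.backward()
```

The only supervision comes from the pixels that were hidden, which is exactly what makes it self-supervised.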