The ‘labels’ aren’t labels in the sense of being deliberately constructed in a controlled vocabulary to encode a consistent set of concepts/semantics, or even of being in the same language. In fact, in quite a few of the image-text pairs, the text ‘labels’ will have nothing whatsoever to do with the image: they are a meaningless ID, or spammer text, or mojibake, or any of the infinite varieties of garbage on the Internet, and the model just has to deal with that, learn to ignore those text tokens, and predict each image token based purely on the preceding image tokens. (Note that you don’t need text ‘label’ inputs at all: you could simply train the GPT model to predict image tokens solely from previous image tokens, in the same way GPT-2 famously predicts text tokens from previous text tokens.) So they aren’t ‘labels’ in any traditional sense. They’re just more data. You can train in the other direction to create a captioner model if you prefer, or you can drop them entirely to create a unimodal unconditional generative model. There is nothing special about them the way labels are special in supervised learning.
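As a concrete illustration (not OpenAI’s actual code), here is a minimal PyTorch sketch of what that training objective looks like, assuming the text has already been tokenized and the image has already been turned into a grid of discrete codes by the separately trained VAE discussed below; all names, vocabulary sizes, and model dimensions here are invented for the example.

```python
# A minimal sketch (not the DALL-E 1 code) of the GPT phase, assuming text is already
# BPE-tokenized and images are already discrete codes from the first-phase VAE.
# All sizes and names are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # hypothetical vocabulary sizes
SEQ_TEXT, SEQ_IMAGE = 64, 256           # hypothetical sequence lengths

class TinyImageGPT(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # One shared vocabulary: text ids in [0, TEXT_VOCAB), image ids offset after them.
        vocab = TEXT_VOCAB + IMAGE_VOCAB
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(SEQ_TEXT + SEQ_IMAGE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):            # tokens: (B, L) of mixed text+image ids
        B, L = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(L, device=tokens.device))
        causal = torch.triu(torch.full((L, L), float('-inf'), device=tokens.device), 1)
        return self.head(self.blocks(x, mask=causal))   # (B, L, vocab) next-token logits

# Fake batch: whatever text came with the image (even spam or mojibake), followed by
# the image's VAE codes shifted into the image half of the shared vocabulary.
text = torch.randint(0, TEXT_VOCAB, (2, SEQ_TEXT))
image = torch.randint(0, IMAGE_VOCAB, (2, SEQ_IMAGE)) + TEXT_VOCAB
tokens = torch.cat([text, image], dim=1)

model = TinyImageGPT()
logits = model(tokens[:, :-1])            # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
# The text 'labels' get no special treatment: they are just earlier tokens in the stream.
# Drop `text` from the concatenation and you get an unconditional image model;
# put `image` first and `text` last and you have trained a captioner instead.
```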
DALL-E 1 also relies critically on a VAE (a discrete VAE, which is what takes the sequence of tokens predicted by GPT and actually turns them into pixels, and which produced the sequence of real image tokens that GPT was trained to predict in the first place), trained separately in the first phase: the VAE just learns to reconstruct images, pixels through a bottleneck back to pixels, no label in sight.
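And a correspondingly minimal sketch of that first phase, again an illustration rather than DALL-E 1’s actual dVAE (which is far larger and trained with an ELBO-style objective rather than the plain reconstruction loss used here): a small discrete autoencoder that maps pixels to a grid of codebook indices and back, with every size invented for the example.

```python
# A minimal sketch in the spirit of DALL-E 1's first phase, not its actual architecture:
# a tiny discrete VAE mapping pixels -> codebook indices -> pixels, trained purely on
# reconstruction. Sizes are invented.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK = 8192        # number of discrete image-token ids the GPT will later predict

class TinyDiscreteVAE(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Encoder: pixels -> logits over the codebook at each spatial position.
        self.enc = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, CODEBOOK, 1),
        )
        self.codebook = nn.Embedding(CODEBOOK, hidden)
        # Decoder: codebook vectors -> pixels.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1),
        )

    def forward(self, pixels):
        logits = self.enc(pixels)                      # (B, CODEBOOK, H/4, W/4)
        # Relaxed discrete sampling so gradients flow through the bottleneck.
        soft_onehot = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=1)
        z = torch.einsum('bkhw,kc->bchw', soft_onehot, self.codebook.weight)
        return self.dec(z), logits.argmax(dim=1)       # reconstruction, token grid

vae = TinyDiscreteVAE()
pixels = torch.rand(2, 3, 32, 32)                      # fake images, no labels anywhere
recon, tokens = vae(pixels)
loss = F.mse_loss(recon, pixels)                       # pixels -> bottleneck -> pixels
loss.backward()
# `tokens` (here a 2x8x8 grid of ids in [0, 8192)) is exactly the kind of sequence the
# GPT in the second phase is trained to predict, and what its samples are decoded from.
```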