I disagree with Wentworth here: faces are easy. That's why they were the first big success of neural net generative modeling. A face is a uniform object, usually oriented the same way, with a reliable number of features: 2 eyes, 1 nose, 1 mouth, and 2 ears. (Whereas with hands, essentially nothing can be counted on: not the orientation, not the number of hands or fingers, nor their relationships. And it's unsurprising that we are so often blind to serious errors in illustrations, like having two left hands or two left feet.) Humans are hyperalert to the existence of faces, but highly forgiving about their realism :-)
This is why face feature detectors were so easy to create many decades ago. And remember ProGAN & StyleGAN: generating faces that people struggled to distinguish from real ones was easy for GANs, and people even rate GAN faces as 'more trustworthy' etc. (Generally, you could only tell by looking at the parts which weren't faces, like the earrings, necklaces, or backgrounds, and by being suspicious if the face was centered & aligned to the 3 key points of the Nvidia dataset.) For a datapoint, I would note that even when we fed cropped images of just hands into TADNE & BigGAN, we never noticed them generating flawless hands, whereas the faces were fine. Or more recently, when SD first came out, people loved the faces… and it was the hands that screwed up the images, not the faces. The faces were usually fine.
If DALL-E 3's faces are almost as bad as its hands (although I haven't spent much time trying to generate photorealistic faces personally, nor have I noticed any shortage of them in the DALL-E subreddits), that probably isn't because faces are as intrinsically hard as hands. They aren't. Nor can it be due to any intrinsic lack of data in online image scrapes: if there is one thing in truly absurd abundance online, it is images of human faces! (Particularly selfies.)
OA has screwed around with the training data for DALL-Es in the past, as with DALL-E 2's inability to generate anime (DALL-E 3 isn't all that good at it either), so I would guess that any face-blindness was introduced as part of the data filtering & processing. Faces are very much PII and politically charged. (Consider how ImageNet was recently bowdlerized to erase all human faces in it! Talk about horses & barns...) OA has been trying to avoid copyrighted images or celebrities or politicians, so it would be logical for them to do things like run face-recognition software and throw out any image containing a face which is loosely near that of any human with, say, a Wikipedia article which has a photo. They might go so far as to try to erase faces, or to drop images with too-large faces.
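To make that guess concrete, here is a minimal sketch of what such a filter could look like, using the open-source `face_recognition` package; the thresholds and the known-person encoding list are made-up placeholders for illustration, not anything OA has described:

```python
# Purely illustrative sketch of the *kind* of filtering speculated about above;
# nothing here is OpenAI's actual pipeline. Uses the open-source `face_recognition`
# package. KNOWN_ENCODINGS is a hypothetical placeholder for precomputed 128-d
# encodings of notable people (e.g. from Wikipedia article photos).
import face_recognition

KNOWN_ENCODINGS = []        # hypothetical: would be loaded from precomputed encodings
MATCH_TOLERANCE = 0.6       # loose match threshold (lower = stricter); made-up value
MAX_FACE_FRACTION = 0.25    # drop images where any face covers >25% of the frame; made-up value

def keep_image(path: str) -> bool:
    """Return False if the image should be dropped from the training set."""
    image = face_recognition.load_image_file(path)
    height, width = image.shape[:2]
    locations = face_recognition.face_locations(image)
    encodings = face_recognition.face_encodings(image, known_face_locations=locations)

    for (top, right, bottom, left), enc in zip(locations, encodings):
        # Drop images dominated by a single face (selfies, portraits).
        face_area = (bottom - top) * (right - left)
        if face_area / (height * width) > MAX_FACE_FRACTION:
            return False
        # Drop images containing a face loosely matching any known person.
        if any(face_recognition.compare_faces(KNOWN_ENCODINGS, enc,
                                              tolerance=MATCH_TOLERANCE)):
            return False
    return True
```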
So here was my initial quick test. I haven't spent much time on this either, but I have seen the same images of faces on subreddits etc. and been very impressed. I think asking for emotions was a harder challenge than just making a believable face/hand, oops.
I really appreciate your descriptions of the distinctive features of faces and of pareidolia, and I do agree that faces are more often better represented than hands; specifically, hands often have the more significant/notable issues (misshapen/missing/overlapped fingers). With faces, by contrast, there's rarely anything as significant as a missing eye, but it can be hard to portray something more specific like an emotion (though the same can be said for hands, e.g. getting DALL-E not to flip me off when I ask for an index finger, haha).
It's also rather difficult to label or prompt the specific hand orientation you'd like, versus, I suppose, an emotion (there are a lot more descriptive words for the orientation of a face than of a hand).
So yeah, faces do work, and regardless of my thoughts on the uncanny valley of some faces+emotions, I actually do think the hands problem (the OP's subject) is mostly a geometric-complexity thing; maybe we see our own hands so much that we are more sensitive to error? But hands don't carry the same meaning for me that faces do (where minute differences convey slightly different emotions, and where we perhaps benefit from being able to read them accurately).
FWIW, I would distinguish between the conditional task of 'generating a hand/face accurately matching a particular natural-language description' and the unconditional task of 'generating hands/faces'. A model can be good at unconditional generation but bad at conditional generation because it, say, has a weak LLM, or uses BPE tokenization, or the description is too long. A model may know perfectly well how to model hands in many positions and yet not handle language well. One interesting recent paper on the sometimes very different levels of capability depending on which direction you're going between modalities: "The Generative AI Paradox: What It Can Create, It May Not Understand", West et al 2023.
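As a rough illustration of that distinction, here is a sketch using public checkpoints via Hugging Face's diffusers library (the checkpoints named are arbitrary examples I picked, not DALL-E or the OP's model): unconditional generation only has to sample plausible images, while conditional generation also routes through a tokenizer & text encoder that may be the real bottleneck.

```python
# Sketch of unconditional vs. text-conditional generation with Hugging Face `diffusers`.
# The checkpoints named here are arbitrary public examples, not the models under discussion.
import torch
from diffusers import DDPMPipeline, StableDiffusionPipeline

# Unconditional: no text involved; the model just samples a plausible face.
uncond = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
face = uncond(num_inference_steps=50).images[0]

# Conditional: output quality now also depends on the tokenizer/text encoder
# understanding the prompt, not just on the image model's knowledge of hands.
cond = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
hand = cond("a photorealistic left hand, palm toward the camera, fingers spread").images[0]
```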
Yeah, agreed. See this example (using the same image generation model as in the OP) if anyone is still not convinced.