So here was my initial quick test. I haven’t spent much time on this either, but I have seen the same images of faces on subreddits etc. and been v impressed. I think asking for emotions was a harder challenge vs. just making a believable face/hand, oops.
I really appreciate your descriptions of the distinctive features of faces and of pareidolia, and I do agree that faces are usually better represented than hands. Specifically, hands often have the more significant/notable issues (misshapen/missing/overlapped fingers). With faces there’s nothing as significant as a missing eye, but it can be hard to portray something more specific like an emotion (though the same can be said for, e.g., getting DALL-E not to flip me off when I ask for an index finger haha).
It’s also rather difficult to label or prompt for the specific hand orientation you’d like, versus, I suppose, an emotion (there are a lot more descriptive words for the orientation of a face than of a hand).
So yeah, faces do work, and regardless of my thoughts on the uncanny valley of some faces+emotions, I actually do think hands (the OP subject) are mostly a geometric complexity thing. Maybe we see our own hands so much that we are more sensitive to errors? But for me they don’t carry the same meaning as faces (minute differences signal slightly different emotions, and we perhaps benefit from being able to tell them apart accurately).
FWIW, I would distinguish between the conditional task of ‘generating a hand/face accurately matching a particular natural language description’ and the unconditional task of ‘generating hands/faces’. A model can be good at unconditional generation but bad at conditional generation because it, say, has a weak LLM, or uses BPE tokenization, or the description is too long. A model may know perfectly well how to model hands in many positions but just not handle language well. One interesting recent paper on how capabilities can differ sharply depending on which direction you’re going between modalities: “The Generative AI Paradox: ‘What It Can Create, It May Not Understand’”, West et al 2023.