Yeah, all that stuff about priors and needing new architecture paradigms is great and all that, but maybe you should show this for literally anything but DALL-E 2 (ie. unCLIP), like DALL-E 2's own GLIDE, first, before you start talking about “steps forward” or it's just “statistical pattern matching”… Good grief.
I wasn’t sure how to interpret this part in relation to my post here. I didn’t find the “steps forward” or “statistical pattern matching” bits in either the tweets you linked, Ullmann’s paper, or in my own post. It seems like you are inferring that I’m throwing shade on DALL-E, or trying to use DALL-E’s inability to count as a “point of evidence” against the hypothesis that AGI could develop from a contemporary AI paradigm? That’s not my intention.
Instead, I am trying to figure out the limits of this particular AI system. I am also trying to use it as a sort of case study in how visual thinking might work in humans.
I appreciate your argument that we can only use an AI system like DALL-E as a reference for the human mind insofar as we think it is constructed in a fundamentally similar way. You are saying that DALL-E’s underpinnings are not like the human mind’s, and that it’s not drawing on AI architectures that mimic it, and hence that we cannot learn about the human mind from studying DALL-E.
For context, I’d been wanting to do this experiment since DALL-E was released, and posted it the same day I got my invitation to start using DALL-E. So this isn’t a deeply-considered point about AI (I’m not in CS/AI safety) - it’s a speculative piece. I appreciate the error correction you are doing.
That said, I also did want to note that your tone feels somewhat flamey/belittling here, as well as seeming to make some incorrect assumptions about my beliefs about AI and making up quotes that do not actually belong to me. I would prefer if you’d avoid these behaviors when interacting with me in the future. Thank you.
I am also trying to use it as a sort of case study in how visual thinking might work in humans.
There is no reason to think that studying unCLIP pathologies tells you anything about how human visual perception works in the first place, and it is actively misleading to focus on it when you know how it works, why it fails in the way it does, that it was chosen for pragmatic reasons (diversifying samples for a SaaS business) completely unrelated to human psychology, and that more powerful models have already been shown to exhibit much better counting, text generation, and visual reasoning. You might as well try to learn how intelligence works from studying GCC compile errors with -Wall -Werror turned on. It is strange and a waste of time to describe it this way, and anyone who reads this and then reads
This pushes me somewhat toward a concept of human psychology in which our brains are composed of a large assemblage of specialized training modules for a variety of tasks. These training modules are interconnected. For example, those of us who received an education in arithmetic have it available for use in a wide variety of tasks. Learning “when to apply arithmetic” is probably also a specialized module.
is worse off. None of that is shown by these results. I don’t know how you get from ‘unCLIP breaks X’ to ‘human brains may be modularized’ in the first place, the other issues with trying to learn anything from unCLIP aside.
This suggests to me that advanced AI will come from designing systems that can learn new, discrete tasks (addition, handling wine glasses, using tone of voice to predict what words a person will say). It will then need to be able to figure out when and how to combine them in particular contexts in order to achieve results in the world. My guess is that children do this by open-ended copying: trying to figure out some aspect of adult behavior that's within their capabilities, and then copying it with greater fidelity, using adult feedback to guide their behavior, until they succeed and the adult signals that they're “good enough.”
? So, these systems, like GPT-3, T5, Parti, Imagen, DALL-E's GLIDE, etc., which were all trained via unsupervised learning on old, non-discrete tasks (just dumps of Internet-scraped data), and which successfully learn to do things like count in their modalities much better than DALL-E 2, will need to be trained on ‘new discrete tasks’ in order to learn to do the things that they already do better than DALL-E 2?
As for your discussion about how this is evidence that one should totally redesign the educational system around Direct Instruction: well, I am sympathetic, but again, this doesn't provide any evidence for that, and if it did, then it would backfire on you, because by conservation of evidence, the fact that all the other systems do what DALL-E 2 doesn't must then be far more evidence in the other direction, that one should redesign the educational system the opposite way, mixing tasks together as much as possible and de-modularizing everything.
I didn’t find the “steps forward” or “statistical pattern matching” bits in either the tweets you linked, Ullmann’s paper, or in my own post.
‘Steps forward’ is in Ullmann’s last tweet:
In the paper, we discuss both successes and failures, and offer some steps forward.
Personally, I would advise them to step sideways, to studying existing systems which don't use unCLIP and from which one might actually learn something other than “boy, unCLIP sure does suck at the things it sucks at”. Ullmann et al. in the paper's discussion further go on to not mention unCLIP (merely a hilarious handwave about ‘technical minutiae’ - yeah, you know, those technical minutiae which make DALL-E 2 DALL-E 2 rather than just GLIDE), compare DALL-E 2 unfavorably to infants, talk about how all modern algorithms are fundamentally wrong because they lack a two-stream architecture comparable to that of humans, and sweepingly say
DALL-E 2 and other current image generation models are things of wonder, but they also leave us wondering what exactly they have learned, and how they fit into the larger search for artificial intelligence.
(Yes, I too wonder what ‘they have learned’, Ullmann et al, considering that you didn’t study what has been learned by any of the ‘other current image generation models’ and yet consistently imply that all your DALL-E 2 specific results apply equally to them when that’s been known from the start to be false.)
The phrase “statistical pattern matching” is in the second-most-liked reply and is echoed by others.
You are saying that DALL-E’s underpinnings are not like the human mind’s, and that it’s not drawing on AI architectures that mimic it, and hence that we cannot learn about the human mind from studying DALL-E.
No. I am saying DALL-E 2 is deliberately broken, in known ways, to get a particular tradeoff. We can potentially learn a lot about the human brain even from AI systems which were not explicitly designed to imitate it or to be biologically plausible (and it is fairly common right now to use GPT or CNNs or ViTs to directly study the human brain's language or vision). We cannot learn anything from examining the performance on particular tasks of systems broken on those specific tasks. OA deliberately introduced unCLIP to sacrifice precision of the text-input embedding, including things like relations and numbers, to improve the vibe of samples; therefore, those are the things which are least worth studying, and most misleading, and yet they are what you and Ullmann insist on studying while ignoring that.
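To make that tradeoff concrete, here is a minimal sketch of where the two pipelines differ. Every function below (tokenize, encode_text_clip, prior_sample, diffusion_decode) is a made-up stub for illustration, not OpenAI's actual code or API; the only point is the placement of the lossy stages between prompt and decoder, not the internals of any model.

```python
# Purely schematic contrast between GLIDE-style conditioning and DALL-E 2's
# unCLIP pipeline. All functions are stand-in stubs so the file runs as-is.
from typing import List


def tokenize(prompt: str) -> List[str]:
    # stand-in for a real tokenizer
    return prompt.split()


def encode_text_clip(prompt: str) -> List[float]:
    # stand-in for CLIP's text encoder: the whole prompt is collapsed into one
    # fixed-size vector, which is where fine-grained detail (counts, relations,
    # spelled-out words) starts to get squeezed out
    return [float(len(prompt))]


def prior_sample(text_emb: List[float]) -> List[float]:
    # stand-in for the prior: maps the text embedding to a *plausible* CLIP
    # image embedding, trading faithfulness to the prompt for sample diversity
    return [v * 0.9 for v in text_emb]


def diffusion_decode(condition) -> str:
    # stand-in for the diffusion decoder
    return f"<image conditioned on {condition!r}>"


def generate_glide(prompt: str) -> str:
    # GLIDE-style: the decoder attends to the text tokens directly
    return diffusion_decode(tokenize(prompt))


def generate_unclip(prompt: str) -> str:
    # unCLIP-style (DALL-E 2): two lossy stages sit between prompt and decoder
    return diffusion_decode(prior_sample(encode_text_clip(prompt)))


if __name__ == "__main__":
    print(generate_glide("three red cubes on a blue sphere"))
    print(generate_unclip("three red cubes on a blue sphere"))
```

The stubs obviously compute nothing real; the sketch only shows that unCLIP routes the prompt through a text-embedding bottleneck and a sampled image embedding before decoding, which is exactly the design choice (the supposed ‘technical minutiae’) responsible for the counting and relation failures being studied.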