I am also trying to use it as a sort of case study in how visual thinking might work in humans.
There is no reason to think that studying unCLIP pathologies tells you anything about how human visual perception works in the first place, and it is actively misleading to focus on it when you know how it works, why it fails in the way it does, that it was chosen for pragmatic reasons (diversifying samples for a SaaS business) completely unrelated to human psychology, and that more powerful models have already been shown to exhibit much better counting, text generation, and visual reasoning. You might as well try to learn how intelligence works from studying GCC compile errors with -Wall -Werror turned on. It is strange and a waste of time to describe it this way, and anyone who reads this and then reads
This pushes me somewhat toward a concept of human psychology in which our brains are composed of a large assemblage of specialized training modules for a variety of tasks. These training modules are interconnected. For example, those of us who received an education in arithmetic have it available for use in a wide variety of tasks. Learning “when to apply arithmetic” is probably also a specialized module.
is worse off. None of that is shown by these results. Setting aside the other issues with trying to learn anything from unCLIP, I don’t know how you get from ‘unCLIP breaks X’ to ‘human brains may be modularized’ in the first place.
This suggests to me that advanced AI will come from designing systems that can learn new, discrete tasks (addition, handling wine glasses, using tone of voice to predict what words a person will say). It will then need to be able to figure out when and how to combine then in particular contexts in order to achieve results in the world. My guess is that children do this by open-ended copying—trying to figure out some aspect of adult behavior that’s within their capabilities, and them copying it with greater fidelity, using adult feedback to guide their behavior, until they succeed and the adult signals that they’re “good enough.”
? So, these systems, like GPT-3, T5, Parti, Imagen, DALL-E’s GLIDE etc, which were all trained with unsupervised learning on old non-discrete tasks—just dumps of Internet-scraped data—and which successfully learn to do things like count in their modalities much better than DALL-E 2, will need to be trained on ‘new discrete tasks’ in order to learn to do the things that they already do better than DALL-E 2?
As for your discussion about how this is evidence that one should totally redesign the educational system around Direct Instruction: well, I am sympathetic, but again, this doesn’t provide any evidence for that. And if it did, it would backfire on you, because by conservation of evidence, the fact that all the other systems do what DALL-E 2 doesn’t must then be far more evidence in the other direction, that one should redesign the educational system the opposite way, to mix together tasks as much as possible and de-modularize everything.
I didn’t find the “steps forward” or “statistical pattern matching” bits in either the tweets you linked, Ullmann’s paper, or in my own post.
‘Steps forward’ is in Ullmann’s last tweet:
In the paper, we discuss both successes and failures, and offer some steps forward.
Personally, I would advise them to step sideways, to studying existing systems which don’t use unCLIP and from which one might actually learn something other than “boy, unCLIP sure does suck at the things it sucks at”. In the paper’s discussion, Ullmann et al further go on to not mention unCLIP at all (merely a hilarious handwave about ‘technical minutiae’ - yeah, you know, the technical minutiae which make DALL-E 2 DALL-E 2 rather than just GLIDE), compare DALL-E 2 unfavorably to infants, talk about how all modern algorithms are fundamentally wrong because they lack a two-stream architecture comparable to the human one, and sweepingly say
DALL-E 2 and other current image generation models are things of wonder, but they also leave us wondering what exactly they have learned, and how they fit into the larger search for artificial intelligence.
(Yes, I too wonder what ‘they have learned’, Ullmann et al, considering that you didn’t study what has been learned by any of the ‘other current image generation models’, and yet consistently imply that all of your DALL-E 2-specific results apply equally to them, when that has been known from the start to be false.)
The phrase “statistical pattern matching” is in the second-most-liked reply and is echoed by others.
You are saying that DALL-E’s underpinnings are not like the human mind’s, and that it’s not drawing on AI architectures that mimic it, and hence that we cannot learn about the human mind from studying DALL-E.
No. I am saying that DALL-E 2 is deliberately broken, in known ways, to get a particular tradeoff. We can potentially learn a lot about the human brain even from AI systems which were not explicitly designed to imitate it and be as biologically plausible as possible (and it is fairly common right now to use GPT or CNNs or ViTs to directly study the human brain’s language or vision). We cannot learn anything from examining the performance on particular tasks of systems broken on those specific tasks. OA deliberately introduced unCLIP to sacrifice precision of the text input embedding, including things like relations and numbers, in order to improve the vibe of samples; therefore, those are the things which are least worth studying and most misleading to study, and yet they are exactly what you and Ullmann insist on studying while insisting on ignoring that.
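To make that tradeoff concrete, here is a minimal sketch of the conditioning bottleneck, assuming a toy setup: the stub encoders, the dimension, and the pooling/prior stand-ins below are illustrative inventions, not OpenAI’s actual code; only the shapes of the two conditioning signals are the point.

```python
# Schematic sketch of the unCLIP bottleneck described above. Not OpenAI's code:
# the stub encoders, toy dimension D, and placeholder transforms are assumptions.
import numpy as np

D = 8  # toy embedding width; real CLIP embeddings are far wider

def per_token_text_encoding(prompt: str) -> np.ndarray:
    """Stand-in for GLIDE-style conditioning: one vector per token, so word
    order, counts, and relations ('red cube ON a blue cube') reach the decoder."""
    tokens = prompt.split()
    rng = np.random.default_rng(len(tokens))   # deterministic toy 'encoder'
    return rng.normal(size=(len(tokens), D))   # shape: (n_tokens, D)

def clip_text_embedding(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's single pooled text embedding."""
    return per_token_text_encoding(prompt).mean(axis=0)  # shape: (D,)

def unclip_prior(text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the unCLIP prior mapping the pooled text embedding to a
    single CLIP image embedding: the one vector through which the DALL-E 2
    decoder sees the prompt, and where bindings, counts, and spelled-out text
    get squeezed out."""
    return text_emb * 0.9 + 0.1                # placeholder transform

prompt = "a red cube on top of a blue cube"
glide_cond = per_token_text_encoding(prompt)              # one row per token
unclip_cond = unclip_prior(clip_text_embedding(prompt))   # a single pooled vector

print("GLIDE-style conditioning shape:", glide_cond.shape)
print("unCLIP conditioning shape:     ", unclip_cond.shape)
```

(This deliberately ignores details such as the DALL-E 2 decoder also being able to attend to the caption; the sketch only illustrates why pooling everything into one embedding trades away exactly the relation/count/text precision being tested.)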