On possible cross-fertilization between AI and neuroscience [Creativity]
Cross-posted from New Savanna.
MIT Center for Minds, Brains, and Machines (CBMM), a panel discussion: CBMM10 - A Symposium on Intelligence: Brains, Minds, and Machines.
On which critical problems should Neuroscience, Cognitive Science, and Computer Science focus now? Do we need to understand fundamental principles of learning—in the sense of theoretical understanding like in physics—and apply this understanding to real natural and artificial systems? Similar questions concern neuroscience and human intelligence from the society, industry and science point of view.
Panel Chair: T. Poggio
Panelists: D. Hassabis, G. Hinton, P. Perona, D. Siegel, I. Sutskever
Quick Comments
1.) I’m a bit annoyed that Hassabis is giving neuroscience credit for the idea of episodic memory. As far as I know, the term was coined by a cognitive psychologist named Endel Tulving in the early 1970s, who stood it in opposition to semantic memory. That distinction was all over the place in the cognitive sciences in the 1970s and it’s second nature to me. When ChatGPT places a number of events in order to make a story, that’s episodic memory.
2.) Rather than theory, I like to think of what I call speculative engineering. I coined the phrase in the preface to my book about music (Beethoven’s Anvil), where I said:
Engineering is about design and construction: How does the nervous system design and construct music? It is speculative because it must be. The purpose of speculation is to clarify thought. If the speculation itself is clear and well-founded, it will achieve its end even when it is wrong, and many of my speculations must surely be wrong. If I then ask you to consider them, not knowing how to separate the prescient speculations from the mistaken ones, it is because I am confident that we have the means to sort these matters out empirically. My aim is to produce ideas interesting, significant, and clear enough to justify the hard work of investigation, both through empirical studies and through computer simulation.
3.) On Chomsky (Hinton & Hassabis): Yes, Chomsky is fundamentally wrong about language. Language is primarily a tool for conveying meaning from one person to another and only derivatively a tool for thinking. And he’s wrong to hold that, because LLMs can learn any language, they are useless for the scientific study of language. Another problem with Chomsky’s thinking is that he has no interest in process, which is in the realm of performance, not competence.
Let us assume for the sake of argument that the introduction of a single token into the output stream requires one primitive operation of the virtual system being emulated by an LLM. By that I mean that there is no logical operation within the process, no AND or OR, no shift of control; all that’s happening is one gigantic calculation involving all the parameters in the system. That means that the number of primitive operations required to produce a given output is equal to the number of tokens in that output. I suggest that that places severe constraints on the organization of the LLM’s associative memory.
Contrast that with what happens in a classical symbolic system. Let us posit that each time a word (not quite the same as a token in an LLM, but the difference is of no consequence) is emitted, that itself requires a single primitive operation in the classical system. Beyond that, however, a classical system has to execute numerous symbolic operations in order to arrive at each word. Regardless of just how those operations resolve into primitive symbolic operations, the number has to be larger, perhaps considerably larger, than the number of primitive operations an LLM requires. I suggest that this process places fewer constraints on the organization of a symbolic memory system.
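To make the comparison concrete, here is a toy tally in Python. It is only a sketch of the counting argument above: the function names and the figure of 25 symbolic steps per word are made-up assumptions for illustration, not measurements of any real system.

```python
def llm_primitive_ops(num_tokens: int) -> int:
    """One primitive operation per emitted token (the assumption made above)."""
    return num_tokens


def symbolic_primitive_ops(num_words: int, steps_per_word: int) -> int:
    """One emit operation per word, plus the symbolic steps needed to arrive at it.

    steps_per_word is a hypothetical average; the argument only needs it to be > 0.
    """
    return num_words * (1 + steps_per_word)


if __name__ == "__main__":
    n = 100  # length of the output, in tokens or words
    print(llm_primitive_ops(n))           # 100
    print(symbolic_primitive_ops(n, 25))  # 2600
```

Whatever the real value of steps_per_word turns out to be, the symbolic count exceeds the LLM count as soon as it is at least 1, which is all the argument requires.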
At this point I’ve reached 45:11 in the video, but I have to stop and think. Perhaps I’ll offer some more comments later.
LATER: Creativity
4.) Near the end (01:20:00 or so) the question of creativity comes up. Hassabis says AIs aren’t there yet. Hinton brings up analogy, pointing out that, with all the vast knowledge LLMs have ingested, they’ve got opportunities for coming up with analogy after analogy after analogy. I’ve got experience with ChatGPT that’s directly relevant to those issues, analogy and creativity.
One of the first things I did once I started playing with ChatGPT was have it undertake a Girardian interpretation of Steven Spielberg’s Jaws. To do that it has to determine whether or not there is an analogy between events in the film and the phenomena that Girard theorizes about. It did that fairly well. So I wrote that up and published it in 3 Quarks Daily, Conversing with ChatGPT about Jaws, Mimetic Desire, and Sacrifice. Near the end I remarked:
I was impressed with ChatGPT’s capabilities. Interacting with it was fun, so much fun that at times I was giggling and laughing out loud. But whether or not this is a harbinger of the much-touted Artificial General Intelligence (AGI), much less a warning of impending doom at the hands of an All-Knowing, All-Powerful Superintelligence – are you kidding? Nothing like that, nothing at all. A useful assistant for a variety of tasks, I can see that, and relatively soon. Maybe even a bit more than an assistant. But that’s as far as I can see.
We can compare what ChatGPT did in response to my prompting with what I did unprompted, freely and of my own volition. There’s nothing in its replies that approaches my article, Shark City Sacrifice, nor the various blog posts I wrote about the film. That’s important. I was not expecting, much less hoping, that ChatGPT would act like a full-on AGI. No, I have something else in mind.
What’s got my attention is what I had to do to write the article. In the first place I had to watch the film and make sense of it. As I’ve already indicated, we have no artificial system with the required capabilities, visual, auditory, and cognitive. I watched the film several times in order to be sure of the details. I also consulted scripts I found on the internet. I also watched Jaws 2 more than once. Why did I do that? There’s curiosity and general principle. But there’s also the fact that the Wikipedia article for Jaws asserted that none of the three sequels were as good as the original. I had to watch the others to see for myself – though I was unable to finish watching either of the last two.
At this point I was on the prowl, though I hadn’t yet decided to write anything.
I now asked myself why the original was so much better than the first sequel, which was at least watchable. I came up with two things: 1) the original film was well-organized and tight while the sequel sprawled, and 2) Quint, there was no character in the sequel comparable to Quint.
Why did Quint die? Oh, I know what happened in the film; that’s not what I was asking. The question was an aesthetic one. As long as the shark was killed the town would be saved. That necessity did not entail Quint’s death, nor anyone else’s. If Quint hadn’t died, how would the ending have felt? What if it had been Brody or Hooper?
It was while thinking about such questions that it hit me: sacrifice! Girard! How is it that Girard’s ideas came to me? I wasn’t looking for them, not in any direct sense. I was just asking counter-factual questions about the film.
Whatever.
Once Girard was on my mind I smelled blood, that is, the possibility of writing an interesting article. I started reading, making notes, and corresponding with my friend, David Porush, who knows Girard’s thinking much better than I do. Can I make a nice tight article? That’s what I was trying to figure out. It was only after I’d made some preliminary posts, drafted some text, and run it by David that I decided to go for it. The article turned out well enough that I decided to publish it. And so I did.
It’s one thing to figure out whether or not such and such a text/film exhibits such and such pattern when you are given the text and the pattern. That’s what ChatGPT did. Since I had already made the connection between Girard and Jaws it didn’t have to do that. I was just prompting ChatGPT to verify the connection, which it did (albeit in a weak way). That’s the kind of task we set for high school students and lower division college students. […]
I don’t really think that ChatGPT is operating at a high school level in this context. Nor do I NOT think that. I don’t know quite what to think. And I’m happy with that.
The deeper point is that there is a world of difference between what ChatGPT was doing when I piloted it into Jaws and Girard and what I eventually did when I watched Jaws and decided to look around to see what I could see. How is it that, in that process, Girard came to me? I wasn’t looking for Girard. I wasn’t looking for anything in particular. How do we teach a computer to look around for nothing in particular and come up with something interesting?
These observations are informal and are only about a single example. Given those limitations it’s difficult to imagine a generalization. But I didn’t hear anything from those experts that was comparably rich. Hinton gave one example of an analogy that he posed to GPT-4 (01:18:30): “What has a compost heap got in common with an atom bomb?” It got the answer he was looking for, chain reaction, albeit at different energy levels and different rates. That’s interesting. Why wasn’t the panel ready with 20 such examples among them?
Do they not have more such examples from their own work? Don’t they think about their own work process, all the starts and stops, the wandering around, the dead ends and false starts, the open-ended exploration, that came before final success? And even then, no success is final, but only provisional pending further investigation. Can they not see the difference between what they do and what their machines do? Do they think all the need for exploration will just vanish in the face of machine superintelligence? Do they really believe that the universe is that small?
STILL LATER: Hinton and Hassabis on analogies
Hinton continues with analogies and Hassabis weighs in:
1:18:28 – GEOFFREY HINTON: We know that being able to see analogies, especially remote analogies, is a very important aspect of intelligence. So I asked GPT-4, what has a compost heap got in common with an atom bomb? And GPT-4 nailed it, most people just say nothing. DEMIS HASSABIS: What did it say …
1:19:09 – And the thing is, it knows about 10000 times as much as a person, so it’s going to be able to see all sorts of analogies that we can’t see. DEMIS HASSABIS: Yeah. So my feeling is on this, and starting with things like AlphaGo and obviously today’s systems like Bard and GPT, they’re clearly creative in …
1:20:18 – New pieces of music, new pieces of poetry, and spotting analogies between things you couldn’t spot as a human. And I think these systems can definitely do that. But then there’s the third level which I call like invention or out-of-the-box thinking, and that would be the equivalent of AlphaGo inventing Go.
Well, yeah, sure, GPT-4 has all this stuff in its model, way more topics than any one human. But where’s GPT-4 going to “stand” so it can “look over” all that stuff and spot the analogies? That requires some kind of procedure. What is it?
For example, it might partition all that knowledge into discrete bits and then set up a 2D matrix with a column and a row for each discrete chunk of knowledge. Then it can move systematically through the matrix, checking each cell to see whether or not the pair in that cell is a useful analogy. What kind of tests does it apply to make that determination? I can imagine there might be a test or tests that allow a quick and dirty rejection of many candidates. But for those that remain, what can you do but see if any useful knowledge follows from trying out the analogy? How long will that determination take? And so forth.
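A back-of-the-envelope sketch in Python shows how quickly such a matrix grows. The figures are hypothetical assumptions chosen only for illustration (one million chunks, one millisecond for the quick-and-dirty test); nothing here is a claim about how GPT-4 actually stores or partitions knowledge.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def candidate_pairs(num_chunks: int) -> int:
    """Distinct unordered pairs in an n-by-n matrix, ignoring the diagonal."""
    return num_chunks * (num_chunks - 1) // 2


def years_to_scan(num_chunks: int, seconds_per_test: float) -> float:
    """Time to apply one quick test to every candidate pair, in years."""
    return candidate_pairs(num_chunks) * seconds_per_test / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: one million chunks, one millisecond per quick test.
    print(candidate_pairs(1_000_000))                # 499999500000
    print(round(years_to_scan(1_000_000, 1e-3), 1))  # 15.8
```

And that covers only the quick rejection pass; the candidates that survive still need the slow test of whether anything useful follows from them.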
That’s absurd on the face of it. What else is there? I just explained what I went through to come up with an analogy between Jaws and Girard. But that’s just my behavior, not the mental process that’s behind the behavior. I have no trouble imagining that, in principle, having these machines will help speed up the process, but in the end I think we’re going to end up with a community of human investigators communicating with one another while they make sense of the world. The idea, which, judging from remarks he’s made elsewhere, Hinton seems to hold, that one of these days we’ll have a machine that takes humans out of the process altogether, that’s an idle fantasy.
That’s not my understanding. To me he is giving neuroscience credit for the ideas that made it possible to implement a working memory in LLMs. I guess he didn’t want to use words like thalamocortical, but from a neuroscience point of view transformers indeed look inspired by the isocortex, e.g. by the idea that a general distributed architecture can process any kind of information relevant to a human cognitive architecture.
Yeah, he’s talking about neuroscience. I get that. But “episodic memory” is a term of art and the idea behind it didn’t come from neuroscience. It’s quite possible that he just doesn’t know the intellectual history and is taking “episodic memory” as a term that’s in general use, which it is. But he’s also making claims about intellectual history.
Because he’s using that term in that context, I don’t know just what claim he’s making. Is he also (implicitly) claiming that neuroscience is the source of the idea? If he thinks that, then he’s wrong. If he’s just saying that he got the idea from neuroscience, OK.
But, the idea of a “general distributed architecture” doesn’t have anything to do with the idea of episodic memory. They are orthogonal notions, if you will.
Your point is « Good AIs should have a working memory, a concept that comes from psychology ».
DH point is « Good AIs should have a working memory, and the way to implement it was based on concepts taken from neuroscience ».
Those are indeed orthogonal notions, if you will.
I did a little checking. It’s complicated. In 2017 Hassabis published an article entitled “Neuroscience-Inspired Artificial Intelligence” in which he attributes the concept of episodic memory to a review article that Endel Tulving published in 2002, “EPISODIC MEMORY: From Mind to Brain.” That article has quite a bit to say about the brain. In the 2002 article Tulving dates the concept to an article he published in 1972. That article is entitled “Episodic and Semantic Memory.” As far as I know, while there are precedents (everything can be fobbed off on Plato if you’ve a mind to do it), that’s where the notion of episodic memory enters into modern discussions.
Why do I care about this kind of detail? First, I’m a scholar and it’s my business to care about these things. Second, a lot of people in contemporary AI and ML are dismissive of symbolic AI from the 1950s through the 1980s and beyond. While Tulving was not an AI researcher, he was very much in the cognitive science movement, which included philosophy, psychology, linguistics, and AI (later on, neuroscientists would join in). I have no idea whether or not Hassabis is himself dismissive of that work, but many are. It’s hypocritical to write off the body of work while using some of the ideas. These problems are too deep and difficult to write off whole bodies of research in part because they happened before you were born – FWIW Hassabis was born in 1976.
Well that’s a problem, don’t you think?
Yes. As a cognitive neuroscientist myself, I can confirm that many within my generation tend to dismiss symbolic approaches. We were students during a winter that many of us thought was caused by the over-promising and under-delivering of the symbolic approach, with Minsky as the main reason for the slow start of neural networks. I bet you have a different perspective. What are your three best points for changing the view of my generation?
I’ll get back to you tomorrow. I don’t think it’s a matter of going back to the old ways. ANNs are marvelous; they’re here to stay. The issue is one of integrating some symbolic ideas. It’s not at all clear how that’s to be done. If you wish, take a look at this blog post: Miriam Yevick on why both symbols and networks are necessary for artificial minds.
Fascinating paper! I wonder how much they would agree that holography means sparse tensors and convolution, or that intuitive versus reflexive thinking basically amounts to visuo-spatial versus phonological loop. Can’t wait to hear which other ideas you’d like to import from this line of thought.
Miriam Lipshutz Yevick was born in 1924 and died in 2018, so we can’t ask her these questions. She fled Europe with her family in 1940 for the same reason many Jews fled Europe and ended up in Hoboken, NJ. Seven years later she got a PhD in math from MIT; she was only the 5th woman to get that degree from MIT. But, as both a woman and a Jew, she had almost no chance of an academic post in 1947. She eventually got an academic gig, but it was at a college oriented toward adult education. Still, she managed to do some remarkable mathematical work.
The two papers I mention in that blog post were written in the mid-1970s. That was the height of classic symbolic AI and the cognitive science movement more generally. Newell and Simon got their Turing Award in 1975, the same year Yevick wrote that remarkable paper on holographic logic, which deserves to be more widely known. She wrote as a mathematician interested in holography (an interest she developed while corresponding with physicist David Bohm in the 1950s), not as a cognitive scientist. Of course, in arguing for holography as a model for (one kind of) thought, she was working against the tide. Very few were thinking in such terms at that time. Rosenblatt’s work was in the past, and had been squashed by Minsky and Papert, as you’ve noted. The West Coast connectionist work didn’t jump off until the mid-1980s.
So there really wasn’t anyone in the cognitive science community at the time to investigate the line of thinking she initiated. While she wasn’t thinking about real computation, you know, something you actually do on computers, she thought abstractly in computational terms, as Turing and others did (though Turing also worked with actual computers). It seems to me that her contribution was to examine the relationship between a computational regime and the objects over which it was asked to compute. She’s quite explicit about that. If the object tends toward geometrical simplicity – she was using identification of visual objects as her domain – then a conventional, sequential, computational regime was most effective. That’s what cognitive science was all about at the time. If the object tends toward geometrical complexity then a different regime was called for, what she called holographic or Fourier logic. I don’t know about sparse tensors, but convolution, yes.
Later on, in the 1980s, as you may know, Hans Moravec would talk about a paradox (which became named after him). In the early days of AI, researchers worked on abstract domains, like chess and theorem proving, domains that take a high level cognitive ability. Things went pretty well, though the extravagant predictions had yet to pan out. When they turned toward vision and language in the late 1960s and into the 70s and 80s, things fell apart. Those were things that young kids could do. The paradox, then, was that AI was most effective at cognitively difficult things, and least effective with cognitively simple things.
The issue was in fact becoming visible in the 1970s. I read about it in David Marr, and he died in 1980. Had it been explicitly theorized when Yevick wrote? I don’t know. But she had an answer to the paradox. The computational regime favored by AI and the cognitive sciences at the time simply was not well-suited to complex visual objects, though they presented no problems to 2-year-olds, or to language, with all those vaguely defined terms anchored in physically complex phenomena. They needed a different computational regime, and eventually we got one, though not really until GPUs were exploited.
More later, perhaps.
Thanks, I didn’t know this perspective on the history of our science. The stories I most heard were indeed more about the HH model, Hebb rule, Kohonen map, RL, and then connectionism became deep learning.
…but neural networks did refute that idea! I feel like I’m missing something here, especially since you then mention GPU. Was sequential a typo?
How so?
When I hear « conventional, sequential, computational regime », my understanding is « the way everyone was trying before parallel computation revolutionized computer vision ». What’s your definition so that using GPU feels sequential?
Oh, I didn’t mean to imply that using GPUs was sequential, not at all. What I meant was that the connectionist alternative didn’t really take off until GPUs were used, making massive parallelism possible.
Going back to Yevick, in her 1975 paper she often refers to holographic logic as ‘one-shot’ logic, meaning that the whole identification process takes place in one operation, the illumination of the hologram (i.e. the holographic memory store) by the reference beam. The whole memory ‘surface’ is searched in one unitary operation.
In an LLM, I’m thinking of the generation of a single token as such a unitary or primitive process. That is to say, I think of the LLM as a “virtual machine” (I first saw the phrase in a blog post by Chris Olah) that is running an associative memory machine. Physically, yes, we’ve got a massive computation involving every parameter and (I’m assuming) there’s a combination of massive parallel and sequential operations taking place in the GPUs. Complete physical parallelism isn’t possible (yet). But there are no logical operations taking place in this virtual operation, no transfer of control. It’s one operation.
Obviously, though, considered as an associative memory device, an LLM is capable of much more than passive storage and retrieval. It performs analytic and synthetic operations over the memory based on the prompt, which is just a probe (‘reference beam’ in holographic terms) into an associative memory. We’ve got to understand how the memory is structured so that that is possible.
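The 'one-shot' lookup described above can be illustrated with a small digital analogue. The sketch below uses circular convolution to superimpose several key-value bindings in a single distributed trace, and circular correlation to probe the whole trace with one key. This is the scheme later known as holographic reduced representations, swapped in here as an assumption; it is not Yevick's optical formulation, and it is certainly not how an LLM works. The item names, vector size, and seed are arbitrary, and the "one operation" is one logical operation, computed here with plain loops.

```python
import random


def random_vector(n, rng):
    """Random pattern with expected squared length 1."""
    scale = n ** -0.5
    return [rng.gauss(0.0, scale) for _ in range(n)]


def circular_convolve(a, b):
    """Bind two patterns into one trace of the same size."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_correlate(key, trace):
    """Probe the whole trace with a key; approximately undoes the binding."""
    n = len(key)
    return [sum(key[j] * trace[(i + j) % n] for j in range(n)) for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def demo(n=512, seed=0):
    rng = random.Random(seed)
    names = ["shark", "town", "boat"]  # arbitrary labels for the stored items
    keys = {name: random_vector(n, rng) for name in names}
    values = {name: random_vector(n, rng) for name in names}

    # Superimpose all key-value bindings in a single distributed trace.
    trace = [0.0] * n
    for name in names:
        bound = circular_convolve(keys[name], values[name])
        trace = [t + b for t, b in zip(trace, bound)]

    # One probe per key: the result is a noisy copy of the stored value,
    # which we clean up by picking the best-matching known value.
    recalled = {}
    for name in names:
        noisy = circular_correlate(keys[name], trace)
        recalled[name] = max(names, key=lambda v: dot(noisy, values[v]))
    return recalled
```

Calling demo() returns a dictionary mapping each key's label to the label of the value it retrieved. The point of the sketch is that every element of the trace takes part in every retrieval; there is no address and no search loop over stored items, only the final clean-up step compares against known values.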
More later.
A few comments before later. 😉
Thanks for the clarification! I guess you already noticed how research centers in cognitive science seem to have a failure mode over a specific value question: Do we seek excellence at the risk of overfitting to funding agency criteria, or do we seek fidelity to our interdisciplinary mission at the risk of compromising growth?
I certainly agree that, before the GPUs, the connectionist approach had a very small share of the excellence tokens. But it was already instrumental in providing a common conceptual framework beyond cognitivism. As an example, even the first PCs were enough to run toy examples of double dissociation using networks structured by sensory type rather than by cognitive operation. From a neuropsychological point of view, that was already a key result. And for the neuroscientist in me, toy models like Kohonen maps were already key to making sense of why we need so many short inhibitory neurons in grid-like cortical structures.
Like a refresh rate? That would fit the evidence for a 3-7 Hz refresh rate of our cartesian theater, or the way LLMs go through prompt/answer cycles. Do you see other potential uses for this concept?
What’s wrong with « the distributed way »?
In a paper I wrote a while back I cite the late Walter Freeman as arguing that “consciousness arises as discontinuous whole-hemisphere states succeeding one another at a “frame rate” of 6 Hz to 10 Hz” (p. 2). I’m willing to speculate that that’s your ‘one-shot’ refresh rate. BTW, Freeman didn’t believe in a Cartesian theater and neither do I; the imagery of the stage ‘up there’ and the seating area ‘back here’ is not at all helpful. We’re not talking about some specific location or space in the brain; we’re talking about a process.
Well, of course, “the distributed way.” But what is that? Prompt engineering is about maneuvering your way through the LLM; you’re attempting to manipulate the structure inherent in those weights to produce a specific result you want.
That 1978 comment of Yevick’s that I quote in that blog post I mentioned somewhere up there was in response to an article by John Haugeland evaluating cognitivism. He wondered whether or not there was an alternative and suggested holography as a possibility. He didn’t make a very plausible case and few of the commentators took it as a serious alternative.
People were looking for alternatives. But it took a while for connectionism to build up a record of interesting results, on the one hand, and for cognitivism to begin seeming stale, on the other. It’s the combination of the two that brought about significant intellectual change. Or that’s my speculation.
It’s possible. I don’t think there was relevant human data in Walter Freeman’s time, so I’m willing to speculate that’s indeed the frame rate in mice. But I didn’t check the literature he had access to, so just a wild guess.
I agree there’s no seating area. I still find the concept of a cartesian theater useful. For example, it allows knowing where to plant electrodes if you want to access the visual cartesian theater for rehabilitation purposes. I guess you’d agree that can be helpful. 😉
I have friends who believe that, but they can’t explain why the brain needs that much ordering in the sensory areas. What’s your own take?
You know the backprop algorithm? That’s a mathematical model for the distributed way. It was recently shown that it produces networks that explain (statistically speaking) most of the properties of the BOLD cortical response in our visual system. So, whatever the biological cortices actually do, it turns out to be equivalent for the « distributed memory » aspect.
I wonder if that’s too flattering for connectionism, which mostly stalled until the early breakthrough in computer vision suddenly attracted every lab. BTW
Is accessing the visual cartesian theater physically different from accessing the visual cortex? Granted, there’s a lot of visual cortex, and different regions seem to have different functions. Is the visual cartesian theater some specific region of visual cortex?
I’m not sure what your question about ordering in sensory areas is about.
As for backprop, that gets the distribution done, but that’s only part of the problem. In LLMs, for example, it seems that syntactic information is handled in the first few layers of the model. Given the way texts are structured, it makes sense that sentence-level information should be segregated from information about collections of sentences. That’s the kind of structure I’m talking about. Sure, backprop is responsible for those layers, but it’s responsible for all the other layers as well. Why do we seem to have different kinds of information in different layers at all? That’s what interests me.
Actually, it just makes sense to me that that is the case. Given that it is, what is located where? As for why things are segregated by location, that does need an answer, doesn’t it. Is that what you were asking?
Finally, here’s an idea I’ve been playing around with for a long time: Neural Recognizers: Some [old] notes based on a TV tube metaphor [perceptual contact with the world].
In my view: yes, no. To put some flesh on the bone, my working hypothesis is: what’s conscious is gamma activity within an isocortex connected to the claustrum (because that’s the information which will get selected for the next conscious frame/can be considered as in working memory)
You said: what matters is temporal dynamics. I said: why so many maps if what matters is timing?
The closer to the input, the more sensory. The closer to the output, the more motor. The closer to the restrictions, the easier to interpret activity as latent space. Is there any regularity that you find hard to interpret this way?
Thanks, I’ll go read. Don’t hesitate to add other links that can help understand your vision.
“You said: what matters is temporal dynamics”
You mean this: “We’re not talking about some specific location or space in the brain; we’re talking about a process.”
If so, all I meant was a process that can take place pretty much anywhere. Consciousness can pretty much ‘float’ to wherever it’s needed.
Since you asked for more, why not this: Direct Brain-to-Brain Thought Transfer: A High Tech Fantasy that Won’t Work.
You mean there’s some key difference in meaning between your original formulation and my reformulation? Care to elaborate and formulate some specific prediction?
As an example, I once had a try at interpreting data from the olfactory system for a friend who was wondering if we could find signs of a chaotic attractor. If you ever toy with the Lorenz model, one key feature is: you either see the attractor by plotting x vs y vs z, or you can see it by plotting just one of these variables vs itself at t+delta vs itself at t+2*delta (for many deltas). In other words, that gives a precise feature you can look for (I didn’t find any, and nowadays it seems accepted that odors are location specific, like every other sense). Do you have a better idea or is it more or less what you’d have tried?
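For readers who have not played with the Lorenz model, the delay-plot trick described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the classic parameter values (sigma = 10, rho = 28, beta = 8/3), a fixed-step fourth-order Runge-Kutta integrator, and an arbitrary step size, starting point, and lag. It has nothing to do with the olfactory data mentioned in the comment.

```python
def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system one step with classical fourth-order Runge-Kutta."""

    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, dt / 2))
    k3 = deriv(shifted(state, k2, dt / 2))
    k4 = deriv(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_x(num_steps, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Return only the x variable, as if it were the one signal we could record."""
    state, xs = start, []
    for _ in range(num_steps):
        state = lorenz_step(state, dt)
        xs.append(state[0])
    return xs


def delay_embed(series, lag):
    """Points (x(t), x(t + lag), x(t + 2*lag)) built from a single series."""
    return [
        (series[i], series[i + lag], series[i + 2 * lag])
        for i in range(len(series) - 2 * lag)
    ]
```

For example, delay_embed(simulate_x(5000), 10) yields a list of 3D points that can be handed to any plotting tool; repeating it for several lags corresponds to the "for many deltas" step. With recorded data, one would replace simulate_x with the measured series and look for comparable structure.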
I’ve lost the thread entirely. Where have I ever said or implied that odors are not location specific or that anything else is not location specific? And how specific are you about location? Are we talking about centimeters (or more), millimeters, individual cortical columns?
What’s so obscure about the idea that consciousness is a process that can take place pretty much anywhere? Though maybe it’s confined to interaction within the cortex and between subcortical areas; I’ve not given that one much thought. BTW, I take my conception of consciousness from William Powers, who didn’t speculate about its location in the brain.
Nothing at all. I’m a big fan of these kinds of ideas and I’d love to present yours to some friends, but I’m afraid they’ll get dismissive if I can’t translate your thoughts into their usual frame of reference. But I get that you didn’t work on this aspect specifically; there are many fields in cognitive sciences.
About how much specificity, it’s up to interpretation. A (1k by 1k by frame by cell type by density) tensor representing the cortical columns within the granular cortices is indeed a promising interpretation, although it’d probably be short of an extrapyramidal tensor (and maybe an agranular one).
Well, when Walter Freeman was working on the olfactory cortex of rodents he was using a surface mounted 8x8 matrix of electrodes. I assume that was measured in millimeters. In his 1999 paper Consciousness, Intentionality, and Causality (paragraphs 36-43) he discusses a hemisphere-wide global operator (42):
Later (43):
He goes on from there. I’m not sure whether he came back to that idea before he died in 2016. I haven’t found it, didn’t do an exhaustive search, but I did look.