Have you thought of exploring the existing literature on the complex dynamics of nervous systems? It’s huge, but it does use the math you guys are borrowing from physics.
I’m thinking in particular of the work of the late Walter Freeman, who was a pioneer in the field. Toward the end of his career he began developing a concept of “cinematic consciousness.” As you know, the movement in motion pictures is an illusion created by the fact that the individual frames of the image are projected on the screen more rapidly than the mind can resolve them. So, while the frames are in fact still, they change so rapidly that we see motion.
First I’ll give you some quotes from Freeman’s article to give you a feel for his thinking (alas, you’ll have to read the article to see how those things connect up), and then I’ll explain what that has to do with LLMs. The bracketed numbers are from Freeman’s article.
[20] EEG evidence shows that the process in the various parts occurs in discontinuous steps (Figure 2), like frames in a motion picture (Freeman, 1975; Barrie, Freeman and Lenhart, 1996).
[23] Everything that a human or an animal knows comes from the circular causality of action, preafference, perception, and up-date. It is done by successive frames of self-organized activity patterns in the sensory and limbic cortices. [...]
[35] EEG measurements show that multiple patterns self-organize independently in overlapping time frames in the several sensory and limbic cortices, coexisting with stimulus-driven activity in different areas of the neocortex, which structurally is an undivided sheet of neuropil in each hemisphere receiving the projections of sensory pathways in separated areas. [...]
[86] Science provides knowledge of relations among objects in the world, whereas technology provides tools for intervention into the relations by humans with intent to control the objects. The acausal science of understanding the self distinctively differs from the causal technology of self-control. “Circular causality” in self-organizing systems is a concept that is useful to describe interactions between microscopic neurons in assemblies and the macroscopic emergent state variable that organizes them. In this review intentional action is ascribed to the activities of the subsystems. Awareness (fleeting frames) and consciousness (continual operator) are ascribed to a hemisphere-wide order parameter constituting a global brain state. Linear causal inference is appropriate and essential for planning and interpreting human actions and personal relations, but it can be misleading when it is applied to microscopic-macroscopic relations in brains.
Notice that Freeman refers to “a hemisphere-wide order parameter constituting a global brain state.” The cerebral cortex consists of some 16 billion neurons, each with roughly 10,000 connections. Further, all areas of the cortex have connections with subcortical regions. That’s an awful lot of neurons communicating in parallel in a single time step. As I recall from another article, these frames occur at a rate of 6-7 Hz.
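To get a feel for the magnitudes, here is some back-of-the-envelope arithmetic on those figures (all of them order-of-magnitude estimates, not measurements):

```python
# Rough arithmetic on the figures above: ~16B cortical neurons, ~10K
# connections each, frames at 6-7 Hz. Order-of-magnitude estimates only.

cortical_neurons = 16e9       # ~16 billion neurons in the cerebral cortex
synapses_per_neuron = 10e3    # ~10,000 connections each
frame_rate_hz = 6.5           # midpoint of the 6-7 Hz frame rate

synapses = cortical_neurons * synapses_per_neuron
frame_ms = 1000 / frame_rate_hz

print(f"cortical connections: {synapses:.1e}")    # ~1.6e14 synapses
print(f"one frame lasts:      {frame_ms:.0f} ms") # ~154 ms
```

So each “frame” is a window of roughly 150 ms in which something on the order of 10^14 connections can carry signals in parallel.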
The nervous system operates in parallel. I believe it is known that the brain exhibits a small-world topology, so all neurons are within a relatively small number of links of one another. Though at any moment some neurons will be more active than others, they are all active – the only inactive neuron is a dead neuron. Similarly, ANNs exhibit a high degree of parallelism. LLMs are parallel virtual machines being simulated on so-called von Neumann machines. The use of multiple cores gives some degree of parallelism, but it is quite small in relation to the overall number of parameters in the system.
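The small-world point is easy to see in simulation. Here is a minimal sketch using the Watts-Strogatz model from networkx (a standard toy small-world graph, not a model of cortex; the node count, degree, and rewiring probability are arbitrary choices of mine):

```python
# Small-world illustration: a ring lattice has long paths between nodes,
# but rewiring a small fraction of its edges at random collapses the mean
# path length to a handful of hops. A toy graph, not a model of cortex.
import networkx as nx

n, k, p = 1000, 10, 0.1   # nodes, neighbors per node, rewiring probability

ring = nx.watts_strogatz_graph(n, k, 0.0)           # regular ring lattice
small = nx.connected_watts_strogatz_graph(n, k, p)  # small-world rewiring

print("mean path, ring lattice:", nx.average_shortest_path_length(ring))
print("mean path, small world: ", nx.average_shortest_path_length(small))
# Typically ~50 hops for the lattice vs. ~4-5 hops after rewiring.
```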
I propose that the process of generating a single token in an LLM is comparable to a single “frame” of consciousness in Freeman’s model. All the parameters in the system are visited during a single time step. In the case of ChatGPT, I believe that’s 175B parameters.
Thus the assertion that ChatGPT generates one token at a time, based on the preceding string, while true, is terribly reductive and for that reason misleading. The appearance of a token is in fact more or less a side effect of evolving a trajectory from the initial prompt.
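Here is a schematic sketch of that picture: generation as an iterated map on a high-dimensional state, with the token read off as a side effect of where the trajectory currently is. This is a toy dynamical system of my own devising, not the transformer architecture, and all the sizes are arbitrary:

```python
# Toy "trajectory" picture of generation: iterate a map on a high-dimensional
# state; the emitted token is just a readout of where the trajectory is, and
# every entry of the update matrix participates at every step. A cartoon,
# not the transformer architecture.
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 512, 1000                           # state size, vocabulary size

W = rng.normal(0, 1 / np.sqrt(d), (d, d))      # update: used in full each step
R = rng.normal(0, 1 / np.sqrt(d), (vocab, d))  # readout to token scores
E = rng.normal(0, 1 / np.sqrt(d), (vocab, d))  # token embeddings fed back in

state = rng.normal(size=d)                     # state seeded by the "prompt"
tokens = []
for _ in range(10):
    state = np.tanh(W @ state)                 # evolve the trajectory one step
    token = int(np.argmax(R @ state))          # the token: a greedy readout
    tokens.append(token)
    state = state + E[token]                   # the token re-enters the state

print(tokens)
```

The point of the cartoon is that the loop’s real work is updating `state`; the token stream is a low-dimensional projection of that trajectory.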
Thanks very much for these comments and pointers. I’ll look at them closely and point some others at them too.
You might also look at this:
Andrew M. Saxe, James L. McClelland, and Surya Ganguli, A mathematical theory of semantic development in deep neural networks, PNAS, vol. 116, no. 23, June 4, 2019, 11537-11546, https://www.pnas.org/content/116/23/11537
Abstract: An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
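If you want to see the kind of dynamics the abstract is pointing at, here is a minimal simulation in the spirit of their analysis: a two-layer linear network trained by gradient descent learns the input-output modes in discrete stages, largest first. The four-item dataset, learning rate, and sizes are made-up choices of mine, purely for illustration:

```python
# In the spirit of Saxe, McClelland & Ganguli: a deep *linear* network
# trained by gradient descent from small random weights learns the
# input-output map's singular modes one at a time, in order of singular
# value, i.e. stage-like transitions. Dataset and hyperparameters made up.
import numpy as np

rng = np.random.default_rng(0)

X = np.eye(4)                                  # four one-hot "items"
Y = np.array([[1.0,  1.0,  1.0,  1.0],         # coarse feature  (s = 2.0)
              [0.5,  0.5, -0.5, -0.5],         # mid-level split (s = 1.0)
              [0.25, -0.25, 0.0,  0.0]])       # fine split      (s ~ 0.35)

hidden, lr = 16, 0.01
W1 = rng.normal(0, 1e-3, (hidden, 4))          # small random initial weights
W2 = rng.normal(0, 1e-3, (3, hidden))

for step in range(2001):
    err = W2 @ W1 @ X - Y                      # squared-error gradient descent
    W2 -= lr * err @ (W1 @ X).T
    W1 -= lr * W2.T @ err @ X.T
    if step % 250 == 0:                        # watch the singular values of
        s = np.linalg.svd(W2 @ W1, compute_uv=False)  # the learned map
        print(step, np.round(s, 3))            # modes switch on in stages;
                                               # exact timing varies with seed
```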
BTW, the Annual Review of Condensed Matter Physics has an article on Statistical Mechanics of Deep Learning, by some people from Google Brain and Stanford. I believe the Annual Reviews are now all open access, so you might want to look around. The Annual Review of Linguistics might have some stuff for you.
You’re welcome.