Like the reward signal in reinforcement learning, next-token prediction is a simple feedback signal that masks a lot of complexity behind the scenes. Predicting the next token requires the model, first of all, to estimate what sort of persona should be speaking, what that persona knows, how they speak, what the context is, and what they are trying to communicate. Self-attention with multiple attention heads at every layer of the Transformer allows the LLM to keep track of all these things. It’s probably not the best way to do it, but it works.
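To make that mechanism a bit more concrete, here is a minimal numpy sketch of the two ingredients mentioned above: causal multi-head self-attention and the next-token-prediction loss. The weights are random and the sizes are made up, so this is only an illustration of the mechanism, not how any real LLM is implemented.

```python
# A minimal sketch of causal multi-head self-attention and the next-token loss.
# All names, sizes, and the random "model" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_multi_head_attention(x, n_heads, Wq, Wk, Wv, Wo):
    """x: (seq_len, d_model). Each head attends only to current and past tokens."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv                      # (seq_len, d_model) each
    split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)                # (n_heads, seq_len, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)                 # causal mask: no peeking ahead
    out = softmax(scores) @ v                             # (n_heads, seq_len, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting token t+1 at every position t."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Toy numbers: 8 tokens from a 100-word vocabulary, d_model=16, 4 heads.
vocab, d_model, n_heads, seq_len = 100, 16, 4, 8
tokens = rng.integers(0, vocab, size=seq_len + 1)
embed = rng.normal(size=(vocab, d_model)) * 0.1
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
W_unembed = rng.normal(size=(d_model, vocab)) * 0.1

x = embed[tokens[:-1]]                                    # inputs: tokens 0..n-1
h = causal_multi_head_attention(x, n_heads, Wq, Wk, Wv, Wo)
logits = h @ W_unembed
print("next-token loss:", next_token_loss(logits, tokens[1:]))  # targets: tokens 1..n
```

Training simply pushes that loss down over enormous amounts of text; everything about persona, context, and intent has to be squeezed into the weights because that is the only way to make the loss small.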
Human brains, and the cortex in particular, give us a powerful way to map all of this sort of information. We can map out our current mental model and predict a map of our interlocutor’s, looking for gaps in each and planning words either to fill in our own gaps (e.g., by asking questions) or to fill in theirs (with what we actually think or with what we want them to think, depending on our goals). I would also say that natural language is actually a sort of programming language, allowing humans to share cognitive programs between minds: programs of behavior or world modelling.
I also asked your question to ChatGPT, and here is what it had to say:
It is difficult to speculate about what might be left of human cognition if we were able to factor out next token prediction, as this is a complex and multifaceted aspect of human thought and communication. There are many other important aspects of human cognition beyond next token prediction, including things like perception, attention, memory, problem-solving, decision-making, emotion, and social cognition.
One aspect of human cognition that may be relevant in this context is the ability to understand and interpret the meaning and intention behind spoken and written communication. This involves not just predicting the next word or phrase, but also being able to understand the context in which it is being used, the relationships between words and phrases, and the overall meaning and purpose of the communication. It also involves the ability to generate and express one’s own ideas and thoughts through language.
Another aspect of human cognition that may be relevant is the ability to experience and process sensory information and emotions. This includes things like the ability to see, hear, touch, taste, and smell, as well as the ability to feel and express emotions.
It is worth noting that these are just a few examples of the many complex and interrelated aspects of human cognition, and it is difficult to fully disentangle and isolate any one aspect from the others.