My model is that LLMs use something like "intuition" rather than "rules" to predict text, even though intuitions can themselves be expressed as mathematical rules, just more fluid ones than what we usually call "rules".
My specific guess is that the gradient descent process that produced GPT has led it to identify high-level patterns/structures in texts (and specifically in stories), and to use them to guide its predictions.
So, perhaps, as it is predicting the next token, it has a "sense" of:
- that the text it is writing/predicting is a story
- what kind of story it is
- which part of the story it is in now
- perhaps how the story might end (is this a happy story or a sad story?)
This makes me think of top-down vs. bottom-up processing. To some degree, the next token is predicted by local structures (grammar, sentence structure, etc.); to some degree, it is predicted by global structures (the narrative of a story, the overall purpose/intent of the text). (There are also intermediate layers of organization, not just "local" and "global".) I imagine that GPT identifies both the local and the global structures (has neuron "clusters" that detect the kinds of structure it is familiar with) and combines them into its probability outputs for next-token prediction.
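To make the "combining" idea concrete, here is a toy sketch, not a claim about GPT's actual internals: a made-up six-word vocabulary, a random bigram table standing in for local structure, and a hand-written per-genre bias standing in for global structure. Every name and number here is invented for illustration.

```python
import numpy as np

# Toy illustration only: none of this reflects GPT's actual internals.
# "Local" structure = a random bigram table; "global" structure = a
# hand-written per-genre bias over the vocabulary.

VOCAB = ["the", "dragon", "princess", "theorem", "proof", "castle"]
V = len(VOCAB)

rng = np.random.default_rng(0)
local_logits = rng.normal(size=(V, V))  # logits[i, j]: score of token j after token i

genre_bias = {
    "fairy_tale": np.array([0.0, 2.0, 2.0, -2.0, -2.0, 2.0]),
    "math_paper": np.array([0.0, -2.0, -2.0, 2.0, 2.0, -2.0]),
}

def next_token_probs(prev_token: str, genre: str, alpha: float = 1.0) -> np.ndarray:
    """Mix the local (bigram) signal with the global (genre) signal."""
    logits = local_logits[VOCAB.index(prev_token)] + alpha * genre_bias[genre]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Same local context, different global "sense" of the text, different predictions:
print(dict(zip(VOCAB, next_token_probs("the", "fairy_tale").round(3))))
print(dict(zip(VOCAB, next_token_probs("the", "math_paper").round(3))))
```

The point is only that the same local context ("the ...") yields different next-token distributions once a global signal is mixed into the logits.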
Makes sense to me.
I wonder if those induction heads identified by the folks at Anthropic played a role in identifying those “high-level patterns/structures in texts...”
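For reference, the induction-head pattern Anthropic describes is roughly "[A][B] ... [A] → predict [B]": attend back to an earlier occurrence of the current token and copy what followed it. Here is a minimal loop-based caricature of that match-and-copy behavior; real induction heads implement it via attention over the context, so this sketch illustrates the pattern, not the mechanism.

```python
from collections import Counter

def induction_vote(tokens: list[str]) -> Counter:
    """Caricature of the induction-head pattern: [A][B] ... [A] -> predict [B].

    Looks for earlier occurrences of the current (last) token and votes for
    whatever token followed each one. Real induction heads do this
    match-and-copy via attention, not an explicit loop.
    """
    current = tokens[-1]
    votes: Counter = Counter()
    for i in range(len(tokens) - 1):
        if tokens[i] == current:
            votes[tokens[i + 1]] += 1
    return votes

context = "once upon a time the dragon slept . the dragon".split()
print(induction_vote(context))  # Counter({'slept': 1})
```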