I understand your argument as something like “GPT is not just predicting the next token because it clearly ‘plans’ further ahead than just the next token”.
But “looking ahead” is required to correctly predict the next token and (I believe) flows naturally from the paradigm of “predicting the next token”.
That is, based on past experience in similar contexts, it makes its best guess about what will happen next. Is that right? How far back does it look?
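To make “how far back does it look” concrete, here is a minimal sketch using GPT-2 via the Hugging Face transformers library as a stand-in (the model name, the prompt, and the 1024-token window are specifics of GPT-2, not of ChatGPT): the model conditions on everything inside its context window, and on nothing earlier, when producing its distribution over the next token.

```python
# Minimal sketch: a causal LM's "best guess about what will happen next"
# is a probability distribution over the next token, conditioned on the
# tokens in its context window (at most model.config.n_positions of them).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

print("context window (tokens):", model.config.n_positions)  # 1024 for GPT-2

context = "Once upon a time, in a kingdom by the sea, there lived a"
ids = tok(context, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                 # (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token

# Show the top-5 candidate continuations and their probabilities.
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(int(i))!r:>12}  {p.item():.3f}")
```

So the mechanical answer is “as far back as the context window reaches”; the more interesting question, which the story levels below get at, is which parts of that window actually matter at a given point.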
I’ve been examining stories that are organized on three levels: 1) the whole story, 2) major segments, and 3) sentences within major segments. The relevant past differs at each of those levels.
At the level of the whole story, at the beginning the relevant past is either the prompt that gave rise to the story or some ChatGPT text in which a story is called for. At the end of the story, ChatGPT may go into a wait state if it is responding to an external prompt, or pick up where it left off if it told the story in the context of something else – a possibility I think I’ll explore a bit. At the level of a major segment, the relevant context is the story up to that point. And at the level of the individual sentence, the relevant context is the segment up to that point.
My model is that LLMs use something like “intuition” rather than “rules” to predict text, even though intuitions can be expressed in terms of mathematical rules – just more fluid ones than what we usually think of as “rules”.
My specific guess is that the gradient descent process that produced GPT has learned to identify high-level patterns/structures in texts (and specifically, stories), and uses them to guide its prediction.
So, perhaps, as it is predicting the next token, it has a “sense” of:
- that the text it is writing/predicting is a story
- what kind of story it is
- which part of the story it is in now
- perhaps how the story might end (is this a happy story or a sad story?)
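One way a guess like this could be probed (a sketch only, assuming the Hugging Face transformers and scikit-learn libraries; the toy story and the crude beginning/middle/end labels are invented for the example): train a simple linear classifier on GPT-2’s hidden states and see whether they already encode which part of a story a token sits in.

```python
# Toy probing sketch: if the model's hidden states carry a "sense" of which
# part of the story we are in, a linear probe trained on those states should
# recover a crude position label better than chance.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

story = (
    "Once upon a time, a young fox lived at the edge of a quiet forest. "
    "One day the river flooded and the fox had to flee its den. "
    "After many trials it found a new home, and it lived there happily."
)

ids = tok(story, return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)
hidden = out.hidden_states[-1][0]   # (seq_len, hidden_dim), last layer

# Crude labels: which third of the story each token falls in
# (0 = beginning, 1 = middle, 2 = end).
n = hidden.shape[0]
labels = [min(2, 3 * i // n) for i in range(n)]

# Fit the probe on even-indexed tokens, evaluate on odd-indexed ones.
X = hidden.numpy()
probe = LogisticRegression(max_iter=1000).fit(X[0::2], labels[0::2])
print("probe accuracy:", probe.score(X[1::2], labels[1::2]))
```

A single toy story proves nothing on its own, but run over many stories this is the usual shape of the argument that such structure is represented internally.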
This makes me think of top-down vs. bottom-up processing. To some degree, the next token is predicted by local structures (grammar, sentence structure, etc.); to some degree, it is predicted by global structures (the narrative of a story, the overall purpose/intent of the text). (There are also intermediate layers of organization, not just “local” and “global”.) I imagine that GPT identifies both the local structures and the global structures (has neuron “clusters” that detect the kinds of structures it is familiar with) and synergizes them into its probability outputs for next-token prediction.
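As a toy illustration of that “synergizing” (not a claim about GPT’s actual mechanism; the vocabulary and counts are invented for the example): combine a local next-word distribution with a global “kind of story” prior by multiplying and renormalizing.

```python
# Toy sketch: blending local and global structure into one next-token
# distribution. All numbers here are made up for illustration.

def normalize(d):
    total = sum(d.values())
    return {w: p / total for w, p in d.items()}

# Local structure: what tends to follow the previous words (a crude bigram-style model).
local = normalize({"happily": 3, "sadly": 2, "alone": 1})

# Global structure: the story so far reads as a happy story, which shifts
# the prior toward upbeat continuations.
global_prior = normalize({"happily": 5, "sadly": 1, "alone": 1})

# One simple way to combine them: multiply and renormalize
# (a product-of-experts-style combination).
combined = normalize({w: local[w] * global_prior[w] for w in local})

for w, p in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{w:>8s}  {p:.3f}")
```

In a transformer the two kinds of evidence aren’t separate modules that get multiplied at the end, of course; the point of the sketch is only that local and global constraints can both shape a single next-token distribution.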
Makes sense to me.
I wonder if those induction heads identified by the folks at Anthropic played a role in identifying those “high-level patterns/structures in texts...”