When generating each token, they “re-read” everything in the context window before predicting. None of their internal calculations are preserved when predicting the next token; everything is forgotten and the entire context window is re-read again.
Given that KV caching is a thing, the way I chose to phrase this is very misleading / outright wrong in retrospect. While of course inference could be done in this way, it’s not the most efficient, and one could even make a similar statement about certain inefficient ways of simulating a person’s thoughts.
If I were to rephrase, I’d put it this way: “Any sufficiently long serial computation the model performs must be mediated by the stream of tokens. Internal activations can only be passed forwards to the next layer of the model, and there are only a finite number of layers. Hence, if information must be processed in more sequential steps than there are layers, the only option is for that information to be written out to the token stream, then processed further from there.”
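To make the “finite number of layers” point concrete, here is a toy numpy sketch (everything in it is made up: random, untrained weights and a crude bag-of-tokens stand-in, not a real Transformer). Within one pass, activations only flow forward through a fixed number of layers and the only output is the next token, so any longer serial computation has to go through the token stream:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, N_LAYERS = 16, 8, 4

# Made-up stand-ins for the real components (random, untrained weights):
EMBED = rng.normal(size=(VOCAB, DIM))
LAYERS = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]
UNEMBED = rng.normal(size=(DIM, VOCAB))

def next_token(context):
    """One forward pass: activations only move forward, layer by layer.
    At most N_LAYERS sequential steps happen before the next token is
    predicted; there is no way to loop back for more 'thinking'."""
    h = EMBED[context].mean(axis=0)      # crude summary of the context
    for w in LAYERS:                     # fixed serial depth
        h = np.maximum(h @ w, 0.0)
    return int(np.argmax(h @ UNEMBED))

# Any computation needing more serial steps than N_LAYERS must be
# mediated by the token stream: write a token out, read it back in.
context = [1, 2, 3]
for _ in range(5):
    context.append(next_token(context))
print(context)
```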
I think it’s a very important thing to know about Transformers, because our intuition about these models is that there must be some sort of hidden state or on-the-fly adaptation, and that is at least potentially true of other models. (For example, with RNNs it’s a useful trick to run the RNN through the ‘context window’, then loop back around and feed the final hidden state in at the beginning, ‘rereading’ the ‘context window’ before dealing with new input. Or there’s dynamic evaluation, where the RNN is trained on the fly, with much better results, which is very unlike almost all Transformer uses. And of course, RNNs have long had various kinds of adaptive computation, where they can update the hidden state repeatedly on repeated or null inputs in order to ‘ponder’.)
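To illustrate the first of those tricks, here is a toy sketch (a plain tanh recurrence with random, untrained weights, standing in for any real RNN cell): the hidden state persists across steps, so you can loop it back around and ‘reread’ the context before handling new input, something a Transformer has no analogue of:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
W_H = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)   # toy, untrained weights
W_X = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)

def rnn(h, xs):
    """A plain tanh recurrence: the hidden state h carries over step to step."""
    for x in xs:
        h = np.tanh(h @ W_H + x @ W_X)
    return h

context = rng.normal(size=(10, DIM))      # the 'context window'
new_input = rng.normal(size=(3, DIM))

h = rnn(np.zeros(DIM), context)           # first pass over the context
h = rnn(h, context)                       # loop back: 'reread' the context,
h = rnn(h, new_input)                     # this time starting from the state
                                          # accumulated on the first pass
```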
But I don’t think your rewrite is better, because it’s focused on a different thing entirely, and loses the Memento-like aspect of how Transformers work: that there is nothing ‘outside’ the context window. The KV cache strikes me as quibbling: it is more efficient, but only because it is mathematically identical, caching computations which would otherwise be repeated unchanged every time.
I would just rewrite that as something like,
Transformers have no memory and do not change or learn from session to session. A Transformer must read everything in the context window before predicting the next token. If that token is added to the context, the Transformer repeats all the same calculations and then some more for the new token. It doesn’t “remember” having predicted that token before. Mathematically, it is as if, for each token you want to generate, the Transformer wakes up and sees the context window for the first time. (This can be, and usually is, highly optimized to avoid actually repeating those calculations, but the output is the same.)
So, if something is not in the current context window, then for a Transformer, it never existed. This means it is limited to thinking only about what is in the current context window, for a short time, until it predicts the next token. And then the next Transformer has to start from scratch when it wakes up and sees the new context window.
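To make the “the output is the same” parenthetical (and the “mathematically identical” point above) concrete, here is a minimal single-head numpy sketch (random weights, one head, no batching; nothing here comes from a real implementation): recomputing causal attention from scratch and incrementally reusing a KV cache give identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4                                   # toy sequence length, head dim
X = rng.normal(size=(T, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend_full(x):
    """Recompute causal attention from scratch over the whole context."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(D)
    scores[np.triu_indices(len(x), k=1)] = -np.inf     # causal mask
    return softmax(scores) @ v

def attend_cached(x):
    """Process tokens one at a time, reusing cached keys and values."""
    k_cache, v_cache, outs = [], [], []
    for t in range(len(x)):
        q = x[t] @ W_q
        k_cache.append(x[t] @ W_k)      # old keys/values never change,
        v_cache.append(x[t] @ W_v)      # so they are computed exactly once
        K, V = np.stack(k_cache), np.stack(v_cache)
        outs.append(softmax(q @ K.T / np.sqrt(D)) @ V)
    return np.stack(outs)

assert np.allclose(attend_full(X), attend_cached(X))   # identical outputs
```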
Good point: the whole “model treats tokens it previously produced and tokens that are part of the input exactly the same” thing and the whole “model doesn’t learn across usages” thing are also very important.
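Both points can be illustrated with the same sort of toy sketch (again, random frozen weights and made-up names, nothing like a real model): the model is just a pure function of the token ids, so it cannot tell its own earlier outputs from user-supplied input, and nothing updates between calls.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 16, 8
EMBED = rng.normal(size=(VOCAB, DIM))        # toy, frozen 'weights'
UNEMBED = rng.normal(size=(DIM, VOCAB))

def next_token(context):
    """Toy stand-in for a frozen model: a pure function of the token ids.
    Nothing records who produced which token, and nothing is ever updated."""
    h = np.maximum(EMBED[context].mean(axis=0), 0.0)
    return int(np.argmax(h @ UNEMBED))

# Tokens the model generated itself vs. the same ids supplied as input:
generated = [1, 2, 3]
for _ in range(3):
    generated.append(next_token(generated))
pasted = list(generated)                     # same ids, 'typed in by the user'
assert next_token(pasted) == next_token(generated)   # indistinguishable

# Separate 'sessions': the weights are frozen, so nothing carries over
# from one call to the next, and the same context gives the same answer.
assert next_token([1, 2, 3]) == next_token([1, 2, 3])
```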