I think it’s a very important thing to know about Transformers, as our intuition about these models is that there must be some sort of hidden state or on-the-fly adaptation, and this is at least potentially true of other models. (For example, in RNNs, it’s a useful trick to run the RNN through the ‘context window’, then loop back around and feed the final hidden state in at the beginning, ‘rereading’ the ‘context window’ before dealing with new input. Or there’s dynamic evaluation, where the RNN is trained on the fly, for much better results, and that is very unlike almost all Transformer uses. And of course, RNNs have long had various kinds of adaptive computation where they can update the hidden state repeatedly on repeated or null inputs to ‘ponder’.)
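(To make the first of those RNN tricks concrete, here’s a minimal PyTorch sketch, not from any particular paper; the module and sizes are illustrative only. The ‘reread’ trick is just: read the context once, keep the final hidden state, then read the same context again starting from that state.)

```python
import torch
import torch.nn as nn

# Toy RNN; sizes and names are illustrative only.
rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
context = torch.randn(1, 100, 32)   # one 100-token 'context window' (as embeddings)

# First pass: read the context once, ending in hidden state h1.
_, h1 = rnn(context)

# The 'reread' trick: loop back around and read the same context again,
# but starting from h1 instead of a zero state, so the RNN revisits the
# beginning already "knowing" how the context ends.
_, h2 = rnn(context, h1)

# h2 is then the state carried forward when new input arrives.
```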
But I don’t think your rewrite is better, because it’s focused on a different thing entirely, and loses the Memento-like aspect of how Transformers work—that there is nothing ‘outside’ the context window. The KV cache strikes me as quibbling: the KV cache is more efficient, but it works only because it is mathematically identical and is caching the computations which are identical every time.
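(A concrete way to see the ‘mathematically identical’ point, as a toy single-head attention sketch rather than any real library’s implementation: the keys/values for the old tokens come out the same whether you recompute them or reuse them, so the cache changes nothing about the output.)

```python
import torch

torch.manual_seed(0)
d = 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # toy projection weights

def attend(q, K, V):
    # single-head scaled dot-product attention for one query position
    return torch.softmax((q @ K.T) / d**0.5, dim=-1) @ V

old = torch.randn(10, d)    # the 10 tokens already in the context (as embeddings)
new = torch.randn(1, d)     # the freshly appended token

# (a) Recompute everything over the full 11-token context.
full = torch.cat([old, new])
out_full = attend(full[-1] @ Wq, full @ Wk, full @ Wv)

# (b) KV cache: the keys/values for the old tokens are the same as last step,
#     so reuse them and only project the new token.
K = torch.cat([old @ Wk, new @ Wk])
V = torch.cat([old @ Wv, new @ Wv])
out_cached = attend(new[0] @ Wq, K, V)

print(torch.allclose(out_full, out_cached))   # True: same output, less recomputation
```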
I would just rewrite that as something like,
Transformers have no memory and do not change or learn from session to session. A Transformer must read everything in the context window before predicting the next token. If that token is added to the context, the Transformer repeats all the same calculations, and then some more for the new token. It doesn’t “remember” having predicted that token before. Mathematically, it is as if, for each token you want to generate, the Transformer wakes up and sees the context window for the first time. (This can be, and usually is, heavily optimized to avoid actually repeating those calculations, but the output is the same.)
So, if something is not in the current context window, then for a Transformer, it never existed. This means it is limited to thinking only about what is in the current context window, for a short time, until it predicts the next token. And then the next Transformer has to start from scratch when it wakes up and sees the new context window.
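(If it helps, the naive decoding loop looks like this, a toy PyTorch sketch with a hypothetical stand-in model rather than any real LM: the whole context window goes back through the model for every single new token, and nothing else carries over between steps.)

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a causal LM; any model mapping tokens -> next-token
# logits would do, the point is only the shape of the loop.
vocab, d = 100, 32
embed = nn.Embedding(vocab, d)
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
head = nn.Linear(d, vocab)

context = torch.randint(0, vocab, (1, 20))   # the current context window

for _ in range(5):
    # Every step, the *entire* context window goes through the model again;
    # nothing carries over from the previous step except the tokens themselves.
    mask = nn.Transformer.generate_square_subsequent_mask(context.size(1))
    h = layer(embed(context), src_mask=mask)
    next_token = head(h[:, -1]).argmax(dim=-1, keepdim=True)
    context = torch.cat([context, next_token], dim=1)
```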
Good point, the whole “model treats tokens it previously produced and tokens that are part of the input exactly the same” thing and the whole “model doesn’t learn across usages” thing are also very important.