For those readers who hope to make use of AI romantic companions, I do also have some warnings:
You should know in a rough sense how the AI works and the ways in which it’s not a human.
For most current LLMs, a very important point is that they have no memory, other than the text they read in a context window. When generating each token, they “re-read” everything in the context window before predicting. None of their internal calculations are preserved when predicting the next token, everything is forgotten and the entire context window is re-read again.
LLMs can be quite dumb, though not always in the ways a human would expect. Some of this has to do with the wacky way we force them to generate text; see above.
A human might think about you even when they’re not actively talking to you but just going about their day. Of course, most of the time they aren’t thinking about you at all; their personality is continually developing and changing based on the events of their lives. LLMs don’t go about their day or have an independent existence at all, really; they’re just there to respond to prompts.
In the future, some of these facts may change; the AIs may become more human-like, or at least more agent-like. You should know all such details about your AI companion of choice.
Not your weights, not your AI GF/BF
What hot new startups can give, hot new startups can take away. If you’re going to have an emotional attachment to one of these things, it’s only prudent to make sure your ability to run it is independent of the whims and financial fortunes of some random company. Download the weights, and keep an up-to-date local copy of all the context the AI uses as its “memory”.
See point 1: knowing roughly how the thing works is helpful for this.
A backup you haven’t tested isn’t a backup.
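To make the “no memory” point in the first warning concrete, here is a toy sketch of the generation loop (Python; `model` is a hypothetical stand-in for a full forward pass, not any real library’s API):

```python
# Toy sketch of autoregressive generation. `model` is a hypothetical stand-in
# for a full forward pass over whatever tokens it is handed, not a real API.
def model(tokens):
    """A real LLM would return the most likely next token for this context;
    here we just return a placeholder so the sketch runs."""
    return " blah"

context = ["Hello", ",", " how", " are", " you", "?"]  # all the model can "see"

for _ in range(5):
    next_token = model(context)   # the entire context is processed afresh
    context.append(next_token)    # the growing context is the only "memory"

# Nothing outside `context` persists from one step to the next; delete a line
# from it and, as far as the model is concerned, that line never existed.
```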
When generating each token, they “re-read” everything in the context window before predicting. None of their internal calculations are preserved when predicting the next token, everything is forgotten and the entire context window is re-read again.
Given that KV caching is a thing, the way I chose to phrase this is very misleading / outright wrong in retrospect. While of course inference could be done in this way, it’s not the most efficient, and one could even make a similar statement about certain inefficient ways of simulating a person’s thoughts.
If I were to rephrase, I’d put it this way: “Any sufficiently long serial computation the model performs must be mediated by the stream of tokens. Internal activations can only be passed forwards to the next layer of the model, and there are only a finite number of layers. Hence, if information must be processed in more sequential steps than there are layers, the only option is for that information to be written out to the token stream, then processed further from there.”
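One toy way to see the “finite number of layers” constraint (an analogy only, not a real Transformer; the layer count and the stand-in operation are made up for illustration):

```python
L = 4  # pretend the model has 4 layers (made-up number)

def forward_pass(x, steps_wanted):
    """One forward pass: activations only flow forward, so at most L
    sequential steps of work can happen inside it."""
    for _ in range(min(steps_wanted, L)):
        x = 2 * x + 1            # stand-in for one layer's worth of serial work
    return x

def via_token_stream(x, steps_wanted):
    """Longer serial computations: write the intermediate result 'out to the
    token stream' and feed it back in for another pass, as often as needed."""
    while steps_wanted > 0:
        x = forward_pass(x, steps_wanted)
        steps_wanted -= min(steps_wanted, L)
    return x

# 10 sequential steps don't fit in one 4-layer pass, but chaining passes
# through the external stream gets the full computation done.
print(forward_pass(3, 10))       # only 4 of the 10 steps happen: 63
print(via_token_stream(3, 10))   # all 10 steps happen: 4095
```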
I think it’s a very important thing to know about Transformers, as our intuition about these models is that there must be some sort of hidden state or on the fly adaptation, and this is at least potentially true of other models. (For example, in RNNs, it’s a useful trick to run the RNN through the ‘context window’ and then loop back around and input the final hidden state at the beginning, and ‘reread’ the ‘context window’ before dealing with new input. Or there’s dynamic evaluation, where the RNN is trained on the fly, for much better results, and that is very unlike almost all Transformer uses. And of course, RNNs have long had various kinds of adaptive computation where they can update the hidden state repeatedly on repeated or null inputs to ‘ponder’.)
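For concreteness, the “loop back around” trick might look something like this minimal sketch, using an untrained toy RNN in numpy with arbitrary sizes and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 16                         # toy vocabulary and hidden sizes
Wxh = rng.normal(0, 0.1, (V, H))      # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))      # hidden-to-hidden weights

def run(tokens, h):
    """Advance the hidden state over a sequence of token ids."""
    for t in tokens:
        h = np.tanh(Wxh[t] + h @ Whh)
    return h

context = [1, 4, 2, 7, 3]

# First pass: read the 'context window' starting from a blank hidden state.
h = run(context, np.zeros(H))

# Loop back around: feed the final hidden state in at the beginning and
# re-read the same context, so every token is read 'knowing' how the context
# ends, before any new input is dealt with.
h = run(context, h)
```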
But I don’t think your rewrite is better, because it’s focused on a different thing entirely, and loses the Memento-like aspect of how Transformers work—that there is nothing ‘outside’ the context window. The KV cache strikes me as quibbling: the KV cache is more efficient, but it works only because it is mathematically identical and is caching the computations which are identical every time.
I would just rewrite that as something like,
Transformers have no memory and do not change or learn from session to session. A Transformer must read everything in the context window before predicting the next token. If that token is added to the context, the Transformer repeats all the same calculations, and then some more for the new token. It doesn’t “remember” having predicted that token before. Mathematically, it is as if, for each token you want to generate, the Transformer wakes up and sees the context window for the first time. (This can be, and usually is, highly optimized to avoid actually repeating those calculations, but the output is the same.)
So, if something is not in the current context window, then for a Transformer, it never existed. This means it is limited to thinking only about what is in the current context window, for a short time, until it predicts the next token. And then the next Transformer has to start from scratch when it wakes up and sees the new context window.
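The “the output is the same” parenthetical can be checked directly with a minimal single-head attention sketch in numpy (toy sizes, random weights; multiple heads, MLPs, and layer norms omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # toy head dimension
Wq, Wk, Wv = (rng.normal(0, 0.3, (d, d)) for _ in range(3))

def attend(q, K, V):
    """One query attending over all keys/values seen so far."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

xs = rng.normal(size=(5, d))                 # stand-ins for embedded tokens

# "Memento" version: at every position, recompute keys and values for the
# whole prefix from scratch, as if seeing the context for the first time.
full = [attend(x @ Wq, xs[:i + 1] @ Wk, xs[:i + 1] @ Wv)
        for i, x in enumerate(xs)]

# KV-cached version: keep earlier keys/values around and only compute the
# new position's entries.
Ks, Vs, cached = [], [], []
for x in xs:
    Ks.append(x @ Wk)
    Vs.append(x @ Wv)
    cached.append(attend(x @ Wq, np.array(Ks), np.array(Vs)))

# Identical outputs: the cache only skips redundant arithmetic.
assert np.allclose(full, cached)
```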
Good point, the whole “model treats tokens it previously produced and tokens that are part of the input exactly the same” thing and the whole “model doesn’t learn across usages” thing are also very important.