But one thing that has completely surprised me is that these LLMs and other systems like them are all feed-forward. It’s like the firing of the neurons is going only in one direction. And I would never have thought that deep thinking could come out of a network that only goes in one direction, out of firing neurons in only one direction. And that doesn’t make sense to me, but that just shows that I’m naive.
I felt exactly the same, until I read this June 2020 paper: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.
It turns out that using Transformers in the autoregressive mode (with output tokens being added back to the input by concatenating the previous input and the new output token, and sending the new version of the input through the model again and again) results in them emulating the dynamics of recurrent neural networks, and that clarifies things a lot...
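(To make that append-and-rerun loop concrete, here is a minimal sketch; the `model` callable, which is assumed to map a token sequence to next-token logits, and the greedy argmax decoding are my own simplifying assumptions, not anything from the paper.)

```python
# Minimal sketch of the autoregressive loop described above.
# `model` is a hypothetical callable mapping a token sequence to next-token logits;
# greedy decoding is used purely for simplicity.
import numpy as np

def generate(model, prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        logits = model(tokens)               # one full feed-forward pass over the current sequence
        next_token = int(np.argmax(logits))  # pick the most likely next token
        tokens.append(next_token)            # concatenate it back onto the input...
    return tokens                            # ...and the loop runs the model on the new input again
```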
Yeah, there’s obviously SOME recursion there, but it’s still surprising that such a relatively low-bandwidth recursion can work so well. It’s more akin to me writing down my thoughts and then rereading them to gather my ideas than to the kind of loops I imagine our neurons might have.
That said, who knows, maybe the loops in our brain are superfluous, or only useful for learning feedback purposes, and so a neural network trained by an external system doesn’t need them.
In a sense, that is what is happening when you think in words. It’s called the phonological loop.
I think it seems that way, in your conscious thoughts, but actually there’s a lot more inter-brain-region communication going on simultaneously. I think that without this, you’d see far worse human outputs. And I think once we add something like higher-bandwidth-recursive-thought into language models, we’re going to see a capabilities jump.
It sounds a lot like what we do when we write (as opposed to talk). I recall Kurt Vonnegut once said something like (can’t find cite sry)
‘The reason an author can sound intelligent is because they have the advantage of time. My brain is so slow, people have thought me stupid. But as a writer, I can think at my own speed.’
Think of it this way: how would it feel to chat with someone whose perception of time is 10X slower? Or 100X or 1000X? Or imagine playing chess where your clock was running orders of magnitude faster than your opponent’s.
Pondering this particular recursion, I noticed that things don’t seem to change much from iteration to iteration of these autoregressive dynamics, because we just add one token each time.
The key property of the artificial recurrent architectures which successfully fight the vanishing gradient problem is that a single iteration of the recurrence looks like Identity + epsilon, i.e. X → X + deltaX for a small deltaX on each iteration. See, for example, the 2018 paper Overcoming the vanishing gradient problem in plain recurrent networks, which explains how this is the case for LSTMs and such, and how to achieve it for plain recurrent networks; for a brief explanation, see my review of the first version of this paper, Understanding Recurrent Identity Networks. (A minimal sketch of this kind of update follows at the end of this comment.)
So, I strongly suspect that the same is true of the recurrence happening in Transformers used in the autoregressive mode (because the input changes only mildly from iteration to iteration).
But I don’t know to what extent this is also true for biological recurrent networks. On one hand, our perceptions seem to change smoothly with time, and that seems to be an argument for gradual change of the X → X + deltaX kind in the biological case as well. But we don’t understand the biological case all that well...
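To make the Identity + epsilon structure concrete, here is a minimal sketch (my own illustration, not code from either paper) contrasting a plain recurrent update with an X → X + deltaX style update:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W_h = 0.1 * rng.standard_normal((d, d))
W_x = 0.1 * rng.standard_normal((d, d))

def plain_rnn_step(h, x):
    # plain recurrence: the entire state is re-mixed at every step,
    # which is where vanishing/exploding gradients tend to come from
    return np.tanh(W_h @ h + W_x @ x)

def identity_plus_epsilon_step(h, x, eps=0.1):
    # "Identity + epsilon": X -> X + deltaX with a small deltaX,
    # the structure that LSTM-like and residual-style recurrences rely on
    delta = eps * np.tanh(W_h @ h + W_x @ x)
    return h + delta
```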
I think recurrence is actually quite important for LLMs. Cf. Janus’ Simulator theory, which is now relatively well developed (see e.g. the original Simulators post or the brief notes I took on the recent status of that theory, May-23-2023-status-update). The fact that this is an autoregressive simulation plays the key role.
But we indeed don’t know whether the complexity of biological recurrences, as opposed to the relative simplicity of artificial recurrent networks, matters much...
I’d speculate that our perceptions just seem to change smoothly because we encode second-order (or even third-order) dynamics in our tokens. From what I layman-understand of consciousness, I’d be surprised if it wasn’t discrete.
Can you explain what you mean by second or third order dynamics? That sounds interesting. Do you mean e.g. the order of the differential equation or something else?
I just mean that if we see an object move, we have qualia of position but also of velocity/vector and maybe acceleration. So when we see, for instance, a sphere rolling down an incline, we may have a discrete conscious “frame” where the sphere has a velocity of 0 but a positive acceleration. Even though the next frame is discontinuous with the last one if you look only at position, we perceive them as one smooth sequence, because the predicted end position of the motion in the first frame is continuous with the start point in the second.
This seems to me the opposite of a low-bandwidth recursion. Having access to the entire context window of the previous iteration minus the first token, it should be pretty obvious that most of the relevant information encoded by the values of the nodes in that iteration could in principle be reconstructed, excepting the unlikely event that the first token turns out to be extremely important. And it would be pretty weird if much of that information wasn’t actually reconstructed in some sense in the current iteration. An inefficient way to get information from one iteration to the next, if that is your only goal, but plausibly very high bandwidth.
Which is why asking an LLM to give an answer that starts with “Yes” or “No” and then gives an explanation is the worst possible way to do it.
This was thought-provoking. While I believe what you said is currently true for the LLMs I’ve used, a sufficiently expensive decoding strategy would overcome it. Might be neat to try this for the specific case you describe: ask it a question that it would answer correctly with a good prompt style, but use the bad prompt style (asking it to give an answer that starts with Yes or No), and watch how the ratio of the cumulative probabilities of Yes* and No* sequences changes as you explore the token sequence tree.
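A rough sketch of that experiment, assuming a HuggingFace causal LM. The model name, the prompt, and the Yes/No token lists are placeholders, and looking only at the probability mass on the first token is a simplification of exploring the whole Yes*/No* subtree:

```python
# Compare how much probability mass the model puts on "Yes"-like vs "No"-like first tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model you actually want to probe
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Answer with Yes or No, then explain: is 1007 a prime number?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token logits after the prompt
probs = torch.softmax(logits, dim=-1)

yes_ids = {tok.encode(s)[0] for s in ("Yes", " Yes", "yes", " yes")}
no_ids = {tok.encode(s)[0] for s in ("No", " No", "no", " no")}
p_yes = sum(probs[i].item() for i in yes_ids)
p_no = sum(probs[i].item() for i in no_ids)
print(f"P(Yes-ish) = {p_yes:.4f}, P(No-ish) = {p_no:.4f}, ratio = {p_yes / max(p_no, 1e-12):.2f}")
```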
I’d say it’s pretty low bandwidth compared to the wealth of information that must exist in the intermediate layers. Even just the distribution of logits gets collapsed into a single returned value. You could definitely send back more than just that, but the question is whether it’s workable or if it just adds confusion.
The loops in our neurons can’t be that great, otherwise I wouldn’t benefit so much from writing down my thoughts and then rereading them. :P
(Not a serious disagreement with you, I think I agree overall)
It could also be that LLMs don’t do it the way we do, and simply offer a computationally sufficient platform.
In what sense do they emulate these dynamics?
The formulas and a brief discussion are in Section 3.4 (page 5) of https://arxiv.org/abs/2006.16236
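For anyone who doesn’t want to open the PDF, here is a rough numpy transcription of those Section 3.4 formulas (my own sketch, not the paper’s reference code): the running sums S and Z play the role of the RNN’s hidden state, updated once per position.

```python
import numpy as np

def elu_feature_map(x):
    # the paper's feature map: phi(x) = elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention_rnn(Q, K, V, eps=1e-6):
    """Causal linear attention computed as an RNN over positions.
    Q, K: (T, d_k) arrays; V: (T, d_v) array. Returns a (T, d_v) array."""
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))            # running sum of phi(K_j) V_j^T -- the "hidden state"
    Z = np.zeros(d_k)                   # running sum of phi(K_j)       -- the normalizer state
    out = np.zeros((Q.shape[0], d_v))
    for i in range(Q.shape[0]):
        S += np.outer(phi_K[i], V[i])   # state update: S_i = S_{i-1} + phi(K_i) V_i^T
        Z += phi_K[i]                   # state update: Z_i = Z_{i-1} + phi(K_i)
        out[i] = (phi_Q[i] @ S) / (phi_Q[i] @ Z + eps)
    return out
```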
Thanks!
Further discussion on Twitter of feedforward vs recurrent.
Thanks!
Being an autoregressive language model is like having a strange form of amnesia, where you forget everything you thought about so far as soon as you utter a new word, and you can remember only what you said before.
That paper is one of many claiming some linear attention mechanism that’s as good as full self-attention. In practice they’re all sufficiently worse that nobody uses them except the original authors in the original paper, and usually not even the original authors in subsequent papers.
The one exception is FlashAttention, which is basically just a very fancy fused kernel for the same computation (actually the same, up to numerical error, unlike all these “linear attention” papers).
>It turns out that using Transformers in the autoregressive mode (with output tokens being added back to the input by concatenating the previous input and the new output token, and sending the new version of the input through the model again and again) results in them emulating the dynamics of recurrent neural networks, and that clarifies things a lot...
I’ll bite: could you dumb down the implications of the paper a little bit? What is the difference between a Transformer emulating an RNN and some pre-Transformer RNNs and/or a non-RNN?
My much more novice-level answer to Hofstadter’s intuition would have been: it’s not the feedforward firing, but the gradient-descent training of the model at massive scale (both in data and in computation). But apparently you think that something RNN-like about the model structure itself is important?
I think that gradient descent occurring within the computation itself is super-important (this is, apparently, the key mechanism responsible for the phenomenon of few-shot learning).
And, moreover, massive linear combinations of vectors (“artificial attention”) seem to be super-important (the starting point in this sense was adding this kind of artificial attention mechanism to the RNN architecture in 2014).
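(To unpack “massive linear combinations of vectors”: the 2014 mechanism used an additive score, but in the now-standard dot-product form each output is just a softmax-weighted linear combination of value vectors. A tiny generic sketch, with notation of my own choosing:)

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # each output row is a linear combination of the rows of V,
    # weighted by the softmax of query-key similarity scores
    scores = Q @ K.T / np.sqrt(K.shape[1])   # (n_queries, n_keys)
    return softmax(scores) @ V               # (n_queries, d_v)
```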
>But apparently you think that something RNN-like about the model structure itself is important?

Yes, this might be related to my personal history, which is that I have been focusing on whether one can express algorithms as neural machines, and whether one can meaningfully speak about continuously deformable programs.
And then, for Turing completeness, one would want both an unlimited number of steps and unbounded memory; there has been a rather involved debate about whether RNNs are more like Turing-complete programs or whether, in practice, they are only similar to finite automata. (It’s a long topic, on which there is more to say.)
So, from this viewpoint, a machine with a fixed finite number of steps seems very limited.
But autoregressive Transformers are not machines with a fixed finite number of steps: they merely commit to emitting a token after a fixed number of steps, and they can continue in an unbounded fashion, so in this sense they are very similar to RNNs.
I’ll bite even further, and ask for the concept of “recurrence” itself to be dumbed down. What is “recurrence”, why is it important, and in what sense does e.g. a feedforward network hooked up to something like MCTS not qualify as relevantly “recurrent”?
“Hooked up to something” might make a difference.
(To me one important aspect is whether computation is fundamentally limited to a fixed number of steps vs. having a potentially unbounded loop.
The autoregressive version is an interesting compromise: it’s a fixed number of steps per token, but the answer can unfold in an unbounded fashion.
An interesting tidbit here is that for traditional RNNs it is one loop iteration per input token, but in autoregressive Transformers it is one loop iteration per output token.)
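A schematic way to see that tidbit, with hypothetical rnn_cell and next_token_model callables standing in for the real architectures (names and structure are my own illustration):

```python
def run_rnn(rnn_cell, input_tokens, h0):
    # traditional RNN: one loop iteration (one state update) per INPUT token
    h = h0
    for x in input_tokens:
        h = rnn_cell(h, x)
    return h

def run_autoregressive_transformer(next_token_model, prompt_tokens, n_steps):
    # autoregressive Transformer: one loop iteration (one full forward pass) per OUTPUT token;
    # the growing token sequence itself plays the role of the recurrent state
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        tokens.append(next_token_model(tokens))
    return tokens
```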