Yeah, there’s obviously SOME recursion there, but it’s still surprising that such a relatively low-bandwidth recursion works so well. It’s more akin to me writing down my thoughts and then rereading them to gather my ideas than to the kind of loops I imagine our neurons might have.
That said, who knows, maybe the loops in our brain are superfluous, or only useful for learning feedback purposes, and so a neural network trained by an external system doesn’t need them.
In a sense, that is what is happening when you think in words. It’s called the phonological loop.
I think it seems that way, in your conscious thoughts, but actually there’s a lot more inter-brain-region communication going on simultaneously. I think that without this, you’d see far worse human outputs. And I think once we add something like higher-bandwidth-recursive-thought into language models, we’re going to see a capabilities jump.
It sounds a lot like what we do when we write (as opposed to talk). I recall Kurt Vonnegut once said something like this (can’t find the citation, sorry):
‘The reason an author can sound intelligent is because they have the advantage of time. My brain is so slow, people have thought me stupid. But as a writer, I can think at my own speed.’
Think of it this way: how would it feel to chat with someone whose perception of time is 10X slower? Or 100X, or 1000X? Or imagine playing chess where your clock was running orders of magnitude faster than your opponent’s.
Pondering this particular recursion, I noticed that things don’t seem to change very much from iteration to iteration of these autoregressive dynamics, because we just add one token each time.
The key property of the artificial recurrent architectures which successfully fight the vanishing gradient problem is that a single iteration of recurrence looks like Identity + epsilon (so, X → X + deltaX for a small deltaX on each iteration). See, for example, this 2018 paper, Overcoming the vanishing gradient problem in plain recurrent networks, which explains how this is the case for LSTMs and such, and how to achieve it for plain recurrent networks; for a brief explanation, see my review of the first version of this paper, Understanding Recurrent Identity Networks.
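To make the Identity + epsilon picture concrete, here’s a toy sketch (my own illustration, with made-up weights and an arbitrarily chosen eps; not code from the paper) contrasting a plain recurrent update with a residual-style X → X + deltaX update:

```python
# Toy illustration of the "Identity + epsilon" idea: each recurrent step is
# X -> X + deltaX with a small deltaX, so the per-step change of the state is
# small and the step stays close to the identity map.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(scale=0.3, size=(dim, dim))   # recurrent weights (made-up values)
U = rng.normal(scale=0.3, size=(dim, dim))   # input weights (made-up values)
eps = 0.1                                    # size of the per-step update (arbitrary)

def plain_step(h, x):
    # ordinary recurrent update: the whole state is overwritten each step
    return np.tanh(W @ h + U @ x)

def residual_step(h, x):
    # Identity + epsilon update: the state changes only by a small deltaX
    return h + eps * np.tanh(W @ h + U @ x)

h_plain = np.zeros(dim)
h_res = np.zeros(dim)
for _ in range(50):
    x = rng.normal(size=dim)
    h_plain = plain_step(h_plain, x)   # can jump around from step to step
    h_res = residual_step(h_res, x)    # drifts gradually, small change per step
```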
So, I strongly suspect that this is also the case for the recurrence happening in Transformers used in autoregressive mode (because the input changes only mildly from iteration to iteration).
But I don’t know to what extent this is also true for biological recurrent networks. On the one hand, our perceptions seem to change smoothly with time, which seems to be an argument for gradual change of the X → X + deltaX kind in the biological case as well. But we don’t understand the biological case all that well...
I think recurrence is actually quite important for LLMs. Cf. Janus’ Simulator theory, which is now relatively well developed (see e.g. the original Simulators, or the brief notes I took on the recent status of that theory, May-23-2023-status-update). The fact that this is an autoregressive simulation plays the key role.
But we indeed don’t know how much the difference between the complexity of biological recurrences and the relative simplicity of artificial recurrent networks matters...
I’d speculate that our perceptions just seem to change smoothly because we encode second-order (or even third-order) dynamics in our tokens. From what I layman-understand of consciousness, I’d be surprised if it wasn’t discrete.
Can you explain what you mean by second or third order dynamics? That sounds interesting. Do you mean e.g. the order of the differential equation or something else?
I just mean that if we see an object move, we have qualia of position but also of velocity/vector and maybe acceleration. So when we see, for instance, a marble rolling down an incline, we may have a discrete conscious “frame” where the marble has a velocity of 0 but a positive acceleration. Even though the next frame is discontinuous with the last one if you look only at position, we perceive them as one smooth sequence, because the predicted end position of the motion in the first frame is continuous with the start point in the second.
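A toy way to make that concrete (my own sketch, with made-up numbers): treat each conscious “frame” as a (position, velocity, acceleration) snapshot taken at discrete times; extrapolating one frame forward using its own velocity and acceleration lands on the next frame’s position, which is what would make the discrete sequence read as smooth.

```python
# Toy illustration: discrete "frames" that each carry second-order state
# (position, velocity, acceleration) for a marble on an incline. Numbers are arbitrary.
a = 2.0    # constant acceleration along the incline (arbitrary units)
dt = 0.5   # time between conscious "frames"

def frame(t):
    # state captured at time t: position, velocity, acceleration
    return (0.5 * a * t * t, a * t, a)

def extrapolate(pos, vel, acc, dt):
    # where the frame (pos, vel, acc) predicts the marble will be after dt
    return pos + vel * dt + 0.5 * acc * dt * dt

p0, v0, a0 = frame(0.0)    # first frame: velocity 0 but positive acceleration
p1, _, _ = frame(dt)       # next discrete frame
print(extrapolate(p0, v0, a0, dt), p1)   # both 0.25: the prediction meets the next frame
```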
This seems to me the opposite of a low-bandwidth recursion. Having access to the entire context window of the previous iteration minus the first token, it should be pretty obvious that most of the relevant information encoded by the values of the nodes in that iteration could in principle be reconstructed, excepting the unlikely event that the first token turns out to be extremely important. And it would be pretty weird if much of that information wasn’t actually reconstructed in some sense in the current iteration. An inefficient way to get information from one iteration to the next, if that is your only goal, but plausibly very high bandwidth.
Which is why asking an LLM to give an answer that starts with “Yes” or “No” and then gives an explanation is the worst possible way to do it.
This was thought-provoking. While I believe what you said is currently true for the LLMs I’ve used, a sufficiently expensive decoding strategy would overcome it. It might be neat to try this for the specific case you describe: ask it a question that it would answer correctly with the good prompt style, but use the bad prompt style (asking it to give an answer that starts with Yes or No), and watch how the ratio of the cumulative probabilities of Yes* and No* sequences changes as you explore the token-sequence tree.
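A rough, untested sketch of that experiment (the model name, the example question, and the top_k/depth settings are placeholders I made up):

```python
# Enumerate short continuations of a "start with Yes or No" prompt and compare the
# total probability mass of explored sequences beginning with "Yes" vs "No".
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM should work
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = ("Q: Is the Riemann hypothesis a statement about prime numbers?\n"
          "Answer starting with Yes or No, then explain.\nA:")

def expand(ids, logprob, depth, top_k=5):
    """Yield (token ids, cumulative log-prob) for every explored continuation."""
    if depth == 0:
        yield ids, logprob
        return
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0, -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    for tok_id in torch.topk(logprobs, top_k).indices.tolist():
        yield from expand(ids + [tok_id], logprob + logprobs[tok_id].item(),
                          depth - 1, top_k)

root = tok(prompt)["input_ids"]
yes_mass, no_mass = 0.0, 0.0
for ids, lp in expand(root, 0.0, depth=3):
    continuation = tok.decode(ids[len(root):]).strip()
    if continuation.startswith("Yes"):
        yes_mass += math.exp(lp)
    elif continuation.startswith("No"):
        no_mass += math.exp(lp)
print("Yes* mass:", yes_mass, " No* mass:", no_mass)
```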
I’d say it’s pretty low bandwidth compared to the wealth of information that must exist in the intermediate layers. Even just the distribution of logits gets collapsed into a single returned value. You could definitely send back more than just that, but the question is whether it’s workable or if it just adds confusion.
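To spell out what “collapsed into a single returned value” looks like, here’s a schematic decode loop (with a stand-in for the model, not a real LLM) showing that the only thing fed from one iteration to the next is a single token id:

```python
# Schematic of the standard autoregressive feedback channel: a full distribution
# over the vocabulary is computed at every step, but only one sampled token id is
# appended and fed back. Vocabulary size and token ids here are arbitrary.
import numpy as np

VOCAB = 50_000
rng = np.random.default_rng(0)

def fake_forward(context):
    # Stand-in for a transformer forward pass: returns logits over the vocabulary.
    # In a real model, the intermediate layers hold far more information than this.
    return rng.normal(size=VOCAB)

context = [101, 2009, 318]                        # some token ids (arbitrary)
for _ in range(10):
    logits = fake_forward(context)                # ~50k numbers per step...
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    next_token = int(rng.choice(VOCAB, p=probs))  # ...collapsed to one integer
    context.append(next_token)                    # the only thing the next step sees
```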
The loops in our neurons can’t be that great, otherwise I wouldn’t benefit so much from writing down my thoughts and then rereading them. :P
(Not a serious disagreement with you, I think I agree overall)
It could also be that LLMs don’t do it like we do it and simply offer a computationally sufficient platform.