Pondering this particular recursion, I noticed that things seem to change only slightly from iteration to iteration of this autoregressive dynamics, because we just add one token each time.
The key property of those artificial recurrent architectures which successfully fight the vanishing gradient problem is that a single iteration of recurrence looks like Identity + epsilon (that is, X → X + deltaX for a small deltaX on each iteration). See, for example, the 2018 paper "Overcoming the vanishing gradient problem in plain recurrent networks", which explains how this is the case for LSTMs and the like, and how to achieve it for plain recurrent networks; for a brief explanation, see my review of the first version of that paper, "Understanding Recurrent Identity Networks".
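To make the Identity + epsilon point concrete, here is a minimal numpy sketch (my own illustration, not from the paper): a residual update X → X + epsilon·tanh(WX) has a single-step Jacobian close to the identity, which is exactly the property that keeps gradients from vanishing or exploding over many iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, W, eps=0.01):
    # Residual recurrence: x -> x + eps * tanh(W @ x), i.e. Identity + epsilon.
    return x + eps * np.tanh(W @ x)

n = 8
W = rng.standard_normal((n, n)) / np.sqrt(n)
x = rng.standard_normal(n)

# Iterate the recurrence many times; the state drifts slowly.
for _ in range(1000):
    x = step(x, W)

# Numerically estimate the Jacobian of a single step and measure
# how far it is from the identity matrix.
h = 1e-6
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = h
    J[:, j] = (step(x + e, W) - step(x, W)) / h

# The deviation is of order eps, so backpropagated gradients are
# multiplied by near-identity Jacobians and neither vanish nor explode.
print(np.linalg.norm(J - np.eye(n)))
```

The deviation printed at the end is on the order of eps, since the Jacobian is I + eps · diag(1 − tanh²(Wx)) · W.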
So, I strongly suspect that this is also the case for the recurrence happening in Transformers used in autoregressive mode (because the input changes only mildly from iteration to iteration).
But I don’t know to what extent this is also true for biological recurrent networks. On the one hand, our perceptions seem to change smoothly with time, and that seems to be an argument for gradual change of the X → X + deltaX nature in the biological case as well. But we don’t understand the biological case all that well...
I think recurrence is actually quite important for LLMs. Cf. Janus’ Simulator theory, which is now relatively well developed (see, e.g., the original Simulators, or the brief notes I took on the recent status of that theory, May-23-2023-status-update). The fact that this is an autoregressive simulation plays the key role.
But we indeed don’t know whether the complexity of biological recurrences, versus the relative simplicity of artificial recurrent networks, matters much...
I’d speculate that our perceptions just seem to change smoothly because we encode second-order (or even third-order) dynamics in our tokens. From my layman’s understanding of consciousness, I’d be surprised if it weren’t discrete.
Can you explain what you mean by second- or third-order dynamics? That sounds interesting. Do you mean, e.g., the order of the differential equation, or something else?
I just mean that if we see an object move, we have a quale of position but also of velocity (a vector), and maybe of acceleration. So when we see, for instance, a marble rolling down an incline, we may have a discrete conscious “frame” where the marble has a velocity of 0 but a positive acceleration. The next frame is discontinuous with the last one if we look only at position, yet we perceive them as one smooth sequence, because the end position of the motion predicted in the first frame is continuous with the start point in the second.
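A toy numerical sketch of that idea (the numbers and the constant-acceleration model are purely illustrative assumptions of mine): each discrete frame carries position, velocity, and acceleration, and the state the first frame predicts at its end coincides with the state at the start of the next frame, so the discrete pair reads as one smooth motion.

```python
dt = 0.1  # assumed interval between conscious "frames"

def predict_end(frame, dt):
    # Extrapolate a (position, velocity, acceleration) frame to its end
    # under constant acceleration, i.e. second-order dynamics.
    p, v, a = frame
    return (p + v * dt + 0.5 * a * dt**2, v + a * dt, a)

# Marble momentarily at rest on the incline: velocity 0, positive acceleration.
frame1 = (0.0, 0.0, 2.0)

# The end state predicted by frame 1 is exactly where frame 2 begins,
# so the two discrete frames cohere into one continuous motion.
frame2 = predict_end(frame1, dt)
print(frame2)  # approximately (0.01, 0.2, 2.0)
```

Looking only at position, frame 2 starts somewhere frame 1 never "was"; it is the velocity and acceleration carried inside each frame that stitch them together.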