>It turns out that using Transformers in the autoregressive mode (with output tokens being added back to the input by concatenating the previous input and the new output token, and sending the new versions of the input through the model again and again) results in them emulating dynamics of recurrent neural networks, and that clarifies things a lot...
I’ll bite: could you dumb down the implications of the paper a little bit? What is the difference between a Transformer emulating an RNN versus pre-Transformer RNNs and/or non-RNNs?
My much more novice-level answer to Hofstadter’s intuition would have been: it’s not the feedforward firing, but the gradient descent training of the model at massive scale (both in data and in computation). But apparently you think that something RNN-like about the model structure itself is important?
I think that gradient descent within the computation is super-important (this is, apparently, the key mechanism responsible for the phenomenon of few-shot learning).
And, moreover, massive linear combinations of vectors (“artificial attention”) seem to be super-important (the starting point in this sense was adding this kind of artificial attention mechanism to the RNN architecture in 2014).
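(To make “massive linear combinations of vectors” concrete, here is a minimal numpy sketch of the attention pattern I have in mind; the shapes and random values are purely illustrative and not taken from any particular model.)

```python
import numpy as np

# Minimal sketch of attention as "massive linear combinations of vectors".
# All shapes and values below are made up for illustration only.
rng = np.random.default_rng(0)
n_tokens, d = 5, 8                               # 5 token positions, 8-dim vectors
Q = rng.normal(size=(n_tokens, d))               # queries
K = rng.normal(size=(n_tokens, d))               # keys
V = rng.normal(size=(n_tokens, d))               # values

scores = Q @ K.T / np.sqrt(d)                    # similarity of each query to each key
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

# Each output vector is a weighted linear combination of all the value vectors.
out = weights @ V                                # shape (n_tokens, d)
```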
>But apparently you think that something RNN-like about the model structure itself is important?
Yes, this might be related to my personal history, which is that I have been focusing on whether one can express algorithms as neural machines, and whether one can meaningfully speak about continuously deformable programs.
And, then, for Turing completeness one would want both an unlimited number of steps and unbounded memory, and there has been a rather involved debate about whether RNNs are more like Turing-complete programs or whether, in practice, they are only similar to finite automata. (It’s a long topic, on which there is more to say.)
So, from this viewpoint, a machine with a fixed finite number of steps seems very limited.
But autoregressive Transformers are not machines with a fixed finite number of steps: they commit to emitting a token after a fixed number of steps, but they can keep going in an unbounded fashion, so in this sense they are very similar to RNNs.
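(A toy sketch of the loop shape I mean, with a dummy next_token function standing in for the real fixed-depth Transformer forward pass; the point is only that a fixed amount of computation per emitted token sits inside a loop that can run for an unbounded number of tokens.)

```python
from typing import List

def next_token(context: List[int]) -> int:
    """Stand-in for a Transformer forward pass: a fixed number of layers,
    i.e. a fixed number of steps, to produce the next token."""
    return (sum(context) + 1) % 100      # dummy rule, purely for illustration

def generate(prompt: List[int], stop_token: int = 0, max_len: int = 10_000) -> List[int]:
    tokens = list(prompt)
    # The loop itself is unbounded (up to an external cutoff): the model keeps
    # consuming its own output, which is what makes the dynamics RNN-like.
    while len(tokens) < max_len:
        t = next_token(tokens)           # fixed-depth computation per token
        tokens.append(t)                 # output is concatenated back onto the input
        if t == stop_token:
            break
    return tokens

print(generate([3, 7, 2], max_len=20))
```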
I’ll bite even further, and ask for the concept of “recurrence” itself to be dumbed down. What is “recurrence”, why is it important, and in what sense does e.g. a feedforward network hooked up to something like MCTS not qualify as relevantly “recurrent”?
“Hooked up to something” might make a difference.
(To me one important aspect is whether computation is fundamentally limited to a fixed number of steps vs. having a potentially unbounded loop.
The autoregressive version is an interesting compromise: it’s a fixed number of steps per token, but the answer can unfold in an unbounded fashion.
An interesting tidbit here is that for traditional RNNs it is one loop iteration per input token, whereas in autoregressive Transformers it is one loop iteration per output token.)
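(Sketching that contrast in toy code, with made-up stand-ins for the actual networks: the RNN loops once per input token, while the autoregressive decoder loops once per output token.)

```python
from typing import List

# Toy stand-ins; the real step/forward functions would be learned networks.

def rnn_step(state: float, x: int) -> float:
    """One recurrence step: new state from the old state and the current input token."""
    return 0.5 * state + x

def run_rnn(inputs: List[int]) -> float:
    state = 0.0
    for x in inputs:                 # one loop iteration per INPUT token
        state = rnn_step(state, x)
    return state

def transformer_forward(context: List[int]) -> int:
    """Stand-in for a full fixed-depth forward pass over the whole context."""
    return (len(context) * 7) % 5    # dummy rule, purely for illustration

def run_autoregressive(prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):           # one loop iteration per OUTPUT token
        tokens.append(transformer_forward(tokens))
    return tokens

print(run_rnn([1, 2, 3]))
print(run_autoregressive([1, 2, 3], n_new=4))
```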