I think that gradient descent taking place within the computation itself is super-important (this is, apparently, the key mechanism responsible for the phenomenon of few-shot learning).
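To make the "few-shot learning as gradient descent" reading concrete, here is a toy sketch in the spirit of work such as von Oswald et al., "Transformers learn in-context by gradient descent": one gradient step on a linear-regression loss, starting from zero weights, yields exactly the prediction of an unnormalized linear attention layer over the in-context examples. All variable names below are illustrative assumptions, not anything taken from this conversation.

```python
# Hedged toy: one gradient-descent step over in-context examples coincides with
# an unnormalized linear attention readout. Names and shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(2, 3))   # ground-truth linear map to be "learned" in context
X = rng.normal(size=(8, 3))        # in-context example inputs x_i
Y = X @ W_true.T                   # in-context example targets y_i = W_true x_i
x_q = rng.normal(size=3)           # query input

eta, N = 0.5, len(X)

# One gradient step on L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2, starting from W = 0,
# gives W_1 = (eta / N) * sum_i y_i x_i^T:
W_1 = (eta / N) * Y.T @ X
pred_gd = W_1 @ x_q

# Unnormalized linear "attention" over the examples: weight each value y_i by <x_i, x_q>:
pred_attn = (eta / N) * Y.T @ (X @ x_q)

print(np.allclose(pred_gd, pred_attn))  # True: the two predictions coincide exactly
```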
And, moreover, massive linear combinations of vectors (“artificial attention”) seem to be super-important (the starting point here was the addition of such an attention mechanism to the RNN architecture in 2014).
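As a reminder of what "massive linear combinations of vectors" means in practice, here is a minimal sketch of softmax attention, not tied to any particular library: each output vector is a convex combination of the value vectors, with weights given by query-key similarity. Shapes and names are illustrative assumptions.

```python
# Minimal sketch of attention as a weighted linear combination of value vectors.
import numpy as np

def attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d_v) -> output of shape (n, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1
    return weights @ V                                  # linear combination of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 8))    # 6 value vectors
out = attention(Q, K, V)       # each of the 4 outputs is a convex combination of the 6 values
print(out.shape)               # (4, 8)
```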
But apparently you think that something RNN-like about the model structure itself is important?
Yes, this might be related to my personal history: I have been focusing on whether one can express algorithms as neural machines, and whether one can meaningfully speak about continuously deformable programs.
And then, for Turing completeness, one would want both an unlimited number of steps and unbounded memory, and there has been a rather involved debate on whether RNNs are more like Turing-complete programs or whether, in practice, they are only similar to finite automata. (It’s a long topic, on which there is more to say.)
So, from this viewpoint, a machine with a fixed finite number of steps seems very limited.
But autoregressive Transformers are not machines with a fixed finite number of steps: they merely commit to emitting a token after a fixed number of steps, yet they can continue in an unbounded fashion, so in this sense they are very similar to RNNs.
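A toy sketch of that point, with a hypothetical `step` function standing in for a Transformer stack: the work per emitted token is fixed (one pass through a fixed number of layers), but the outer generation loop is unbounded, and the growing sequence plays the role of the recurrent state.

```python
# Sketch of autoregressive generation: fixed work per token, unbounded outer loop.
from typing import Callable, List

def generate(step: Callable[[List[int]], int],
             prompt: List[int],
             stop_token: int,
             max_tokens: int = 1000) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_tokens):        # safety cap only; conceptually the loop is unbounded
        next_token = step(tokens)      # one pass through a fixed stack of layers
        tokens.append(next_token)      # the growing context acts as the recurrent state
        if next_token == stop_token:
            break
    return tokens

# Trivial stand-in "model": predict the previous token plus one, stop at 5.
print(generate(lambda ts: ts[-1] + 1, prompt=[0], stop_token=5))   # [0, 1, 2, 3, 4, 5]
```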