Up to a certain limit; Kaplan covers this a bit in the talk, with reference to the RNN scaling curves in Kaplan et al 2020: RNNs scale similarly to Transformers, with a worse compute constant, but they make poor use of context. After a few hundred tokens, the history has vanished. This is the usual RNN problem: in theory the history is unlimited, but as has long been observed, it is de facto limited to a few hundred tokens, whereas Transformers make effective use of history from thousands of timesteps before.
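To make "scale similarly, with a worse constant" concrete, here is a minimal sketch of two power-law curves of the Kaplan et al 2020 form L(C) = (C_c/C)^α, sharing the exponent and differing only in the constant (so they plot as parallel lines on log-log axes); the specific numbers are illustrative, not fit to the paper's data:

```python
# Illustrative only: same power-law exponent, different constant.
ALPHA = 0.05            # shared scaling exponent (made-up, roughly Kaplan-sized)
C_C_TRANSFORMER = 3e8   # hypothetical compute constant for Transformers
C_C_RNN = 3e9           # hypothetical worse (10x larger) constant for RNNs

def loss(compute, c_c, alpha=ALPHA):
    """Power-law scaling curve L(C) = (c_c / C)**alpha: same slope, shifted offset."""
    return (c_c / compute) ** alpha

for c in [1.0, 1e2, 1e4]:   # compute budgets, arbitrary units
    print(f"C={c:.0e}  Transformer L={loss(c, C_C_TRANSFORMER):.2f}"
          f"  RNN L={loss(c, C_C_RNN):.2f}")

# Because the exponent is shared, the gap is a constant compute multiplier:
# the RNN needs ~10x the compute here to reach the same loss.
print(loss(10.0, C_C_RNN), "==", loss(1.0, C_C_TRANSFORMER))
```

The practical reading of "worse constant, same exponent" is that the RNN can still reach any given loss; it just needs a fixed multiple more compute to get there.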
So I interpret this as meaning that NN architectures are all ‘universal’ in a sense (they all scale similarly, and I’m told that CNNs do too), but what makes Transformers superior is that they are more compute-efficient on current hardware and they optimize much better: as ‘unrolled RNNs’, they are equivalently powerful, but they have much more direct access to the history (pace residual layers), which makes credit assignment/learning much easier than for RNNs, which must squeeze everything into a hidden state rather than recalculating a function with the entire raw history available.
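As a toy illustration of the 'squeeze it into a hidden state' vs 'entire raw history available' contrast, here is a NumPy sketch (untrained random weights, purely structural, dimensions made up) comparing a recurrent update, which forces every past token through one fixed-size vector, with a single attention step, which reads from all past positions in one hop:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                      # sequence length and width (illustrative)
history = rng.normal(size=(T, d))   # raw token representations

# RNN view: everything funnels through h (shape (d,)), no matter how long T is.
W_h = rng.normal(size=(d, d)) / np.sqrt(d)
W_x = rng.normal(size=(d, d)) / np.sqrt(d)
h = np.zeros(d)
for x in history:
    h = np.tanh(W_h @ h + W_x @ x)  # old information must survive every squash

# Transformer view: the last position attends directly to all T raw inputs.
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
q = history[-1] @ W_q
k, v = history @ W_k, history @ W_v
weights = np.exp(q @ k.T / np.sqrt(d))
weights /= weights.sum()
attended = weights @ v              # a direct, one-hop path to every past token

print(h.shape, attended.shape)      # both (64,), but built very differently
```

Nothing here is trained; the point is the shape of the computation graph. The attention output has a one-edge path back to every token in the history, while the RNN hidden state only reaches old tokens through a chain of tanh squashes, which is exactly where the credit-assignment pain comes from.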
(Lots of potential followup questions here: can you usefully distill a trained Transformer into a parameter & compute-efficient RNN? Can that provide a training signal to meta-learn RNN algorithms which do fix their history/optimization problems? If Transformers work so well because of raw long-range access to history, are RNNs just missing some ‘external memory’ module which would serve the same purpose? Do RNNs likewise have general scaling curves over all modalities? Where do Mixture-of-Experts flatline and what is the MoE scaling exponent?)
I think it also mentioned that it isn’t architecture-specific either: bigger LSTMs scale similarly to bigger Transformers; they are just worse. IIRC.