Does this mean that architectures hugely superior to Transformers (as measured by benchmarking them with the same compute and data) don’t exist, or that Transformers, RNNs, and everything else are all close enough cousins?
The latter. I am quite certain that hugely superior architectures exist, in the sense of both superior scaling exponents and superior constants (though I am less sure that current architectures are hugely, strictly dominated on both at once), and these are exactly the sorts of things the whole hierarchy of meta-learning is about learning/locating. But the current crop of architectures are all pretty much alike in being big blobs of feedforward architecture whose inductive biases wash out at what are, in absolute terms, quite small scales (so small that we can reach them right now with budgets of mere millions to billions of dollars), so long as they satisfy the basic desiderata of passing signals/gradients through themselves without blowing up or flatlining. DL archs fail in many different ways, but the successes are all alike: ‘the AI Karenina principle’. Thus, the retrodiction that deep (>4-layer) stacks of fully-connected layers just needed normalization to compete; my long-standing assertion that Transformers are not special fairy-dust, that self-attention is not magical, and that Transformers are basically better-optimized RNNs; and my (recently vindicated) prediction that, despite the entire field abandoning them for the past 3-4 years because they had been ‘proven unstable’, GANs would nevertheless work well once anyone bothered to scale them up.
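To make the ‘exponents vs constants’ distinction concrete, here is a minimal numerical sketch, with made-up coefficients not fitted to any real models: if architectures follow power-law scaling loss(C) = a·C^(−b) in compute C, a better constant a gives only a fixed multiplicative edge whose absolute size shrinks as both curves fall, while a better exponent b is an advantage that keeps widening with scale; that is the sense in which inductive-bias wins at today’s budgets wash out, and why a genuinely superior exponent would matter so much more.

```python
# Minimal sketch with made-up numbers (not fitted to any real models): compare
# power-law scaling curves loss(C) = a * C**(-b).  Architecture B improves only
# the constant a over baseline A; architecture C improves only the exponent b.
import numpy as np

def loss(compute, a, b):
    """Hypothetical power-law scaling curve: loss = a * compute^(-b)."""
    return a * compute ** (-b)

compute = np.logspace(15, 24, num=10)    # compute budgets (FLOPs), small to large
arch_A = loss(compute, a=1e3, b=0.050)   # baseline architecture
arch_B = loss(compute, a=5e2, b=0.050)   # better constant, same exponent
arch_C = loss(compute, a=1e3, b=0.055)   # same constant, better exponent

for C, lA, lB, lC in zip(compute, arch_A, arch_B, arch_C):
    print(f"C={C:.0e}  A={lA:7.2f}  B={lB:7.2f}  C'={lC:7.2f}  "
          f"B/A={lB/lA:.2f}  C'/A={lC/lA:.2f}")
# B/A stays fixed at 0.50 at every scale (a constant-factor edge whose absolute
# size shrinks as both curves fall), while C'/A keeps dropping with compute:
# the 'superior exponent' case, which is what changes the asymptotic ranking.
```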
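And to illustrate the ‘passing signals/gradients without blowing up/flatlining’ desideratum behind the normalization retrodiction, a toy forward-pass sketch (my own illustration; the depth, width, and init scale are arbitrary, and a LayerNorm-style standardization stands in for normalization generally):

```python
# Toy sketch (hypothetical sizes/init): activation magnitude through a deep
# fully-connected ReLU stack, with and without per-layer standardization.
# With a slightly mis-scaled init, the raw stack's signal explodes with depth;
# a LayerNorm-style rescaling after each layer keeps it in a trainable range.
import numpy as np

depth, width = 32, 256

def forward(x, normalize, seed=0):
    rng = np.random.default_rng(seed)  # same weights for both runs
    for _ in range(depth):
        W = rng.normal(0.0, 2.0 / np.sqrt(width), size=(width, width))  # mis-scaled init
        x = np.maximum(W @ x, 0.0)                                      # fully-connected + ReLU
        if normalize:
            x = (x - x.mean()) / (x.std() + 1e-6)                       # LayerNorm-style step
    return x

x0 = np.random.default_rng(1).normal(size=width)
print("no normalization  :", np.abs(forward(x0, normalize=False)).mean())  # blows up with depth
print("with normalization:", np.abs(forward(x0, normalize=True)).mean())   # stays O(1)
```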