The Sasha Rush/Jonathan Frankle wager (https://www.isattentionallyouneed.com/) is extremely unlikely to resolve against the transformer by 2027, but not because no other architecture could be better; it's because the bet only asks whether a transformer-like model will be state of the art. I think it is more likely that transformers are a proper subset of a broader class of generalized token/sequence mixers. Even SSMs, when unrolled into a cumulative sum, are a special case of linear attention.
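To make that last claim concrete, here is a minimal sketch (NumPy, single head, unnormalized, causal; all names are illustrative): an SSM-style recurrence with an identity state transition, S_t = S_{t-1} + v_t k_t^T, unrolled into a cumulative sum, produces exactly the outputs of causal linear attention.

```python
import numpy as np

T, d = 8, 4                        # sequence length, head dimension
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel "linear attention" view: o_t = sum_{s<=t} (q_t . k_s) v_s
attn = np.tril(q @ k.T)            # causal mask, no softmax, no normalization
out_parallel = attn @ v

# Recurrent "SSM" view: carry a d x d state, accumulate rank-1 updates v_t k_t^T
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(v[t], k[t])   # cumulative sum = identity-transition SSM state
    out_recurrent[t] = S @ q[t]    # read the state out with the current query

assert np.allclose(out_parallel, out_recurrent)
```

A decaying transition (S_t = a * S_{t-1} + v_t k_t^T) turns the same recurrence into a gated or weighted cumulative sum, which is the form most recent SSM-style mixers take.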
Personally, even though this is an unpopular opinion, I believe the transformer architecture will be succeeded by a method that is deeply recurrent yet still transformer-like.
I changed my mind on this after seeing the recent literature on test-time-training linear attention.