The Sasha Rush/Jonathan Frankle wager (https://www.isattentionallyouneed.com/) is extremely unlikely to resolve against the transformer by 2027, but not because no other architecture could be better; it's because the bet only asks whether a transformer-like model will be state of the art. I think it is more likely that transformers are a proper subset of a broader class of generalized token/sequence mixers. Even SSMs, when unrolled into a cumulative sum, are a special case of linear attention.
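To make that last claim concrete, here is a minimal sketch (NumPy, single head, unnormalized, causal; all names are illustrative): an SSM-style recurrence with an identity state transition, S_t = S_{t-1} + v_t k_t^T, unrolled into a cumulative sum, produces exactly the outputs of causal linear attention.

```python
import numpy as np

T, d = 8, 4                        # sequence length, head dimension
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel "linear attention" view: o_t = sum_{s<=t} (q_t . k_s) v_s
attn = np.tril(q @ k.T)            # causal mask, no softmax, no normalization
out_parallel = attn @ v

# Recurrent "SSM" view: carry a d x d state, accumulate rank-1 updates v_t k_t^T
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(v[t], k[t])   # cumulative sum = identity-transition SSM state
    out_recurrent[t] = S @ q[t]    # read the state out with the current query

assert np.allclose(out_parallel, out_recurrent)
```

A decaying transition (S_t = a * S_{t-1} + v_t k_t^T) turns the same recurrence into a gated or weighted cumulative sum, which is the form most recent SSM-style mixers take.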
Personally, even though this is an unpopular opinion, I believe the transformer architecture will be succeeded by a method that is deeply recurrent yet still transformer-like.
I changed my mind on this after seeing the recent literature on test-time-training linear attention.