Like transformers, SSMs such as Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:
‘We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.’
Surely fundamentally at odds? You can’t spend a while thinking without spending a while thinking.
Of course, the lunch might still be very cheap if you only need to spend a while thinking a fraction of the time, or whatever.
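A minimal sketch of the state-tracking task behind these results (the Illusion-of-State paper uses word problems over the permutation group S5): tracking the state means composing a sequence of permutations. The function names here are illustrative, not from either paper. Composition is associative, so a divide-and-conquer tree computes the same answer in logarithmic depth rather than linear, which is the parallelizability side of the tradeoff; the expressivity side is that a fixed-depth model cannot keep doing this as the sequence grows.

```python
import random

def compose(p, q):
    # (p ∘ q)(i) = p(q(i)): apply q first, then p.
    return tuple(p[q[i]] for i in range(len(q)))

def sequential_state(perms):
    # O(n) sequential scan: the step-by-step state tracking
    # a recurrent model would do.
    state = tuple(range(5))  # identity permutation in S5
    for p in perms:
        state = compose(p, state)
    return state

def tree_reduce(perms):
    # O(log n) depth divide-and-conquer over the same associative op:
    # parallelizable, but its depth still grows with sequence length,
    # so a fixed-depth network cannot emulate it for all n.
    if len(perms) == 1:
        return perms[0]
    mid = len(perms) // 2
    # Later permutations compose on the left (applied after earlier ones).
    return compose(tree_reduce(perms[mid:]), tree_reduce(perms[:mid]))

perms = [tuple(random.sample(range(5), 5)) for _ in range(16)]
assert sequential_state(perms) == tree_reduce(perms)
```

Both routes give the same state, which is exactly why the question of whether you can get the log-depth route with SSM-like learning dynamics is interesting rather than obviously closed.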