that paper is one of many claiming some linear attention mechanism that's as good as full self-attention. in practice they're all enough worse that nobody uses them except the original authors in the original paper, and usually not even the original authors in their subsequent papers.
the one exception is flash attention, which is basically just a very fancy fused kernel for the same computation (actually the same, up to numerical error, unlike all these “linear attention” papers).
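to make the distinction concrete, here's a minimal sketch (assuming PyTorch 2.x, where F.scaled_dot_product_attention exists and dispatches to a FlashAttention-style fused kernel on supported GPUs, falling back to a plain math backend on CPU). the linear-attention variant uses the elu(x)+1 feature map as one common choice; the shapes and tolerances are just illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 128, 64                     # batch, heads, seq len, head dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# naive O(T^2) softmax attention
scores = q @ k.transpose(-2, -1) / D**0.5
naive = torch.softmax(scores, dim=-1) @ v

# fused kernel path: same math, just computed blockwise without materializing
# the T x T attention matrix
fused = F.scaled_dot_product_attention(q, k, v)

# "linear attention": replace softmax(qk^T) with phi(q) phi(k)^T so the
# (k, v) product can be computed first -- a genuinely different function
phi = lambda x: F.elu(x) + 1
qp, kp = phi(q), phi(k)
num = qp @ (kp.transpose(-2, -1) @ v)                       # O(T * D^2)
den = qp @ kp.sum(dim=-2, keepdim=True).transpose(-2, -1)   # normalizer
linear = num / den

print(torch.allclose(naive, fused, atol=1e-5))   # True: identical up to numerical error
print(torch.allclose(naive, linear, atol=1e-5))  # False: an approximation, not the same computation
```

the point of the sketch: flash attention changes how the softmax attention is scheduled on hardware, while linear attention changes what function is being computed.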