(See this comment for more context.) The point is to make inference cheaper in operations and energy, which seems crucial primarily for local inference on smartphones, but in principle it might also make datacenter inference cheaper in the long run, if a new generation of hardware specialized for inference adapts to this development. The bulk of the improvement (without significant degradation of performance) was already demonstrated for transformers with ternary BitNet (see also this “Code and FAQ” follow-up report with better data on degradation of performance; only the “download raw file” button works for me).
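To make the ternary point concrete, here's a toy numpy sketch of the idea as I understand it (absmean-style quantization to {-1, 0, +1} with a single per-tensor scale; the function names and the toy sizes are mine, not from the BitNet code):

```python
import numpy as np

def ternary_quantize(W, eps=1e-6):
    """Absmean-style quantization: map weights to {-1, 0, +1} times one
    per-tensor scale, so matmuls reduce to signed additions plus a rescale."""
    scale = np.mean(np.abs(W)) + eps                  # per-tensor absmean scale
    W_ternary = np.clip(np.round(W / scale), -1, 1)   # values in {-1, 0, +1}
    return W_ternary.astype(np.int8), scale

def ternary_matmul(x, W_ternary, scale):
    """Multiply-free in principle: each output is a signed sum of selected
    inputs, rescaled once at the end (emulated here with an ordinary matmul)."""
    return (x @ W_ternary.astype(x.dtype)) * scale

# toy usage: compare against the full-precision matmul
W = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(1, 64).astype(np.float32)
Wq, s = ternary_quantize(W)
print(np.abs(x @ W - ternary_matmul(x, Wq, s)).mean())  # quantization error
```

(In BitNet the quantization is trained into the model rather than applied post hoc like this, which is why performance doesn't degrade much; the sketch only shows why the arithmetic gets cheap.)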
What the paper you link attempts is to extract even more improvement by getting rid of multiplication in attention as well, so they explore alternative ways of implementing attention, since the general technique doesn’t work with standard attention out of the box. But attention has long evaded attempts to approximate it without degradation of performance (usually when trying to enable long context); the best general approach seems to be to hybridize an efficient attention alternative with precise sliding window (local) attention, by including one or the other in different layers (a rough sketch of the pattern follows below). They reference the Griffin paper, but don’t seem to engage with this point on hybridization, so it’s something for future work to pick up.
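For illustration, here is what the hybridization pattern looks like in a toy sketch (the local/efficient ratio and the mask construction are illustrative choices of mine, not Griffin's actual schedule):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal sliding-window (local) attention mask:
    token i attends to tokens in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def layer_schedule(n_layers, local_every=2):
    """Interleave precise local attention with an efficient attention
    alternative (linear/recurrent) across layers; the ratio is a tunable
    design choice, here simply every other layer."""
    return ["local_attention" if k % local_every == 0 else "efficient_alternative"
            for k in range(n_layers)]

print(layer_schedule(6))
print(sliding_window_mask(5, window=3).astype(int))
```

The point of the pattern is that the local-attention layers keep precise short-range retrieval, while the efficient layers carry long-range mixing cheaply, which is what pure attention-replacements tend to lose.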
Thanks for the context, I really appreciate it! :)