Binary and ternary quantization is 2015-2016 tech, though. The value of the recent ternary BitNet result lies in demonstrating that it works well for transformers (which wasn’t nearly as much the case for binary BitNet).
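To make the ternary point concrete, here is a minimal numpy sketch of why {-1, 0, +1} weights remove multiplication from the linear layers. It assumes an absmean-style per-tensor scale, roughly in the spirit of what the BitNet b1.58 report describes; the function names `absmean_ternarize` and `ternary_matmul` are illustrative, not from any of these papers.

```python
import numpy as np

def absmean_ternarize(W, eps=1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with one shared scale.
    Assumption: absmean-style scaling, sketched loosely after BitNet b1.58."""
    scale = np.mean(np.abs(W)) + eps
    W_ternary = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_ternary, scale

def ternary_matmul(x, W_ternary, scale):
    """With weights restricted to {-1, 0, +1}, each dot product degenerates
    into additions, subtractions, and skips -- no weight multiplications,
    just one multiplication by the shared scale per output."""
    out = np.zeros((x.shape[0], W_ternary.shape[1]), dtype=x.dtype)
    for j in range(W_ternary.shape[1]):
        col = W_ternary[:, j]
        out[:, j] = x[:, col == 1].sum(axis=1) - x[:, col == -1].sum(axis=1)
    return out * scale

# Sanity check: the add/subtract form matches a dense matmul on the
# quantized weights.
x = np.random.randn(4, 16).astype(np.float32)
W = np.random.randn(16, 8).astype(np.float32)
Wt, s = absmean_ternarize(W)
assert np.allclose(ternary_matmul(x, Wt, s),
                   x @ (Wt.astype(np.float32) * s), atol=1e-5)
```

This is only the weight-matrix side of the story; the harder part, and what the recent paper goes after, is doing something similar to attention itself.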
The immediate practical value of this recent paper is more elusive: the authors go further by exorcising multiplication from attention as well, which is a step in an important direction, but the results they report don’t seem sufficient to overcome the prior that this is very hard to do successfully. Only Mamba has come close to matching attention as a pure alternative (and without the constraint of avoiding multiplication), and even it has issues unless hybridized with (local) attention; such hybrids also work well with other attention alternatives, sometimes better than vanilla attention on its own.