The fact that hybridisation works better than pure architectures (by which I mean architectures consisting of a single core type of block) is exactly the point that Nathan Labenz makes in the podcast and that I repeat at the beginning of the post.
(Ah, I actually forgot to repeat this point, apart from noting that Doyle predicted this in his architecture theory.)
Experimental results are a more legible and reliable form of evidence than philosophy-level arguments. When they’re available, they’re the reason to start paying attention to the philosophy, in a way the philosophy itself isn’t.
Incidentally, hybrid Mamba/MHA doesn’t work significantly better than pure Mamba, at least the way it’s reported in appendix E.2.2 of the paper (beware left/right confusion in Figure 9). The effect is much more visible with Hyena, but the StripedHyena post also goes into more detail on studying hybridization, so it’s unclear whether this was studied as thoroughly for Mamba.