The induction heads extrapolation graph is a bit cherry-picked/misleading IMO because it's still the case that Mamba can't copy an arbitrary amount of text. (It can copy some fixed finite number of tokens if it is predictable that those tokens will be copied over other tokens.)
E.g., if you have N tokens repeated twice, Mamba will fail to get perfect loss on the second repetition for sufficiently large N. To see this, note that the total hidden state is bounded, so for large enough N the repeated text contains more information than the state can hold. This isn't true for transformers.
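To make the counting argument concrete, here's a back-of-envelope sketch in Python. The vocab size, state dimension, and bits-per-dimension are made-up placeholder numbers, not figures from Mamba or from the graph; the point is only that the state capacity is a constant while the copy requirement grows linearly in N:

```python
import math

# Reproducing N arbitrary tokens exactly requires carrying roughly
# N * log2(vocab_size) bits across the gap between the two repetitions.
def bits_needed_to_copy(n_tokens: int, vocab_size: int) -> float:
    return n_tokens * math.log2(vocab_size)

# A recurrent model's state is a fixed-size vector, so its capacity is
# bounded by state_dim * bits_per_dim, independent of sequence length.
# (bits_per_dim = 16 is a crude assumption, e.g. fp16 storage.)
def state_capacity_bits(state_dim: int, bits_per_dim: float = 16.0) -> float:
    return state_dim * bits_per_dim

vocab_size = 50_000       # assumed GPT-style vocabulary size
state_dim = 16 * 2048     # assumed size of the recurrent state
capacity = state_capacity_bits(state_dim)

# Smallest N whose copy requirement exceeds the state capacity.
n = 1
while bits_needed_to_copy(n, vocab_size) <= capacity:
    n += 1
print(f"state capacity ~{capacity:,.0f} bits; "
      f"exact copying must break down before N = {n:,} tokens")
```

Whatever numbers you plug in, some finite N exceeds the bound, whereas a transformer's KV cache grows with the sequence and so never hits this particular wall.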
It's unclear how much of an obstacle this is in practice, but it hints at ways in which transformers might be fundamentally more capacity-efficient. Note that this issue also applies to some sparse attention mechanisms like sliding-window attention, but any attention mechanism that stores state for the entire sequence should be able to avoid it.
(This is a well known result, though I forget the current state of empirical results.)
Seems relevant—RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval:
‘Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT, hence closing the representation gap with Transformers.’
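For intuition about the "adding a single Transformer layer" fix, here's a minimal sketch of what such a hybrid might look like. This is an assumed illustration (a GRU stack with one causal self-attention layer, arbitrary hyperparameters), not the paper's actual construction:

```python
import torch
import torch.nn as nn

class HybridRNN(nn.Module):
    """A recurrent stack with a single full-attention layer inserted,
    giving the model one mechanism whose memory grows with the context
    instead of being a fixed-size vector."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_rnn_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn_pre = nn.GRU(d_model, d_model, num_layers=n_rnn_layers // 2,
                              batch_first=True)
        # The single attention layer: it can retrieve from anywhere in the
        # context, which is exactly the capability the RNN layers lack.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.rnn_post = nn.GRU(d_model, d_model, num_layers=n_rnn_layers // 2,
                               batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                      # (batch, seq, d_model)
        x, _ = self.rnn_pre(x)
        # Causal mask: position i may only attend to positions <= i.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=tokens.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x, _ = self.rnn_post(x + attn_out)
        return self.head(x)                         # next-token logits

# Usage sketch:
# logits = HybridRNN(vocab_size=1000)(torch.randint(0, 1000, (2, 64)))
```

The design point is that everything except the one attention layer still runs with constant per-token state, so most of the efficiency motivation for RNN-style models is preserved while the in-context retrieval bottleneck from the quote is addressed.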