As Transformers scale, they become ever more just ‘MLP archs with some self-attention’, and yet they continue to work well and scale as usual. This should trouble people who believe self-attention is magic!...I’d also point out the scaling studies of Nie et al 2021, where self-attention outperforms MLPs… but not by much, decreasing with scale, and even a tiny amount of self-attention essentially closes the gap. (And Liu et al 2021.)

In the other direction, MLP layers can be trained to imitate realistic self-attention layers with high accuracy: “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers”, Bozic et al 2023:
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks.
We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation.
Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these “attentionless Transformers” to rival the performance of the original architecture.
Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
The initial point of our procedure is the training of the vanilla Transformer model, which consists of six encoders and six decoders. To reduce training times and make testing faster, we reduced the embedding size from 512 to 128. These changes did not drop the overall score too much below the original BLEU score but resulted in significantly lower computational power demands. This Transformer model was then used as a teacher model for training the feed-forward networks.
...
We also adopted a fixed upper bound to sentence length, which we set to 50.
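To make the setup concrete, here is a minimal PyTorch sketch of what the distillation step described above could look like, assuming a frozen 6+6-layer teacher with d_model = 128, sequences padded to the length-50 cap, and a shallow feed-forward network trained with an MSE loss to reproduce the output of one teacher self-attention layer; the module names, hidden width, and exact objective are illustrative guesses, not the authors’ code.

```python
import torch
import torch.nn as nn

D_MODEL, MAX_LEN = 128, 50   # reduced embedding size and the fixed sentence-length cap

# Teacher: a vanilla 6+6-layer Transformer (trained on IWSLT2017 in the paper;
# left untrained here, since this is only a shape/mechanics sketch).
teacher = nn.Transformer(d_model=D_MODEL, nhead=8,
                         num_encoder_layers=6, num_decoder_layers=6,
                         batch_first=True)
teacher.eval()  # disable dropout while distilling from the frozen teacher

class AttentionReplacement(nn.Module):
    """Shallow feed-forward stand-in for one self-attention block.

    The whole padded sequence is flattened into one fixed-size vector,
    which only works because sentence length is capped at MAX_LEN.
    """
    def __init__(self, d_model: int, max_len: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                               # (B, L, D) -> (B, L*D)
            nn.Linear(max_len * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, max_len * d_model),
        )
        self.max_len, self.d_model = max_len, d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, self.max_len, self.d_model)

# Distill one teacher self-attention layer into the shallow network.
teacher_attn = teacher.encoder.layers[0].self_attn
student = AttentionReplacement(D_MODEL, MAX_LEN)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for _ in range(100):                          # stand-in for real IWSLT2017 batches
    x = torch.randn(32, MAX_LEN, D_MODEL)     # layer inputs, padded to MAX_LEN
    with torch.no_grad():
        target, _ = teacher_attn(x, x, x, need_weights=False)
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
```

Note that flattening the padded sequence into a single fixed-size vector is what makes the length cap load-bearing: the replacement network’s input dimension is tied directly to MAX_LEN.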
I’m not impressed by that paper:
...
Such short sequence lengths make simple memorization work better, reducing any architectural advantage.
Their Transformer baseline BLEU score for English-French translation is 0.276, vs 0.43 for NLLB-200. Yet their modified designs were strictly worse than their Transformer baselines.