The starting point of our procedure is training the vanilla Transformer model, which consists of six encoders and six decoders. To reduce training time and make testing faster, we reduced the embedding size from 512 to 128. These changes did not drop the score much below the original BLEU score, but they demanded significantly less computational power. This Transformer model was then used as a teacher model for training the feedforward networks.
...
We also adopted a fixed upper bound on sentence length, which we set to 50.
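To make the quoted setup concrete, here is a rough sketch of how a 6+6-layer Transformer with d_model reduced to 128 could act as a teacher for a plain feedforward student under a fixed 50-token cap. This is my own reading, not the authors' code: the vocabulary size, the student's hidden width, the choice of the first encoder layer's self-attention output as the distillation target, and the MSE loss are all assumptions for illustration.

```python
# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn

VOCAB, D_MODEL, MAX_LEN = 32000, 128, 50   # fixed 50-token cap, as quoted

# Teacher: vanilla Transformer, 6 encoder + 6 decoder layers, embeddings shrunk to 128.
teacher = nn.Transformer(d_model=D_MODEL, nhead=8,
                         num_encoder_layers=6, num_decoder_layers=6)
embed = nn.Embedding(VOCAB, D_MODEL)

# Student: a plain feedforward block meant to stand in for one attention layer.
# Only because every sentence is padded/truncated to exactly 50 tokens can the
# whole sequence be flattened into a fixed-size vector for an MLP to process.
student = nn.Sequential(
    nn.Flatten(start_dim=1),                 # (batch, 50, 128) -> (batch, 6400)
    nn.Linear(MAX_LEN * D_MODEL, 2048),      # hidden width is a guess
    nn.ReLU(),
    nn.Linear(2048, MAX_LEN * D_MODEL),
    nn.Unflatten(1, (MAX_LEN, D_MODEL)),     # back to (batch, 50, 128)
)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
x = embed(torch.randint(0, VOCAB, (8, MAX_LEN)))   # dummy batch of 50-token inputs

with torch.no_grad():
    # Stand-in distillation target: the first encoder layer's self-attention output.
    # A real pipeline would hook whichever teacher activations the paper distills.
    q = x.transpose(0, 1)                          # MultiheadAttention expects (L, N, E)
    target = teacher.encoder.layers[0].self_attn(q, q, q)[0].transpose(0, 1)

loss = nn.functional.mse_loss(student(x), target)  # distill by matching activations
loss.backward()
opt.step()
```

The fixed cap is what the Flatten/Unflatten pair depends on: a fixed-size MLP can only stand in for attention because every input is exactly 50 tokens long.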
I’m not impressed by that paper:
...
Such short lengths make simple memorization more effective, eroding any architectural advantage.
Their Transformer baseline BLEU score for English-French translation is 0.276, versus 0.43 for NLLB-200. Yet the results of their modified designs were strictly worse than their own Transformer baseline.