I agree with Eliezer’s recommendation to double-check results in papers that one finds surprising.
So, I looked into the claim of a 10x–100x gain for transformers, using Table 2 from the paper. Detailed results are in this Colab.
Briefly, I don’t think the claim of 10x–100x is well supported. Depending on what exactly you compute, you get anywhere from “no speedup” to “over 300x speedup.” All the estimates you can make have obvious problems, and all show a massive gap between French and German.
In detail:
The appearance of a large speedup is heavily affected by the fact that previous SOTAs were ensembles, and ensembling is a very inefficient way to spend compute.
In terms of simple BLEU / compute, the efficiency gain from transformers looks about 10x smaller if we compare to non-ensembled older models.
Simple BLEU / compute is not a great metric because of diminishing marginal returns.
By this metric, the small transformer is ~6x “better” than the big one!
By this metric, the small transformer has a speedup of ~6x to ~40x over the older models, while the big transformer has a speedup of ~1x to ~6x.
We can try to estimate marginal returns by comparing the two transformer sizes against each other, and the ensembled older models against their non-ensembled versions (this estimate, along with the simple BLEU / compute one, is sketched in the code after this list).
This gives a speedup of ~5x for German and ~100x to ~300x for French.
But this is not an apples-to-apples comparison, as the transformer is scaled while the others are ensembled.
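For concreteness, here is a minimal sketch of roughly what the Colab computes for the two estimates above. This is not the Colab’s actual code: the function names are mine, and the BLEU and training-FLOP figures are the EN-DE column of Table 2 as I remember it, so double-check them against the paper before trusting the exact outputs.

```python
# Rough sketch of the two estimates above, for the EN-DE column of Table 2.
# NOTE: these numbers are from memory, not copied from the paper -- verify
# them against Table 2 (the Colab uses the full table for both languages).

# model name -> (BLEU, training FLOPs)
models = {
    "GNMT+RL":            (24.6, 2.3e19),
    "GNMT+RL ensemble":   (26.3, 1.8e20),
    "ConvS2S":            (25.2, 9.6e18),
    "ConvS2S ensemble":   (26.4, 7.7e19),
    "Transformer (base)": (27.3, 3.3e18),
    "Transformer (big)":  (28.4, 2.3e19),
}

def bleu_per_flop(name):
    """Estimate 1: naive efficiency, BLEU divided by training compute."""
    bleu, flops = models[name]
    return bleu / flops

def marginal_bleu_per_flop(cheap, expensive):
    """Estimate 2: marginal efficiency, extra BLEU bought per extra FLOP
    when moving from the cheaper model to the more expensive one."""
    (b0, f0), (b1, f1) = models[cheap], models[expensive]
    return (b1 - b0) / (f1 - f0)

# Naive BLEU / compute: the ensembles look terrible, and the small
# transformer comes out several times "better" than the big one.
for name in models:
    print(f"{name:22s}  BLEU/FLOP = {bleu_per_flop(name):.2e}")

# Marginal returns: scaling the transformer vs. ensembling an older model.
# Note the apples-to-oranges problem flagged above -- one comparison varies
# model size, the other varies ensembling.
scaling_gain    = marginal_bleu_per_flop("Transformer (base)", "Transformer (big)")
ensembling_gain = marginal_bleu_per_flop("GNMT+RL", "GNMT+RL ensemble")
print("marginal-return speedup vs. GNMT+RL:", scaling_gain / ensembling_gain)
```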
I imagine this question has been investigated much more rigorously outside the original paper. The first Kaplan scaling paper does this for LMs; I dunno who has done it for MT, but I’d be surprised if no one has.
EDIT: something I want to know is why ensembling was popular before transformers, but not after them. If ensembling older models was actually better than scaling them, that would weaken my conclusion a lot.
I don’t know if ensembling vs. scaling has been rigorously tested, either for transformers or older models.