I don’t think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.
I don’t think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.