Until 2017, the best-performing language models were LSTMs, which have been around since 1997. Late-era LSTMs were distinguished from early ones by the addition of attention and a few other mechanisms, though it’s unclear to me how much this boosted their performance.
The paper that unseated LSTMs reported a gain of about 2.0 BLEU points (on a scale from 0 to 100) from switching to the new architecture, though this likely understates the benefit of switching to Transformers, given that the old state-of-the-art models had been tuned very carefully.
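For readers unfamiliar with the metric, BLEU scores a model's output by its n-gram overlap with reference translations, with a penalty for outputs that are too short. Below is a minimal sketch of computing a corpus-level BLEU score with NLTK; the sentences are invented for illustration, and real evaluations use standard test sets and tooling.

```python
# Minimal sketch: corpus-level BLEU with NLTK (pip install nltk).
# The sentences below are made up purely for illustration.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is paired with a list of tokenized reference translations.
references = [
    [["the", "cat", "sat", "on", "the", "mat"]],
    [["transformers", "replaced", "recurrent", "models", "for", "translation"]],
]
hypotheses = [
    ["the", "cat", "sat", "on", "a", "mat"],
    ["transformers", "replaced", "recurrent", "models", "in", "translation"],
]

# NLTK returns BLEU in [0, 1]; papers usually report it multiplied by 100,
# which is the 0-to-100 range mentioned above.
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {100 * score:.1f}")
```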
My guess is that the 2000 model using 2020 compute would easily beat the 2020 model using 2000 compute, though I would love to see someone do a deeper dive into this question.