What changed with the transformer? To some extent, the transformer is really a “smarter” or “better” architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.
But also, it’s feasible to scale transformers much bigger than we could scale the RNNs. You don’t see RNNs as big as GPT-2 or GPT-3 because an RNN has to process the sequence one step at a time (each hidden state depends on the previous one), while a transformer can train on every position in parallel, so training an RNN at that scale would simply take too much compute.
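Here’s a rough toy sketch (my own illustration, not RWKV or any particular model) of that parallelism difference: the RNN forward pass needs a Python-level loop over timesteps because each state depends on the last, whereas causal self-attention over the same sequence is one batched matrix product.

```python
import torch

batch, seq_len, d = 8, 128, 64
x = torch.randn(batch, seq_len, d)

# RNN-style: the time dimension must be walked sequentially, since h[t]
# depends on h[t-1]. This serial chain is what limits training throughput.
W_x, W_h = torch.randn(d, d) * 0.01, torch.randn(d, d) * 0.01
h = torch.zeros(batch, d)
rnn_out = []
for t in range(seq_len):
    h = torch.tanh(x[:, t] @ W_x + h @ W_h)
    rnn_out.append(h)
rnn_out = torch.stack(rnn_out, dim=1)          # (batch, seq_len, d)

# Transformer-style causal self-attention: every position is computed at
# once with dense matrix ops, so the whole sequence trains in parallel.
W_q, W_k, W_v = (torch.randn(d, d) * 0.01 for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = (q @ k.transpose(-2, -1)) / d ** 0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
scores = scores.masked_fill(mask, float("-inf"))
attn_out = torch.softmax(scores, dim=-1) @ v   # (batch, seq_len, d)
```

Both produce a (batch, seq_len, d) output, but only the second one keeps the GPU busy across the whole sequence at once, which is a big part of why transformers were the ones that got scaled up.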
You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren’t following it. It’s an attempt to train an RNN like a transformer. Initial numbers look pretty good.