I’m a little bit skeptical of the argument in “Transformers are not special”—it seems like, if there were other architectures which had slightly greater capabilities than the Transformer, and which were relatively low-hanging fruit, we would have found them already.
I’m in academia, so I can’t say for sure what is going on at big companies like Google. But I assume that, following the 2017 release of the Transformer, they allocated different research teams to pursuing different directions: some research teams for scaling, and others for the development of new architectures. It seems like, at least in NLP, all of the big flashy new models have come about via scaling. This suggests to me that, within companies like Google, the research teams assigned to scaling are experiencing success, while the research teams assigned to new architectures are not.
It genuinely surprises me that the Transformer still has not been replaced as the dominant architecture since 2017. It does not surprise me that sufficiently large or fancy RNNs can achieve similar performance to the Transformer. The lack of Transformer replacements makes me wonder whether we have hit the limit on the effectiveness of autoregressive language models, though I also wouldn’t be surprised if someone comes up with a better autoregressive architecture soon.
I think what’s going on is something like:
Being slightly better isn’t enough to unseat an entrenched option that is well understood. It would probably have to be very noticeably better, particularly in scaling.
I expect the way the internal structure is used will usually dominate the details of the internal structure itself (once you’re already at the pretty-good frontier).
If you’re already extremely familiar with transformers, and you can simply change how you use transformers for possible gains, you’re more likely to do that than to explore a from-scratch technique.
For example, in my research, I’m currently looking into some changes to the outer loop of execution to make language models interpretable by construction. I want to focus on that part of the problem, and I want the research to be easily consumable by other people. Building an entire new architecture from scratch would be a lot of work and would be less familiar to others. So, not surprisingly, I picked a transformer for the internal architecture.
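To make the trade-off concrete, here’s a toy sketch of the general pattern, not my actual setup (the inner model is just stock GPT-2 and the function is made up for illustration): keep the inner architecture completely standard and put the machinery you care about in the outer decoding loop, where every intermediate choice can be recorded and inspected.

```python
# Toy sketch only, not my actual research code: a stock transformer stays as the
# inner model, and the outer decoding loop is where the inspectable structure lives.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate_with_trace(prompt, steps=20, k=5):
    """Greedy decoding, but every step records the top-k alternatives it considered."""
    ids = tok(prompt, return_tensors="pt").input_ids
    trace = []
    for _ in range(steps):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]          # next-token logits
        probs = torch.softmax(logits, dim=-1)
        top = torch.topk(probs, k)
        trace.append([(tok.decode(int(i)), float(p)) for i, p in zip(top.indices, top.values)])
        next_id = top.indices[:1].view(1, 1)           # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0]), trace

text, trace = generate_with_trace("The transformer architecture")
```

The point is that the interesting work happens around the transformer rather than inside it, so swapping in a different inner architecture would be mostly orthogonal effort.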
But I also have other ideas about how it could be done that I suspect would work quite well. Bit hard to justify doing that for safety research, though :P
I think the amount of low-hanging fruit is so high that we can productively investigate transformer derivatives for a long time without diminishing returns. They’re more like a canvas than some fixed Way To Do Things. It’s just also possible someone makes a jump with a non-transformer architecture at some point.
There have been a few papers with architectures that match transformer performance on smaller datasets, with scaling that looks promising. I can tell you that I’ve switched from attention to an architecture loosely based on one of these papers because it performed better on a smallish dataset in my project. I haven’t tested it on any standard vision or language datasets, though, so I don’t have any concrete evidence yet. Nevertheless, my guess is that indeed there is nothing special about transformers.
I’d be interested to see links to those papers!
I’ve messaged you the links. Basically MLPs.
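For anyone else following along, here’s a rough sketch of the general idea: an MLP over the token dimension doing the mixing work that attention normally does. This is a loose, MLP-Mixer-style illustration of that family, not the specific architecture from the papers.

```python
# Rough sketch of an attention-free block in the spirit of MLP-Mixer-style token
# mixing; illustrative only, not the specific architecture from the papers above.
import torch
import torch.nn as nn

class MLPMixerBlock(nn.Module):
    def __init__(self, num_tokens, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # mixes information across token positions (the role attention normally plays)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        # the usual per-token feed-forward, mixing across channels
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):                        # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)        # (batch, dim, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

block = MLPMixerBlock(num_tokens=64, dim=128)
out = block(torch.randn(2, 64, 128))             # same shape in, same shape out
```

One obvious caveat: unlike attention, a token-mixing MLP like this is tied to a fixed sequence length and has no content-dependent weighting, so whether the promising small-scale results hold up at scale is exactly the open question.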