There have been a few papers with architectures that match transformer performance on smaller datasets, with scaling behavior that looks promising. I can tell you that I’ve switched from attention to an architecture loosely based on one of these papers because it performed better on a smallish dataset in my project, but I haven’t tested it on any standard vision or language benchmarks, so I don’t have any concrete evidence yet. Nevertheless, my guess is that there is indeed nothing special about transformers.
I’d be interested to see links to those papers!
I’ve messaged you the links. Basically MLPs.
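For anyone else curious what “basically MLPs” tends to mean here, a minimal sketch of a token-mixing block in the spirit of MLP-Mixer-style architectures is below. The class name, hidden sizes, and token count are my own illustrative assumptions, not the specific architecture from the papers being discussed; the point is just that an MLP applied across the token dimension stands in for self-attention.

```python
# Hedged sketch: a token-mixing MLP block (MLP-Mixer-style), not any specific paper's code.
import torch
import torch.nn as nn


class MLPMixerBlock(nn.Module):
    """Replaces self-attention with an MLP applied across the token dimension,
    followed by the usual per-token (channel) MLP, both with residual connections."""

    def __init__(self, num_tokens: int, dim: int,
                 token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Mixes information *across tokens* -- this is what stands in for attention.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Standard per-token feed-forward, same as in a transformer block.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # token mixing + residual
        x = x + self.channel_mlp(self.norm2(x))    # channel mixing + residual
        return x


# Quick shape check with assumed sizes (196 tokens, 512-dim embeddings).
block = MLPMixerBlock(num_tokens=196, dim=512)
out = block(torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 196, 512])
```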