I think what’s going on is something like:
Being slightly better isn’t enough to unseat an entrenched option that is well understood. It would probably have to be very noticeably better, particularly in scaling.
I expect how the internal structure is used will usually dominate the details of the internal structure itself (once you’re already at the pretty-good frontier).
If you’re already extremely familiar with transformers, and you can simply change how you use them for possible gains, you’re more likely to do that than to explore a from-scratch technique.
For example, in my research, I’m currently looking into some changes to the outer loop of execution to make language models interpretable by construction. I want to focus on that part of it, and I want the research to be easily consumable by other people. Building an entire new architecture from scratch would be a lot of work and would be less familiar to others. So, not surprisingly, I picked a transformer for the internal architecture.
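To make that split concrete, here’s a minimal hypothetical sketch (the model name, function names, and loop shape are all illustrative placeholders, not my actual setup): the internals are a stock pretrained transformer, and the only thing being designed is the outer loop that drives it, with every intermediate state kept as readable text.

```python
# Hypothetical sketch: stock transformer internals, custom outer loop.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def run_internals(prompt: str, max_new_tokens: int = 64) -> str:
    """One call into the unmodified transformer internals."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens.
    new_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

def outer_loop(task: str, steps: int = 3) -> list[str]:
    """The part being redesigned: every intermediate state is plain
    text, so the whole execution trace can be read directly."""
    trace = [task]
    for _ in range(steps):
        trace.append(run_internals("\n".join(trace)))
    return trace
```

Nothing inside the model changes in a sketch like this; the thing you’d inspect is the trace the loop leaves behind.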
But I also have other ideas about how it could be done that I suspect would work quite well. Bit hard to justify doing that for safety research, though :P
I think there’s so much low-hanging fruit that we can productively investigate transformer derivatives for a long time without hitting diminishing returns. Transformers are more like a canvas than some fixed Way To Do Things. It’s just also possible that someone makes a jump with a non-transformer architecture at some point.