That sounds more like my intuition, though obviously there still have to be differences given that we keep using self-attention (quadratic in N) instead of MLPs (linear in N).
In the limit of infinite scaling, the fact that MLPs are universal function approximators guarantees that you can approximate any function with them. But obviously we'd still rather have something that actually works with less-than-infinite amounts of compute.
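To make the scaling point concrete, here's a rough back-of-the-envelope sketch (the dimensions and names are just illustrative, not anyone's actual architecture): attention's cost per layer grows quadratically with context length N, while a per-token MLP's grows linearly.

```python
# Rough per-layer FLOP estimates, just to make the N-scaling concrete.
# d_model and d_ff are illustrative placeholder values.

def attention_flops(n, d_model=1024):
    # Q @ K^T scores and the attention-weighted sum over V are both
    # products involving an (n x n) matrix -> O(n^2 * d_model).
    return 2 * n * n * d_model

def mlp_flops(n, d_model=1024, d_ff=4096):
    # Each token passes independently through two dense layers
    # -> O(n * d_model * d_ff), i.e. linear in n.
    return 2 * n * d_model * d_ff

for n in (1_000, 10_000, 100_000):
    print(n, attention_flops(n) / mlp_flops(n))
# The ratio grows linearly in n: attention's cost per token keeps rising
# with context length, while the MLP's stays constant.
```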