The vanilla Transformer architecture is horrifically computation-inefficient. I really thought it was a terrible idea when I learnt about it. For every single token it processes ALL of the weights in the model and ALL of the context. And a token is less than a word, less than a concept: you generally don't need to consult trivia to fill in grammatical words. On top of that, implementations of it were very inefficient. I was shocked when I read the FlashAttention paper: I had assumed everyone would have implemented attention that way in the first place, since it's the obvious approach if you know anything about memory throughput. (My shock lessened when I looked at the code and saw how tricky it was to incorporate into PyTorch.) Ditto unfused kernels, another inefficiency that exists to allow writing code in Python instead of CUDA/SYCL/etc.
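To make the memory-throughput point concrete, here is a rough sketch (assuming PyTorch 2.x; shapes and sizes are arbitrary) contrasting a naive attention implementation, which materializes the full seq_len x seq_len score matrix in GPU memory, with the fused scaled_dot_product_attention call that can dispatch to a FlashAttention-style kernel:

```python
# Minimal sketch, assuming PyTorch >= 2.0. The naive version below writes the
# full (seq_len x seq_len) score matrix out to memory twice, which is what makes
# it memory-bandwidth-bound; the fused call can dispatch to a FlashAttention-style
# kernel that streams tiles of Q/K/V through on-chip SRAM instead.
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (B, H, N, N): O(N^2) memory
    weights = torch.softmax(scores, dim=-1)                    # another full N x N read/write
    return weights @ v

q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

out_naive = naive_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)  # fused path, never materializes N x N
print(torch.allclose(out_naive, out_fused, atol=1e-5))
```

Both paths compute the same result; the difference is purely in how much data has to round-trip through slow off-chip memory.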
Second point: transformers also seem to be very parameter-inefficient. They have many layers and many attention heads largely so that they can perform multi-step inferences and do a lot in each step if necessary, but mechanistic interpretability studies show that the middle layers do nearly all the work. We now see transformers that share weights between attention heads and across layers, and the performance drop is modest. Then there's the matter of bits per parameter: again, a 10x reduction in precision is a surprisingly small detriment.
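As a toy illustration of the weight-sharing point, here's a sketch of ALBERT-style cross-layer sharing (the layer sizes are arbitrary and this isn't any specific model's configuration): reusing one block N times keeps the depth of computation while cutting the unique parameters by roughly N.

```python
# Hedged sketch of cross-layer weight sharing: one Transformer block applied
# `depth` times instead of `depth` independently parameterized blocks.
import torch
import torch.nn as nn

d_model, nhead, depth = 512, 8, 12

# Standard stack: `depth` independent layers.
independent = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True) for _ in range(depth)]
)

# Shared variant: a single layer's weights, reused `depth` times.
shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

def shared_forward(x, layer=shared_layer, n=depth):
    for _ in range(n):
        x = layer(x)
    return x

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"independent: {count(independent):,} params")   # ~12x the shared count
print(f"shared:      {count(shared_layer):,} params")

x = torch.randn(2, 128, d_model)
y = shared_forward(x)  # same depth of computation, ~1/12 the unique weights
```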
I believe that the large number of parameters in transformers isn't primarily there to store knowledge; it's needed to learn quickly. The parameters perform routing and encode mechanisms (that is, pieces of algorithms), and their sheer number provides a blank slate. Training data seen just once is often remembered because there are so many candidate places to store it that good paths through the network almost certainly exist, paths through which strong gradients can flow to record the information. This is a variant of the Lottery Ticket Hypothesis. A better training algorithm could in theory do the same thing with far fewer parameters, but it would probably look very different from SGD.
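For anyone who hasn't seen the Lottery Ticket procedure, here is a minimal sketch of the idea (the tiny MLP, the 80% pruning rate, and the omitted training loop are all placeholders, not the original paper's setup):

```python
# Rough sketch of the Lottery Ticket experiment: train, keep only the
# largest-magnitude weights, rewind the survivors to their initial values,
# and retrain the sparse "ticket". Everything here is illustrative.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
init_state = copy.deepcopy(model.state_dict())  # remember the initialization

# ... train `model` normally here (omitted) ...

masks = {}
with torch.no_grad():
    for name, p in model.named_parameters():
        if p.dim() > 1:  # prune weight matrices, leave biases alone
            threshold = p.abs().flatten().kthvalue(int(0.8 * p.numel())).values
            masks[name] = (p.abs() > threshold).float()  # keep top ~20% by magnitude

# Rewind surviving weights to their initial values and zero out the rest.
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.copy_(init_state[name] * masks[name])

# Retraining from here, re-applying the masks after each step, is the
# "winning ticket" experiment: a small subnetwork that trains almost as well.
```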
I agree completely with Karpathy. However, I think you misread him: he didn't say that data cleaning is the cause of improvements up until now, he suggested it as a direction for future improvements. And there are already plenty of successful examples of small models improved in that way.
So I'm not the least bit surprised to see a 100x efficiency improvement, and I expect to see another 100x, although probably not as quickly (the low-hanging fruit goes first). If you have 200B parameters, you could probably get away with touching only ~50M of them on average for most tokens. (There are many points where you need to draw on a lot of knowledge, though, and those might pull the average way up.) In 2017, the original ~65M-parameter base Transformer was competitive with the state of the art on English-French translation, and I'm sure it could be made far more efficient today.
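The "50M out of 200B per token" arithmetic is essentially conditional computation. A generic top-2 mixture-of-experts toy (sizes and routing are illustrative, not any production model's) shows how only a sliver of the parameters gets touched per token:

```python
# Toy top-2 mixture-of-experts router: each token is sent to 2 of 64 small
# expert MLPs, so only ~2/64 of the expert parameters are exercised per token.
# All names and sizes here are arbitrary illustration, not a real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 512, 64, 2

experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, 2048), nn.GELU(), nn.Linear(2048, d_model))
     for _ in range(n_experts)]
)
router = nn.Linear(d_model, n_experts)

def moe_forward(x):                      # x: (tokens, d_model)
    gate = F.softmax(router(x), dim=-1)  # routing probabilities per token
    weights, idx = gate.topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            hit = idx[:, slot] == e      # tokens whose slot-th choice is expert e
            if hit.any():
                out[hit] += weights[hit, slot].unsqueeze(-1) * experts[e](x[hit])
    return out

x = torch.randn(16, d_model)
y = moe_forward(x)
# Each token runs through only top_k of the n_experts MLPs, which is the sense
# in which a 200B-parameter model might touch only ~50M parameters per token.
```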