tl;dr: For a hovering aircraft, upward thrust equals weight, but this isn’t what determines engine power.
I’m no expert, but the important distinction is between power and force (thrust). Power is work done (energy transferred) per unit time. If you were gliding slowly at a fixed altitude in a large, light unpowered glider (pretending drag is negligible), or, to be actually realistic, hovering in a blimp, with lift equal to weight, you’d be doing no work! (And neither is gravity.) On the other hand, when a helicopter hovers at a fixed altitude it’s doing a great deal of work accelerating a volume of air downwards. (See also Gravity loss for a rocket.)
Now the interesting part: although for a hovering airplane, blimp or helicopter the upward force produced is equal to the weight, the power needed is different, because the formulas for thrust and power aren’t directly linked. Thrust: $F = \dot{m}v$, where $\dot{m}$ is the mass of air pushed down per second and $v$ is the downward velocity it’s given. To compute the work done on the air, consider the kinetic energy imparted to the air pushed down in one second. Power: $P = \tfrac{1}{2}\dot{m}v^2$. Let’s say your helicopter is 1000 kg, and simplify the gravitational acceleration to 10 m/s², so your weight is 10,000 N. To create an equal upward thrust you could push 100 kg of air per second downwards at 100 m/s… or 1000 kg of air at 10 m/s. But the former requires a power of 500 kW while the latter is only 50 kW! (This is a lower bound on, and directly proportional to, the energy in the fuel the engine must burn.)
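If you want to check the arithmetic, here’s a quick sketch using the round numbers above (1000 kg craft, g taken as 10 m/s²):

```python
# Thrust from accelerating air downwards:  F = mdot * v
# Power = kinetic energy given to that air each second:  P = 0.5 * mdot * v**2

def hover_option(mdot_kg_per_s, v_m_per_s):
    thrust = mdot_kg_per_s * v_m_per_s           # newtons
    power = 0.5 * mdot_kg_per_s * v_m_per_s**2   # watts
    return thrust, power

weight = 1000 * 10  # 10,000 N of thrust needed to hover

for mdot, v in [(100, 100), (1000, 10)]:
    thrust, power = hover_option(mdot, v)
    assert thrust == weight
    print(f"{mdot} kg/s at {v} m/s -> thrust {thrust} N, power {power/1000:.0f} kW")

# 100 kg/s at 100 m/s -> thrust 10000 N, power 500 kW
# 1000 kg/s at 10 m/s -> thrust 10000 N, power 50 kW
```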
So, to be fuel efficient, a helicopter would have to have long blades that turn slowly, moving a large volume of air down slowly. But they don’t; apparently it’s not feasible. I imagine lighter helicopters can be more efficient, though? And I’m not going to do any calculations for fixed-wing aircraft. IANAAE.
This is also why turboprop and turbofan engines are more efficient than plain turbojet engines: they can produce the same thrust while expelling air at a lower velocity, hence with less work done, by using the jet engine to drive a propeller or fan.
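A rough illustration of why (the velocities are made-up round numbers; a real engine analysis would also account for intake velocity and thermal losses):

```python
# For a fixed thrust F = mdot * v, the power spent on the exhaust stream is
# P = 0.5 * mdot * v**2 = 0.5 * F * v, i.e. proportional to exhaust velocity.
# So the same thrust at half the exhaust velocity costs half the power.

def exhaust_power(thrust_n, exhaust_velocity_m_per_s):
    return 0.5 * thrust_n * exhaust_velocity_m_per_s  # watts

F = 50_000  # N, an arbitrary example thrust
print(exhaust_power(F, 600) / 1e6)  # fast turbojet-like exhaust: 15.0 MW
print(exhaust_power(F, 300) / 1e6)  # slower fan-like exhaust:     7.5 MW
```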
The vanilla Transformer architecture is horrifically computationally inefficient. I really thought it was a terrible idea when I learnt about it. On every single token it processes ALL of the weights in the model and ALL of the context. And a token is less than a word, less than a concept. You generally don’t need to consider trivia to fill in grammatical words. On top of that, implementations of it were very inefficient. I was shocked when I read the FlashAttention paper: I had assumed that everyone would have implemented attention that way in the first place; it’s the obvious way to do it if you know anything about memory throughput. (My shock was lessened when I looked at the code and saw how tricky it was to incorporate into PyTorch.) Ditto unfused kernels, another inefficiency that exists to allow writing code in Python instead of CUDA/SYCL/etc.
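A minimal sketch of the difference, assuming PyTorch ≥ 2.0: the naive implementation materialises the full (seq_len × seq_len) score matrix in memory, while `F.scaled_dot_product_attention` can dispatch to a fused, FlashAttention-style kernel that never stores it.

```python
import math
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Naive attention: a full 1024x1024 score matrix per head is written to and read
# back from memory, which is exactly the traffic FlashAttention avoids.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: same result (up to numerical error), and on GPU it can use a
# FlashAttention-style backend that tiles the computation in on-chip memory.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))
```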
Second point: transformers also seem to be very parameter inefficient. They have many layers and many attention heads largely so that they can perform multi-step inferences and do a lot in each step if necessary, but mechanistic interpretability studies show that the middle layers do nearly all the work. We now see transformers with weights shared between attention heads and between layers, and the performance drop is not that large. And there’s also the matter of bits per parameter: again, a 10x reduction in precision is a surprisingly small detriment.
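Cross-layer weight sharing is easy to sketch (in the spirit of ALBERT-style parameter sharing; the sizes here are illustrative, not from any particular model): one block’s parameters are reused for every pass, so depth no longer multiplies the parameter count.

```python
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_passes=12):
        super().__init__()
        # A single encoder layer, applied num_passes times with the same weights.
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):
            x = self.block(x)  # same parameters on every pass
        return x

model = SharedLayerTransformer()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params/1e6:.1f}M parameters for 12 effective layers")  # ~3.2M instead of ~38M
```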
I believe that the large numbers of parameters in transformers aren’t primarily there to store knowledge; they’re needed to learn quickly. They perform routing and encode mechanisms (that is, pieces of algorithms), and their vast number provides a blank slate. Training data seen just once is often remembered because there are so many possible places to store it that it’s highly likely there are good paths through the network through which strong gradients can flow to record the information. This is a variant of the Lottery Ticket Hypothesis. But a better training algorithm could in theory do the same thing with fewer parameters. It would probably look very different from SGD.
I agree completely with Karpathy. However, I think you misread him: he didn’t say that data cleaning is the cause of improvements up until now; he suggested it as an avenue for future improvements. And there are already plenty of successful examples of small models improved in that way.
So I’m not the least bit surprised to see a 100x efficiency improvement, and I expect to see another 100x, although probably not as quickly (the low-hanging fruit goes first). If you have 200B parameters, you could probably get away with processing only around 50M of them, on average, for most tokens. (However, there are many points where you need to draw on a lot of knowledge, and those might pull the average way up.) In 2017, a Transformer with roughly 65M parameters was enough for state-of-the-art English–German translation, and I’m sure it could be made far more efficient today.
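Mixture-of-experts routing (as in Switch-Transformer-style models) is one existing mechanism that already activates only a sliver of the parameters per token. A toy sketch of the idea, with all sizes made up and the gating simplified to hard top-1 routing:

```python
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    """Toy top-1 mixture-of-experts FFN: each token touches one expert's weights."""
    def __init__(self, d_model=256, d_hidden=1024, num_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        choice = self.router(x).argmax(dim=-1)       # which expert each token uses
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])          # only 1/16 of the FFN weights per token
        return out

layer = TopOneMoE()
tokens = torch.randn(32, 256)
print(layer(tokens).shape)  # torch.Size([32, 256])
```

Per-token compute here scales with the size of one expert, not with the total parameter count, which is the same thrust-vs-power style decoupling as above: capacity and per-token work don’t have to move together.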