I read this, and it said:
there are huge low-hanging fruit that any AI or random person designing AI in their garage can find just by grasping in the dark a bit, getting huge improvements at accelerating speeds.
Have we found anything like this at all? Have we seen any “weird tricks” discovered that make AI far more powerful for no apparent reason?
I’m not sure these would be classed as “weird tricks,” and I definitely think they have reasons for working, but some recent architecture changes that one might not expect to work a priori include:
- SwiGLU: Combines a gating mechanism with a Swish activation in the feed-forward layer, adding learnable parameters.
- Grouped Query Attention: Shares Key and Value heads across groups of Query heads, shrinking the KV cache with little quality loss.
- RMSNorm: LayerNorm without the mean-centering step; activations are simply rescaled by their root mean square.
- Rotary Position Embeddings: Rotates query and key vectors by position-dependent angles so attention scores encode relative position.
- Quantization: Stores weights (and sometimes activations) in fewer bits without much drop in performance.
- Flash Attention: Computes exact attention more efficiently by tiling the computation to fit the GPU memory hierarchy.
- Various sparse attention schemes, which attend over only a subset of token pairs.
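Three of these are simple enough to sketch in a few lines of NumPy. This is a toy illustration, not the fused kernels used in practice; the function names and shapes are my own choices, and the RoPE layout (rotating adjacent dimension pairs) follows one common convention among several:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # LayerNorm without the mean-centering: rescale each vector by its
    # root mean square, then apply a learned per-dimension gain.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swiglu(x, W, V):
    # Gated feed-forward unit: SwiGLU(x) = Swish(xW) * (xV), where
    # Swish(z) = z * sigmoid(z) and * is the elementwise product.
    gate = x @ W
    return gate * (1.0 / (1.0 + np.exp(-gate))) * (x @ V)

def rope(x, base=10000.0):
    # Rotary position embedding for x of shape (seq_len, dim), dim even:
    # rotate each (even, odd) dimension pair of the token at position m
    # by the angle m * theta_i, with a different frequency per pair.
    seq_len, dim = x.shape
    theta = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]  # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[:, 1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out
```

Note that because `rope` only rotates pairs of coordinates, it preserves vector norms, and position 0 is left unchanged; in SwiGLU the extra `V` matrix roughly doubles the feed-forward parameters, which is why implementations typically shrink the hidden width to compensate.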