I’m not sure these would be classed as “weird tricks”, and I definitely think they have reasons for working, but some recent architecture changes that one might not expect to work a priori include (rough sketches of each after the list):
SwiGLU: Combines a gating mechanism (GLU) with the Swish/SiLU activation, with learnable projections for both the gate and the value branch.
Grouped Query Attention: Uses fewer Key and Value heads than Query heads.
RMSNorm: LayerNorm without the mean subtraction (and usually without the bias), normalizing only by the root mean square of the features.
Rotary Position Embeddings: Rotates query and key vectors by a position-dependent angle, so relative position shows up directly in the attention dot products.
Quantization: Storing weights (and sometimes activations) in fewer bits without much drop in performance.
Flash Attention: Computes exact attention in tiles so the full attention matrix never has to be written out to slow GPU memory.
Various sparse attention schemes
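To make these a bit more concrete, here are rough PyTorch sketches; these are illustrative only, not anyone’s reference implementation. First, a SwiGLU feed-forward block, assuming the common bias-free LLaMA-style layout:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).
    Sketch only: bias-free projections as in LLaMA-style FFNs (an assumption)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU (Swish) branch multiplicatively gates the plain linear branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```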
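Grouped Query Attention, sketched by explicitly repeating each shared K/V head across its group of query heads; real implementations avoid materialising the repeat:

```python
import math
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head),
    where n_kv_heads divides n_q_heads. Each K/V head serves a whole group of
    query heads. Sketch only: the repeat is materialised for clarity."""
    group_size = q.size(1) // k.size(1)
    k = k.repeat_interleave(group_size, dim=1)  # line K/V heads up with query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v
```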
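RMSNorm, assuming the usual learned elementwise gain and a small epsilon for numerical stability:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm minus the mean subtraction: divide by the root mean square of
    the features and apply a learned gain. The epsilon is a common default."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gain * x / rms
```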
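Rotary Position Embeddings, assuming the adjacent-channel pairing convention and the base of 10000 from the RoFormer paper; in practice this is applied to the per-head queries and keys inside each attention layer:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (..., seq, d) with d even, typically per-head queries or keys.
    Each adjacent pair of channels is rotated by an angle proportional to the
    token's position, so relative offsets become relative rotations that
    survive the attention dot product."""
    seq_len, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = pos[:, None] * inv_freq[None, :]        # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin      # standard 2-D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```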
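Quantization, reduced to its simplest form: symmetric per-tensor rounding of weights to int8. Real schemes (per-channel scales, GPTQ, 4-bit formats) are more elaborate, but the basic trade-off is the same:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric, per-tensor int8 quantization: one float scale per matrix."""
    scale = w.abs().max() / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize_int8(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights, e.g. just before a matmul."""
    return w_q.to(torch.float32) * scale
```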
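And the numerical core of Flash Attention: exact attention computed one key/value block at a time with a running (“online”) softmax, so the full seq-by-seq score matrix is never materialised. The real gains come from doing this inside a fused kernel that keeps each tile in on-chip SRAM, which a few lines of plain PyTorch can’t capture:

```python
import math
import torch

def tiled_attention(q, k, v, block_size: int = 128):
    """q, k, v: (..., seq, d_head). Computes exact softmax attention block by
    block over keys/values, keeping only running row maxima and row sums
    instead of the full attention matrix. Sketch of the algorithm only."""
    scale = 1.0 / math.sqrt(q.size(-1))
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"),
                         dtype=q.dtype, device=q.device)
    row_sum = torch.zeros(q.shape[:-1] + (1,), dtype=q.dtype, device=q.device)
    for start in range(0, k.size(-2), block_size):
        k_blk = k[..., start:start + block_size, :]
        v_blk = v[..., start:start + block_size, :]
        scores = (q @ k_blk.transpose(-2, -1)) * scale       # (..., seq_q, block)
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        correction = torch.exp(row_max - new_max)            # rescale old partial sums
        probs = torch.exp(scores - new_max)
        out = out * correction + probs @ v_blk
        row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum
```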