I’m not sure these would be classed as “weird tricks”, and I definitely think they have reasons for working, but some recent architecture changes that one might not expect to work a priori include (rough sketches of each after the list):
SwiGLU: Combines a gating mechanism (GLU) with the Swish/SiLU activation, with learnable projections for both the gate and the value branch.
Grouped Query Attention: Uses fewer Key and Value heads than Query heads.
RMSNorm: LayerNorm without the mean subtraction (and usually without the bias), normalizing only by the root mean square of the features.
Rotary Position Embeddings: Rotates query and key vectors by a position-dependent angle, so relative position shows up directly in the attention dot products.
Quantization: Storing weights (and sometimes activations) in fewer bits without much drop in performance.
Flash Attention: Computes exact attention in tiles so the full attention matrix never has to be written out to slow GPU memory.
Various sparse attention schemes
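To make these a bit more concrete, here are rough PyTorch sketches; these are illustrative only, not anyone’s reference implementation. First, a SwiGLU feed-forward block, assuming the common bias-free LLaMA-style layout:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).
    Sketch only: bias-free projections as in LLaMA-style FFNs (an assumption)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU (Swish) branch multiplicatively gates the plain linear branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```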
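Grouped Query Attention, sketched by explicitly repeating each shared K/V head across its group of query heads; real implementations avoid materialising the repeat:

```python
import math
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head),
    where n_kv_heads divides n_q_heads. Each K/V head serves a whole group of
    query heads. Sketch only: the repeat is materialised for clarity."""
    group_size = q.size(1) // k.size(1)
    k = k.repeat_interleave(group_size, dim=1)  # line K/V heads up with query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v
```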
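RMSNorm, assuming the usual learned elementwise gain and a small epsilon for numerical stability:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm minus the mean subtraction: divide by the root mean square of
    the features and apply a learned gain. The epsilon is a common default."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gain * x / rms
```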
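Rotary Position Embeddings, assuming the adjacent-channel pairing convention and the base of 10000 from the RoFormer paper; in practice this is applied to the per-head queries and keys inside each attention layer:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (..., seq, d) with d even, typically per-head queries or keys.
    Each adjacent pair of channels is rotated by an angle proportional to the
    token's position, so relative offsets become relative rotations that
    survive the attention dot product."""
    seq_len, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = pos[:, None] * inv_freq[None, :]        # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin      # standard 2-D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```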
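Quantization, reduced to its simplest form: symmetric per-tensor rounding of weights to int8. Real schemes (per-channel scales, GPTQ, 4-bit formats) are more elaborate, but the basic trade-off is the same:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric, per-tensor int8 quantization: one float scale per matrix."""
    scale = w.abs().max() / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize_int8(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights, e.g. just before a matmul."""
    return w_q.to(torch.float32) * scale
```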
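And the numerical core of Flash Attention: exact attention computed one key/value block at a time with a running (“online”) softmax, so the full seq-by-seq score matrix is never materialised. The real gains come from doing this inside a fused kernel that keeps each tile in on-chip SRAM, which a few lines of plain PyTorch can’t capture:

```python
import math
import torch

def tiled_attention(q, k, v, block_size: int = 128):
    """q, k, v: (..., seq, d_head). Computes exact softmax attention block by
    block over keys/values, keeping only running row maxima and row sums
    instead of the full attention matrix. Sketch of the algorithm only."""
    scale = 1.0 / math.sqrt(q.size(-1))
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"),
                         dtype=q.dtype, device=q.device)
    row_sum = torch.zeros(q.shape[:-1] + (1,), dtype=q.dtype, device=q.device)
    for start in range(0, k.size(-2), block_size):
        k_blk = k[..., start:start + block_size, :]
        v_blk = v[..., start:start + block_size, :]
        scores = (q @ k_blk.transpose(-2, -1)) * scale       # (..., seq_q, block)
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        correction = torch.exp(row_max - new_max)            # rescale old partial sums
        probs = torch.exp(scores - new_max)
        out = out * correction + probs @ v_blk
        row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum
```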