LawrenceC comments on LLM Basics: Embedding Spaces—Transformer Token Vectors Are Not Points in Space

LawrenceC 13 Feb 2023 21:41 UTC
3 points
0
I agree that there are many reasons why directions matter, but clearly distance would matter too in the softmax case!
Also, without layernorm, intermediate components of the network could “care more” about the magnitude of the residual stream (whereas it only matters for the unembed here), while for networks with layernorm the intermediate components literally do not have access to magnitude data!
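As a rough illustration of both points, here is a minimal pure-Python sketch. The vector `x`, the matrix `W_U`, and the helper names are made up for the example, and it assumes the unembed is applied directly to the residual stream (no final layernorm) and that layernorm has no learned gain or bias. Scaling the residual vector by a positive constant leaves the layernorm output unchanged, but it scales the logits and so sharpens the softmax.

```python
import math

def layer_norm(x, eps=0.0):
    # Plain layernorm without learned gain/bias. With eps=0 the output is
    # exactly invariant to positive rescaling of x; a small eps > 0 (as used
    # in practice) makes the invariance approximate.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(x, W):
    # Logits = W @ x, one row of W per (hypothetical) vocabulary item.
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

# Hypothetical residual-stream vector and a 10x larger copy of it.
x = [1.0, -2.0, 0.5, 3.0]
scaled = [10.0 * v for v in x]

# Hypothetical unembedding matrix (3 "tokens", 4 dimensions).
W_U = [
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.1, -0.1, 0.05],
    [-0.1, 0.05, 0.0, 0.0],
]

# Anything reading the stream through layernorm sees the same input.
ln_same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(layer_norm(x), layer_norm(scaled))
)

# The unembed + softmax does see the magnitude: same argmax, more confidence.
p_small = softmax(unembed(x, W_U))
p_large = softmax(unembed(scaled, W_U))

print("layernorm outputs match:", ln_same)
print("softmax on x:           ", p_small)
print("softmax on 10 * x:      ", p_large)
```

The top token is the same in both cases, but the distribution for `10 * x` is much more peaked, which is the sense in which distance matters under softmax.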
- Marius Hobbhahn 13 Feb 2023 21:52 UTC
  4 points
  0
  Fair. You convinced me that the effect is determined more by layernorm than by cross-entropy.