Could you elaborate on the role you think layernorm is playing? You’re not the first person to suggest this, and I’d be interested to explore further. Thanks!
Any time the embeddings / residual stream vectors are used for anything, layernorm first projects them onto the surface of an (n−1)-dimensional hypersphere. This changes the geometry.
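To make the projection concrete, here is a minimal sketch in plain Python. It assumes layernorm without the learned scale and bias, and the names (layer_norm, n, x) are illustrative only, not taken from any particular model. Under those assumptions the output always has norm close to sqrt(n) regardless of the input's magnitude, so only direction survives. Because the mean is subtracted first, the outputs also lie in the hyperplane of zero-mean vectors, so the exact dimension count of the sphere depends on convention.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias (an assumption for this sketch)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

random.seed(0)
n = 64  # hypothetical residual stream width
x = [random.gauss(0.0, 3.0) for _ in range(n)]

y = layer_norm(x)
norm_y = math.sqrt(sum(v * v for v in y))  # close to sqrt(n)
mean_y = sum(y) / n                        # close to 0

# Rescaling the input barely changes the output: magnitude is discarded.
scaled = layer_norm([10.0 * v for v in x])
max_diff = max(abs(a - b) for a, b in zip(y, scaled))

print(norm_y, math.sqrt(n), mean_y, max_diff)
```

With a learned scale and bias the sphere is stretched into an ellipsoid and shifted, but the input magnitude is still thrown away.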