What exactly do you have in mind here?
Oh, just that it preserves distance/L2 norm/angles/orthogonality. I find that this is often an important intuition, since a bunch of operations in transformers depend on orthogonality/norm. In particular, norm is useful for thinking about weight decay, and as a rough heuristic for, e.g., how much information is transferred.
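A quick numerical sketch of the norm/angle point, in NumPy (the matrices and dimensions here are just illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Build a random orthonormal matrix (a rotation/reflection) via QR.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)
y = rng.normal(size=d)

# Orthonormal maps preserve L2 norms and dot products (hence angles)...
assert np.allclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
assert np.allclose((Q @ x) @ (Q @ y), x @ y)

# ...whereas an arbitrary invertible change of basis generally does not.
M = rng.normal(size=(d, d))
print(np.linalg.norm(M @ x), np.linalg.norm(x))  # typically differ
```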
Probably most important is that whenever a layer tries to read a feature from the residual stream, it's basically projecting the residual stream onto a single dimension, which hopefully corresponds to that feature, and, importantly, ignoring all orthogonal dimensions. The projection operation is invariant under rotations (orthonormal changes of basis) but NOT under arbitrary changes of basis.
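And a minimal sketch of the "reading a feature = projection" point, again in NumPy; `resid` and `feature_dir` are hypothetical stand-ins for a residual stream vector and a read direction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

resid = rng.normal(size=d)        # residual stream vector (toy example)
feature_dir = rng.normal(size=d)  # hypothetical read direction for a feature
feature_dir /= np.linalg.norm(feature_dir)

# Reading the feature = projecting the residual stream onto one direction.
read = feature_dir @ resid

# Rotate everything: the residual stream and the read direction together.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
read_rotated = (Q @ feature_dir) @ (Q @ resid)
assert np.allclose(read, read_rotated)  # projection unchanged under rotation

# Under an arbitrary invertible change of basis the readout changes,
# since (M f) . (M x) = f . (M^T M) x != f . x in general.
M = rng.normal(size=(d, d))
read_skewed = (M @ feature_dir) @ (M @ resid)
print(read, read_skewed)  # typically differ
```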