My quick take would be that this difference is a result of pre-layer normalisation versus post-layer normalisation. With pre-layer norm you can't have dimensions in your embeddings with significantly larger entries than the rest, because all the small entries would be normed to hell. But with post-layer normalisation some dimensions might end up with systematically high entries (possibly corrected immediately afterwards by a bias term?). Always having high entries in the same dimensions makes all vectors very similar.
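A small numpy sketch of both effects (the dimension count and the size of the "rogue" entries are made-up illustration values, not taken from any particular model): standardising a vector dominated by one huge dimension squashes its small entries, and a dimension that is systematically high across all vectors drives their pairwise cosine similarities up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 512

def layer_norm(x):
    # LayerNorm without the learned scale/bias: standardise each vector
    return (x - x.mean(-1, keepdims=True)) / x.std(-1, keepdims=True)

# (1) pre-norm side: one huge entry "norms the small entries to hell"
x = rng.standard_normal(d)
print(np.abs(layer_norm(x)).mean())        # ~0.8: entries at a healthy scale
x[0] = 1000.0                              # hypothetical rogue dimension
print(np.abs(layer_norm(x)[1:]).mean())    # ~0.04: the other entries squashed

# (2) post-norm side: systematically high entries in the same dimension
def mean_pairwise_cosine(v):
    u = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = u @ u.T
    return (sims.sum() - n) / (n * (n - 1))  # average over distinct pairs

vecs = rng.standard_normal((n, d))
print(mean_pairwise_cosine(vecs))          # ~0: random directions
vecs[:, 0] += 50.0                         # shared high entry in dimension 0
print(mean_pairwise_cosine(vecs))          # ~0.8: all vectors look alike
```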