My quick take would be that this difference is a result of pre-layer normalisation versus post-layer normalisation. With pre-layer norm you can't have dimensions in your embeddings with significantly larger entries than the rest, because all the small entries would be normed to hell. But with post-layer normalisation some dimensions might end up with systematically high entries (possibly corrected immediately afterwards by a bias term?). Always having high entries in the same dimensions makes all vectors very similar.
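A small numpy sketch of both effects (the dimension count and the size of the "rogue" entries are made-up illustration values, not taken from any particular model): standardising a vector dominated by one huge dimension squashes its small entries, and a dimension that is systematically high across all vectors drives their pairwise cosine similarities up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 512

def layer_norm(x):
    # LayerNorm without the learned scale/bias: standardise each vector
    return (x - x.mean(-1, keepdims=True)) / x.std(-1, keepdims=True)

# (1) pre-norm side: one huge entry "norms the small entries to hell"
x = rng.standard_normal(d)
print(np.abs(layer_norm(x)).mean())        # ~0.8: entries at a healthy scale
x[0] = 1000.0                              # hypothetical rogue dimension
print(np.abs(layer_norm(x)[1:]).mean())    # ~0.04: the other entries squashed

# (2) post-norm side: systematically high entries in the same dimension
def mean_pairwise_cosine(v):
    u = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = u @ u.T
    return (sims.sum() - n) / (n * (n - 1))  # average over distinct pairs

vecs = rng.standard_normal((n, d))
print(mean_pairwise_cosine(vecs))          # ~0: random directions
vecs[:, 0] += 50.0                         # shared high entry in dimension 0
print(mean_pairwise_cosine(vecs))          # ~0.8: all vectors look alike
```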