Cool that you figured that out; that easily explains the high cosine similarity! A large constant offset to all the embeddings still seems interesting to me, though, since it means GPT-Neo's later layers have to do computation that accounts for it, which doesn't seem like an efficient decision at all. I will def poke around more.
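(For anyone who wants to reproduce this, here's a minimal sketch of the check, assuming the HuggingFace `transformers` library and the `EleutherAI/gpt-neo-125M` checkpoint, and assuming "constant offset" means the mean embedding vector; not necessarily the exact method used above.)

```python
# Minimal sketch: does subtracting the mean embedding vector
# kill the high pairwise cosine similarity of GPT-Neo's embeddings?
# Assumptions: transformers installed, checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
W_E = model.get_input_embeddings().weight.detach()  # [vocab_size, d_model]

def mean_pairwise_cos(x: torch.Tensor, n: int = 1000) -> float:
    # Estimate mean pairwise cosine similarity on a random token subset.
    idx = torch.randperm(x.shape[0])[:n]
    sub = torch.nn.functional.normalize(x[idx], dim=-1)
    sims = sub @ sub.T
    off_diag = sims[~torch.eye(n, dtype=torch.bool)]  # drop self-similarity
    return off_diag.mean().item()

print("raw embeddings:  ", mean_pairwise_cos(W_E))
print("mean-subtracted: ", mean_pairwise_cos(W_E - W_E.mean(dim=0)))
```

If the first number is large and the second drops toward zero, the shared offset is doing most of the work.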
Interesting on MLP0 (I swear I use zero-indexing lol, just got momentarily confused)! Does that hold across the different GPT sizes?
Haven’t checked lol