We are surprised by the decrease in Residual Stream norm in some of the EleutherAI models. ... According to the model card, the Pythia models have “exactly the same” architectures as their OPT counterparts
I could very well be completely wrong here, but I suspect this could primarily be an artifact of different unembeddings.
It seemed to me from the model card that although the Pythia models have “exactly the same” architecture, they only have the same number of non-embedding parameters. The Pythia models all have more total parameters than their counterparts and therefore more embedding parameters, implying that they’re using a different embedding/unembedding scheme. In particular, the EleutherAI models use the GPT-NeoX-20B tokenizer instead of the GPT-2 tokenizer (they also use rotary embeddings, which I don’t expect to matter as much).
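To make the parameter-count argument concrete, here is a rough sketch of how the embedding/unembedding parameter count could differ between two models with the same non-embedding parameters. I haven't checked which of these factors actually applies here: the vocabulary sizes, `d_model`, and the tied/untied choice below are placeholder assumptions, not numbers from any model card, and the function ignores positional embeddings entirely.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the token embedding plus unembedding matrices.

    If the unembedding is tied to the embedding, one matrix serves both
    roles; otherwise there are two separate vocab_size x d_model matrices.
    """
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix


# Placeholder numbers, purely illustrative (NOT taken from any model card).
d_model = 1024
tied_small_vocab = embedding_params(50_000, d_model, tied=True)
untied_larger_vocab = embedding_params(50_500, d_model, tied=False)

# Extra embedding parameters in the second hypothetical model.
extra = untied_larger_vocab - tied_small_vocab
print(tied_small_vocab, untied_larger_vocab, extra)
```

The point is only that a different tokenizer or unembedding scheme changes the total count without touching the non-embedding parameters.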
In addition, all the decreases in Residual Stream norm occur in the last 2 layers, which is exactly where I would’ve expected to see artifacts of the embedding/unembedding process[1]. I’m not familiar enough with the differences in the tokenizers to have predicted the decreasing Residual Stream norm ex ante, but it seems kinda likely ex post that this large systematic difference in the EleutherAI models’ norms comes from them using a different tokenizer.
I also would’ve expected to see these artifacts in the first layer, which we don’t really see, so take this with a grain of salt, I guess. I do still think this is pretty characteristic of “SGD trying its best to deal with unembedding shenanigans by doing weird things in the last layer or two, leaving the rest mostly untouched,” but this might just be me pattern-matching to a bad internal narrative/trope I’ve developed.
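For anyone who wants to check where the decreases fall, here is a minimal sketch of the measurement I have in mind: the mean L2 norm of the residual stream at each layer, plus the layers where that norm drops relative to the previous one. The function names and the toy data are my own, and the sketch assumes you have already extracted per-layer residual vectors as nested lists; it is not the post's actual analysis code.

```python
import math


def mean_norm_per_layer(hidden_states):
    """Mean L2 norm over token positions, one value per layer.

    hidden_states[layer][position] is the residual vector (list of floats)
    at that layer and position.
    """
    norms = []
    for layer in hidden_states:
        layer_norms = [math.sqrt(sum(x * x for x in vec)) for vec in layer]
        norms.append(sum(layer_norms) / len(layer_norms))
    return norms


def layers_with_decrease(norms):
    """Indices i where the mean norm at layer i is below that of layer i - 1."""
    return [i for i in range(1, len(norms)) if norms[i] < norms[i - 1]]


# Toy data: 4 "layers", 2 positions, 2 dimensions (hypothetical values).
toy = [
    [[3.0, 4.0], [6.0, 8.0]],
    [[6.0, 8.0], [9.0, 12.0]],
    [[9.0, 12.0], [12.0, 16.0]],
    [[3.0, 4.0], [3.0, 4.0]],
]
norms = mean_norm_per_layer(toy)
drops = layers_with_decrease(norms)
print(norms, drops)
```

If the tokenizer/unembedding story is right, I'd expect `drops` on the real models to contain only the last one or two layer indices.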