Why does GPT-3 use the same matrix for word embedding and for the final predictions? I would expect this to constrain the model, and the only potential upsides I can see are saving parameters (lol) and preserving interpretability (lmao)[8]. Other resources, like A Mathematical Framework for Transformer Circuits, use different embedding and unembedding matrices (their W_E and W_U). Perhaps this is not necessary for GPT-3, since the final feed-forward network can perform an appropriate linear transformation, and in A Mathematical Framework they are looking at transformers without FFNs. But some properties (e.g. words being linear combinations of other words) cannot be changed by such a linear transformation, so having an entirely new unembedding matrix could still add value.
This is called “tied embeddings” (or “weight tying”). You’re right that models don’t need to have this constraint, and some don’t; GPT-NeoX, for instance, uses separate embedding and unembedding matrices. I’m not sure whether tying actually improves performance in practice, though.
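For concreteness, here is a minimal sketch (in PyTorch) of how tying is typically implemented in GPT-2/GPT-3-style code: the unembedding layer simply reuses the embedding weight, so only one vocab_size × d_model matrix is learned. The module and parameter names here (`TinyLM`, `tie_embeddings`) are illustrative, not from any particular codebase, and the transformer blocks are omitted.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Embedding + unembedding only; transformer blocks omitted for brevity."""

    def __init__(self, vocab_size: int, d_model: int, tie_embeddings: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # W_E: token id -> residual stream
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)  # W_U: residual stream -> logits
        if tie_embeddings:
            # Weight tying: the unembedding reuses W_E, so embedding and
            # unembedding are the same (vocab_size, d_model) parameter.
            self.unembed.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (batch, seq, d_model)
        return self.unembed(x)      # (batch, seq, vocab_size) logits

tied = TinyLM(vocab_size=50257, d_model=768, tie_embeddings=True)
untied = TinyLM(vocab_size=50257, d_model=768, tie_embeddings=False)

# parameters() deduplicates tied weights, so the untied model reports
# roughly twice as many parameters here.
print(sum(p.numel() for p in tied.parameters()))
print(sum(p.numel() for p in untied.parameters()))
```

For scale: an extra unembedding matrix for GPT-3 (vocab ~50K, d_model 12288) would be on the order of 600M parameters, which is noticeable but small next to the 175B total, so the parameter saving really is a fairly marginal upside.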