Maybe lm_head was set to be equal to wte transpose?
Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.
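To make the weight-tying concrete, here is a toy numpy sketch (not the actual GPT-2 code): one matrix `wte` is used both for the input embedding lookup and, transposed, to produce output logits.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 10, 4

# Shared token-embedding matrix ("wte"), shape (vocab, d_model).
wte = rng.normal(size=(vocab, d_model))

# Input side: embedding lookup for a few token ids.
tokens = np.array([3, 7, 1])
h = wte[tokens]        # (3, d_model); stand-in for the transformer output

# Output side: the SAME matrix, transposed, gives logits over the vocab.
logits = h @ wte.T     # (3, vocab)

assert logits.shape == (3, vocab)
```

In the real model `h` would be the final-layer hidden states, but the key point survives in the toy version: there is no separate `lm_head` parameter, only `wte` used twice.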
In OpenAI’s tensorflow code, see lines 154 and 171 of src/model.py. The variable “wte” is defined on 154, then re-used on 171.
In the original GPT paper, see eq. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the place to look.)
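For reference, the equations in question (written from memory, so check against the paper) have the form:

$$
\begin{aligned}
h_0 &= U W_e + W_p \\
h_l &= \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) &= \mathrm{softmax}(h_n W_e^{\top})
\end{aligned}
$$

Note that $W_e$ appears in both the first line (input embedding) and the last line (output projection), while the position embedding $W_p$ appears only on the input side.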
Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the “head” used to convert the final layer outputs into predictions. The intent is easy swapping between different “heads” with the same “body” beneath.
This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.
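The body/head split can be sketched in a few lines. This is a hypothetical toy pattern, not the huggingface implementation: the head is a separate object, so it *can* hold its own matrix, and tying is just the special case where it is handed the body's `wte`.

```python
import numpy as np

class Body:
    """Toy stand-in for a transformer body: here, just the embedding lookup."""
    def __init__(self, vocab, d_model, seed=0):
        self.wte = np.random.default_rng(seed).normal(size=(vocab, d_model))

    def forward(self, tokens):
        return self.wte[tokens]          # (len(tokens), d_model)

class LMHead:
    """Separate head; may or may not share its weight with the body."""
    def __init__(self, weight):
        self.weight = weight             # (vocab, d_model)

    def forward(self, h):
        return h @ self.weight.T         # logits over the vocab

body = Body(vocab=10, d_model=4)

# Untied: the head owns an independent matrix (what the split makes possible).
untied = LMHead(np.random.default_rng(1).normal(size=(10, 4)))

# Tied: the head reuses the body's wte, as the GPT-2 spec requires.
tied = LMHead(body.wte)
assert tied.weight is body.wte
```

With this structure the tied case looks like an incidental configuration choice rather than part of the architecture, which is exactly why the weight sharing is easy to miss when reading the code.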
Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn’t a critique of the latter, which is trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.
Thanks for the info.
This was a great read, very informative.