Hey I’m not finished reading this yet but I noticed something off about what you said.
At the end, the final 1600-dimensional vector is multiplied by W’s transpose to project back into vocab space.
This isn’t quite right. They don’t multiply by W’s transpose at the end. Rather there is a completely new matrix at the end, whose shape is the same as the transpose of W.
You can see this in huggingface’s code for GPT-2. In the class GPT2LMHeadModel, the final matrix multiplication is performed by the matrix called “lm_head”, whereas the matrix you call W, which maps 50,257-dimensional vectors into 1600-dimensional space, is called “wte” (found in the GPT2Model class). You can see from the code that wte has shape “Vocab Size x Embed Size” while lm_head has shape “Embed Size x Vocab Size”, so lm_head has the same shape as W transpose but doesn’t have the same numbers.
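To make the shape claim concrete, here’s a toy numpy sketch (dimensions shrunk for readability; GPT-2 XL’s actual values are 50,257 and 1,600 — the variable names just mirror the huggingface ones):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 11, 4  # toy sizes; GPT-2 XL uses 50257 and 1600

# "wte": one row per vocabulary token (shape: Vocab Size x Embed Size)
wte = rng.normal(size=(vocab_size, embed_size))

# embedding lookup for token t is equivalent to one_hot(t) @ wte
t = 3
one_hot = np.zeros(vocab_size)
one_hot[t] = 1.0
assert np.allclose(one_hot @ wte, wte[t])

# an independent "lm_head" (shape: Embed Size x Vocab Size) has the
# same shape as wte.T, but -- if untied -- not the same numbers
lm_head = rng.normal(size=(embed_size, vocab_size))
assert lm_head.shape == wte.T.shape
assert not np.allclose(lm_head, wte.T)
```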
Edit: I could be wrong here, though. Maybe lm_head was set to be equal to wte transpose? I’m looking through the GPT-2 paper but don’t see anything like that mentioned.
Maybe lm_head was set to be equal to wte transpose?
Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.
In OpenAI’s tensorflow code, see lines 154 and 171 of src/model.py. The variable “wte” is defined on 154, then re-used on 171.
In the original GPT paper, see eqs. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the place to look.)
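The tied version described in eqs. (2) — the same W_e on the way in and (transposed) on the way out — looks like this in a toy numpy sketch (dimensions shrunk; nothing here is the real model, just the weight-sharing pattern):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 11, 4  # toy sizes; GPT-2 XL uses 50257 and 1600

W_e = rng.normal(size=(vocab_size, embed_size))  # the one shared matrix

def embed(token_id):
    # input side: one_hot(token) @ W_e, i.e. a row lookup
    return W_e[token_id]

def unembed(h):
    # output side: the *same* matrix, transposed -- weight tying per eqs. (2)
    return h @ W_e.T

h = rng.normal(size=embed_size)  # stand-in for a final-layer hidden state
logits = unembed(h)
assert logits.shape == (vocab_size,)
```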
Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the “head” used to convert the final layer outputs into predictions. The intent is easy swapping between different “heads” with the same “body” beneath.
This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.
Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn’t a critique of the latter; it’s trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.
Thanks for the info.
This was a great read, very informative.