Also, what vectors are you using? Is this the final output layer?
I suggest trying the vectors in encoder layers 0-48 of GPT2-xl. I'm getting the impression that the visualization of those layers is more of a submerged iceberg than a helix...
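If it helps, here's a minimal sketch of pulling per-layer vectors out of GPT2-xl. The Hugging Face `transformers` calls (commented out, since they need a model download) are the standard way to get all 49 hidden states (the embedding output plus the 48 blocks); the helper and the synthetic stand-in below are just illustrative:

```python
import numpy as np

def layer_matrix(hidden_states, layer):
    """Return the (seq_len, d_model) activation matrix for one layer,
    dropping the batch dimension."""
    return np.asarray(hidden_states[layer][0])

# With Hugging Face transformers you would do roughly:
#   from transformers import GPT2Tokenizer, GPT2Model
#   tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
#   model = GPT2Model.from_pretrained("gpt2-xl")
#   out = model(**tok("Hello world", return_tensors="pt"),
#               output_hidden_states=True)
#   hs = [h.detach().numpy() for h in out.hidden_states]  # 49 entries
#   X = layer_matrix(hs, 10)  # vectors after block 10

# Synthetic stand-in: 49 "layers", batch 1, 5 tokens, 1600 dims (GPT2-xl width)
fake_hs = [np.random.randn(1, 5, 1600) for _ in range(49)]
X = layer_matrix(fake_hs, 10)
print(X.shape)
```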
Just an intuition, but GPT-2 uses the GELU activation, whose tanh approximation involves a factor of sqrt(2/pi).
Btw, what visualization software did you use here again?
Python (the matplotlib package).
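For anyone wanting to reproduce this kind of plot: a sketch of projecting an embedding matrix onto its top 3 principal components (PCA via numpy's SVD) and scattering it in matplotlib's 3-D axes. The matrix here is a random stand-in; swap in the real positional-embedding weights:

```python
import numpy as np

def pca_3d(X):
    """Project the rows of X onto their top 3 principal components."""
    Xc = X - X.mean(axis=0)                # center the rows
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T                   # (n, 3) coordinates

# Synthetic stand-in for a (n_positions, d_model) embedding matrix;
# replace with the actual pos_embed weights to see the real shape.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 768))
P = pca_3d(X)

try:
    import matplotlib
    matplotlib.use("Agg")                  # headless backend; drop for interactive use
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # color by position index so any helix / iceberg structure stands out
    ax.scatter(P[:, 0], P[:, 1], P[:, 2], c=np.arange(len(P)), s=4)
    fig.savefig("pos_embed_pca.png")
except ImportError:
    pass                                   # plotting is optional for the projection
```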
Thank you. Will try it in the project I'm working on!
Nope, this is the pos_embed matrix! So before the first layer.
I see. I'll try this, thanks!