This is a surprising and fascinating result. Do you have attention plots of all 144 heads you could share?
I’m particularly interested in the patterns for all heads in layers 0 and 1 that match the following caption:
(Left: a 50x50 submatrix of LXHY’s attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY’s attention pattern, when positional embeddings are averaged as described above.)
Here are the plots you asked for, for all heads! You can find them at:
https://github.com/adamyedidia/resid_viewer/tree/main/experiments/pngs
I haven’t looked too carefully yet, but it looks like averaging the positional embeddings makes little difference for most heads, though it is important for L0H4 and L0H7.
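In case it helps anyone poking at these, here is a minimal sketch of how one might generate a comparison plot like the ones in the linked pngs, using TransformerLens. This is not the actual resid_viewer code: the exact averaging scheme (here, replacing every positional embedding with the mean positional embedding) and the choice of taking the first 50 query/key positions are assumptions on my part.

```python
# Minimal sketch (not the actual resid_viewer code) of a Left/Right comparison
# plot for one head's attention pattern, with and without averaged positional
# embeddings. Assumptions: GPT-2 small (12 layers x 12 heads = 144), and the
# 50x50 submatrix is the first 50 query/key positions of the prompt.
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "..."  # substitute a prompt from openwebtext-10k
tokens = model.to_tokens(prompt)
LAYER, HEAD = 0, 4  # e.g. L0H4

# Baseline attention pattern: [batch, head, query_pos, key_pos].
_, cache = model.run_with_cache(tokens)
baseline = cache["pattern", LAYER][0, HEAD, :50, :50]

def average_pos_embed(pos_embed, hook):
    # Replace each position's embedding with the mean over positions
    # (one possible reading of "positional embeddings are averaged").
    return pos_embed.mean(dim=-2, keepdim=True).expand_as(pos_embed).clone()

# Re-run with the intervention attached to the positional-embedding hook point.
model.add_hook("hook_pos_embed", average_pos_embed)
_, avg_cache = model.run_with_cache(tokens)
model.reset_hooks()
averaged = avg_cache["pattern", LAYER][0, HEAD, :50, :50]

# Side-by-side comparison, mirroring the Left/Right layout in the caption.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(baseline.detach().cpu().numpy(), cmap="viridis")
axes[0].set_title(f"L{LAYER}H{HEAD}: original pos embeds")
axes[1].imshow(averaged.detach().cpu().numpy(), cmap="viridis")
axes[1].set_title(f"L{LAYER}H{HEAD}: averaged pos embeds")
fig.savefig(f"L{LAYER}H{HEAD}.png")
```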
Thank you! I’m still surprised by how little most heads in L0 + L1 seem to be using the positional embeddings. L1H4 looks reasonably uniform, so I could accept that maybe that somehow feeds into L2H2.