This is a surprising and fascinating result. Do you have attention plots of all 144 heads you could share?
I’m particularly interested in the patterns for all heads in layers 0 and 1 that match the following caption:
(Left: a 50x50 submatrix of LXHY’s attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY’s attention pattern, when positional embeddings are averaged as described above.)
Here are the plots you asked for, for all heads! You can find them at:
https://github.com/adamyedidia/resid_viewer/tree/main/experiments/pngs
I haven’t looked too carefully yet, but it looks like averaging the positional embeddings makes little difference for most heads, though it is important for L0H4 and L0H7.
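In case it helps anyone poking at these, here is a minimal sketch of how one might generate a comparison plot like the ones in the linked pngs, using TransformerLens. This is not the actual resid_viewer code: the exact averaging scheme (here, replacing every positional embedding with the mean positional embedding) and the choice of taking the first 50 query/key positions are assumptions on my part.

```python
# Minimal sketch (not the actual resid_viewer code) of a Left/Right comparison
# plot for one head's attention pattern, with and without averaged positional
# embeddings. Assumptions: GPT-2 small (12 layers x 12 heads = 144), and the
# 50x50 submatrix is the first 50 query/key positions of the prompt.
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "..."  # substitute a prompt from openwebtext-10k
tokens = model.to_tokens(prompt)
LAYER, HEAD = 0, 4  # e.g. L0H4

# Baseline attention pattern: [batch, head, query_pos, key_pos].
_, cache = model.run_with_cache(tokens)
baseline = cache["pattern", LAYER][0, HEAD, :50, :50]

def average_pos_embed(pos_embed, hook):
    # Replace each position's embedding with the mean over positions
    # (one possible reading of "positional embeddings are averaged").
    return pos_embed.mean(dim=-2, keepdim=True).expand_as(pos_embed).clone()

# Re-run with the intervention attached to the positional-embedding hook point.
model.add_hook("hook_pos_embed", average_pos_embed)
_, avg_cache = model.run_with_cache(tokens)
model.reset_hooks()
averaged = avg_cache["pattern", LAYER][0, HEAD, :50, :50]

# Side-by-side comparison, mirroring the Left/Right layout in the caption.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(baseline.detach().cpu().numpy(), cmap="viridis")
axes[0].set_title(f"L{LAYER}H{HEAD}: original pos embeds")
axes[1].imshow(averaged.detach().cpu().numpy(), cmap="viridis")
axes[1].set_title(f"L{LAYER}H{HEAD}: averaged pos embeds")
fig.savefig(f"L{LAYER}H{HEAD}.png")
```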
Thank you! I’m still surprised by how little most heads in L0 + L1 seem to be using the positional embeddings. L1H4 looks reasonably uniform, so I could accept that maybe that somehow feeds into L2H2.