Thank you! I’m still surprised how little most heads in L0 + L1 seem to be using the positional embeddings. L1H4 looks reasonably uniform, so I could accept that it somehow feeds into L2H2.