Donald Hobson comments on The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

Donald Hobson 1 Dec 2022 0:18 UTC
LW: 4 AF: 3
What does the network do if you use SVD editing to knock out every uninterpretable column? What if you knock out everything interpretable?
- beren 2 Dec 2022 15:58 UTC
  1 point
  This is an interesting question! At the end of the post and in the colab, we experiment with knocking out specific singular directions and show that doing so differentially affects tokens of roughly the same semantics. We find this to be quite a robust effect. However, actually changing the network's output can be surprisingly difficult, as there seems to be a large amount of redundancy, with similar processing happening in many layers/blocks simultaneously.
  Knocking out every interpretable/uninterpretable column is a cool idea, and we haven’t tried it. My suspicion is that this would simply do too much damage to the network and scramble things, but it might be worth a shot.
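
For concreteness, a minimal NumPy sketch of what knocking out a set of singular directions could look like. This is not the code from the post or the colab: the helper name `knock_out_directions`, the random stand-in matrix, and the choice of which indices count as "interpretable" are all hypothetical, and a real experiment would apply the edit to an actual transformer weight matrix and then measure the effect on the model's outputs.

```python
import numpy as np

def knock_out_directions(W, indices):
    """Return a copy of W with the listed singular directions removed.

    Hypothetical helper: zeroes the chosen singular values and
    reconstructs the matrix from the remaining ones.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_edited = S.copy()
    S_edited[list(indices)] = 0.0
    # Scaling the columns of U by the singular values is equivalent
    # to U @ diag(S_edited) @ Vt.
    return (U * S_edited) @ Vt

# Stand-in for a weight matrix (assumption: random data, not a real model).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))

# Assumption: directions 0 and 2 were hand-labelled as interpretable.
interpretable = [0, 2]
uninterpretable = [i for i in range(min(W.shape)) if i not in interpretable]

W_without_interpretable = knock_out_directions(W, interpretable)
W_without_uninterpretable = knock_out_directions(W, uninterpretable)
```

Because the two index sets partition the singular directions, the two edited matrices sum back to the original `W`, which is a quick sanity check that the knockout removed exactly the intended components.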