What does the network do if you use SVD editing to knock out every uninterpretable column? What if you knock out everything interpretable?
This is an interesting question! At the end of the post and in the colab, we experiment with knocking out specific singular directions and show that doing so differentially affects tokens that share roughly the same semantics. We find this effect to be quite robust, but actually changing the network's output can be surprisingly difficult, since there seems to be a large amount of redundancy, with similar processing happening in many layers/blocks simultaneously.
Knocking out every interpretable (or every uninterpretable) column is a cool idea, and we haven’t tried it. My suspicion is that this would do too much damage to the network and just scramble things, but it might be worth a shot.
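
For anyone who wants to try this, here is a minimal sketch of what "knocking out" singular directions means, assuming the edit is applied to a single weight matrix by zeroing the chosen singular values and reconstructing. The random matrix `W` is a toy stand-in for a real weight matrix, and `knock_out_directions` is a hypothetical helper name, not the code from the colab.

```python
import numpy as np


def knock_out_directions(W, indices):
    """Return a copy of W with the singular directions at `indices` removed.

    Hypothetical helper: zeroes the selected singular values and
    reconstructs the matrix from the edited decomposition.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_edited = S.copy()
    S_edited[list(indices)] = 0.0
    # Scale each column of U by its (edited) singular value, then project back.
    return (U * S_edited) @ Vt


# Toy stand-in for a weight matrix (assumption: not a real model weight).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))

# Remove the 1st and 3rd singular directions.
W_edited = knock_out_directions(W, [0, 2])
```

Knocking out every interpretable (or uninterpretable) column would amount to passing the full list of those indices for each matrix. Which indices count as interpretable would still have to be labelled by hand, and given the redundancy mentioned above, the edit would probably need to be applied across many layers at once to change the output.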