Filip Sondej comments on An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Filip Sondej 7 May 2024 12:02 UTC
What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)

This way v won’t have any dormant component, so the MLP output after patching cannot be driven through that dormant pathway either.
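A rough sketch of what I mean, using NumPy on synthetic data. Everything here is an assumption for illustration: the helper names `active_subspace` and `project_onto`, the `var_threshold` cutoff, and the toy activations (8 dimensions, of which only the first 3 ever vary) are all made up, and in practice `acts` would be real MLP activations collected over many inputs.

```python
import numpy as np

def active_subspace(acts, var_threshold=0.99):
    """Orthonormal basis (d, k) for the top principal directions of acts (n, d).

    Hypothetical helper: keeps the fewest components whose cumulative
    explained variance reaches var_threshold.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(explained, var_threshold)) + 1, len(s))
    return vt[:k].T

def project_onto(v, basis):
    """Project v onto span(basis) and renormalize to unit length."""
    p = basis @ (basis.T @ v)
    norm = np.linalg.norm(p)
    if norm < 1e-12:
        raise ValueError("v has no component in the given subspace")
    return p / norm

# Toy stand-in for MLP activations: only the first 3 of 8 dimensions vary,
# so the remaining 5 play the role of dormant directions.
rng = np.random.default_rng(0)
acts = np.zeros((1000, 8))
acts[:, :3] = rng.normal(size=(1000, 3))

basis = active_subspace(acts)

# An unconstrained candidate direction, then its constrained version.
v = rng.normal(size=8)
v_constrained = project_onto(v, basis)

# Size of the part of v_constrained lying in the dormant dimensions.
dormant_part = np.linalg.norm(v_constrained[3:])
```

In a real setup the projection would have to be reapplied after every optimizer step on v (or v parameterized directly as `basis @ w`), otherwise gradient descent can drift back into the dormant directions. Also note that PCA centers the data, so a direction that is constantly nonzero but never varies counts as unused here.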