Neel Nanda comments on An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Neel Nanda 26 Jan 2024 9:43 UTC
LW: 3 AF: 2
The illusion is most concerning when learning arbitrary directions in activation space, not when iterating over individual neurons or SAE features. I don’t have strong takes on whether the illusion is more likely with neurons than with SAEs if you’re e.g. iterating over sparse subsets. In some sense, you’re more likely to get a dormant feature and a disconnected feature in your SAE than among neurons, since SAE features are more meaningful?
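
To make the dormant/disconnected distinction concrete, here is a minimal toy sketch in NumPy. Everything in it is an assumption made for illustration, not something taken from the post or any real model: a 2-dimensional hidden space, a hypothetical `hidden` function, a hypothetical readout `w_out`, and a hypothetical helper `patch_along`. The first axis plays the role of a dormant direction (the output reads from it, but it never varies across inputs) and the second a disconnected direction (it tracks the input, but the output ignores it).

```python
import numpy as np

def patch_along(h_base, h_src, v):
    """Replace the component of h_base along v with the component of h_src along v."""
    v = v / np.linalg.norm(v)
    return h_base + (h_src @ v - h_base @ v) * v

def hidden(x):
    # Toy activations (assumed): axis 0 is always zero, axis 1 copies the input.
    return np.array([0.0, x])

# Toy readout (assumed): the output only reads from axis 0.
w_out = np.array([1.0, 0.0])

h_base, h_src = hidden(-1.0), hidden(+1.0)

directions = {
    "dormant": np.array([1.0, 0.0]),       # causally connected, but never active
    "disconnected": np.array([0.0, 1.0]),  # active, but not read by the output
    "sum": np.array([1.0, 1.0]),           # an "arbitrary" direction mixing the two
}

# Change in output caused by patching along each direction.
effect = {
    name: float(w_out @ patch_along(h_base, h_src, d) - w_out @ h_base)
    for name, d in directions.items()
}

for name, delta in effect.items():
    print(f"{name}: change in output = {delta:.3f}")
```

In this toy setup, patching along either basis-aligned direction leaves the output unchanged, while patching along their sum shifts the output by about 1, even though the model never uses that direction on real inputs. Iterating over a fixed basis only hits the illusion if a single basis element happens to mix a dormant and a disconnected component, whereas a freely learned direction can combine them at will.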