If SAE features are the correct units of analysis (or at least more so than neurons), should we expect that patching in the feature basis is less susceptible to the interpretability illusion than in the neuron basis?
The illusion is most concerning when learning arbitrary directions in activation space, not when iterating over individual neurons or SAE features. I don't have strong takes on whether the illusion is more likely with neurons than with SAE features if you're e.g. iterating over sparse subsets; in some sense it may be more likely that you get both a dormant and a disconnected feature in your SAE than among neurons, since SAE features are more meaningful.
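To make the distinction concrete, here is a minimal sketch (assuming a hypothetical `sae` object with `encode`/`decode` methods, not any specific library's API) contrasting the two settings: patching a single SAE feature by index, versus patching along an arbitrary learned direction, which is where a dormant component plus a disconnected component can combine into an illusory causal direction.

```python
import torch

def patch_single_sae_feature(resid_clean, resid_corrupt, sae, feature_idx):
    """Patch one SAE feature: take the clean activation, overwrite the chosen
    feature's value with its value from the corrupted run, and decode back.

    resid_clean, resid_corrupt: [d_model] residual-stream activations
    sae: assumed interface with .encode / .decode
    """
    feats_clean = sae.encode(resid_clean)      # [d_sae] sparse feature activations
    feats_corrupt = sae.encode(resid_corrupt)
    feats_patched = feats_clean.clone()
    feats_patched[feature_idx] = feats_corrupt[feature_idx]
    # Add back the SAE reconstruction error so only the chosen feature changes
    error = resid_clean - sae.decode(feats_clean)
    return sae.decode(feats_patched) + error


def patch_learned_direction(resid_clean, resid_corrupt, direction):
    """Patch along an arbitrary (e.g. learned) direction: swap the component
    of the clean activation along `direction` for the corrupted one.
    This is the setting where the interpretability illusion is most concerning.
    """
    direction = direction / direction.norm()
    coeff_clean = resid_clean @ direction
    coeff_corrupt = resid_corrupt @ direction
    return resid_clean + (coeff_corrupt - coeff_clean) * direction
```

The first function only ever intervenes on one pre-specified, (hopefully) interpretable unit; the second intervenes on whatever direction the optimisation found, which need not correspond to anything the model actually uses.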