I think @wesg’s recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.
Also see @jake_mendel’s great comment for an intuitive explanation of why (probably) this is the case.
I think @wesg’s recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.
Also see @jake_mendel’s great comment for an intuitive explanation of why (probably) this is the case.