Andy Arditi comments on Mechanistically Eliciting Latent Behaviors in Language Models

Andy Arditi 3 May 2024 15:13 UTC
3 points
I think @wesg’s recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions along which intervening on activations significantly changes downstream model behavior, whereas intervening along most randomly sampled directions does not.
Also see @jake_mendel’s great comment for an intuitive explanation of why this is (probably) the case.
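
To make the "particular vs. random directions" point concrete, here is a minimal toy sketch in NumPy. It is not the experiment from @wesg’s post and does not use a real language model: `downstream`, the dimension `d`, and the planted singular values are all hypothetical, chosen so that the downstream map is sensitive to only a handful of input directions. It then compares the effect of a fixed-norm perturbation along one of those directions with the effect along randomly sampled directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hypothetical activation dimension

# Toy stand-in for "the rest of the model" (an assumption for illustration):
# a map whose sensitivity is concentrated in a few input directions.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
singular_values = np.full(d, 0.1)
singular_values[:4] = 10.0  # four planted "sensitive" directions
W = U @ np.diag(singular_values) @ V.T


def downstream(x):
    """Hypothetical downstream computation applied to an activation vector."""
    return np.tanh(W @ x)


def effect_of_perturbation(x, direction, eps=1.0):
    """Norm of the output change when x is moved eps along a unit direction."""
    unit = direction / np.linalg.norm(direction)
    return np.linalg.norm(downstream(x + eps * unit) - downstream(x))


x = 0.1 * rng.standard_normal(d)  # a baseline "activation"

special_direction = V[:, 0]  # one of the planted sensitive directions
special_effect = effect_of_perturbation(x, special_direction)

# Same perturbation norm, but along randomly sampled directions.
random_effects = np.array(
    [effect_of_perturbation(x, rng.standard_normal(d)) for _ in range(200)]
)

print("effect along special direction:", special_effect)
print("median effect along random directions:", np.median(random_effects))
print("max effect along random directions:", random_effects.max())
```

In this toy setup a random unit vector in 512 dimensions has only a small projection onto the four sensitive directions, so its effect is much smaller than that of the special direction. Whether real models behave this way for the same reason is exactly what the linked post and comment discuss.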