An Interpretability Illusion from Population Statistics in Causal Analysis
This is an informal note on an interpretability illusion I’ve personally encountered twice, in two different settings.
Causal Analysis
Let’s say we have a function f (e.g. our model), an intervention I (e.g. causally ablating one component), a downstream metric M (e.g. the logit difference), and a dataset D (e.g. the IOI dataset).
We have a hypothesis that the intervention I will affect the metric in some predictable way; e.g. that ablating the component causes the metric to decrease.
A standard practice in interpretability is to do causal analysis. Concretely, we compute m_clean = M(f; D) and m_ablate = M(I(f); D). Oversimplifying a bit, if m_clean > m_ablate, then we conclude that the hypothesis was true.
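To make the setup concrete, here is a minimal Python sketch of this population-level causal analysis; the helper names and the decision rule are illustrative stand-ins, not code from any particular library.

```python
from typing import Callable, Sequence

def population_metric(f: Callable, metric: Callable, dataset: Sequence) -> float:
    """M(f; D): average a per-example metric (e.g. logit difference) over the dataset."""
    return sum(metric(f, x) for x in dataset) / len(dataset)

def causal_analysis(f: Callable, intervene: Callable, metric: Callable, dataset: Sequence):
    """Compare m_clean = M(f; D) against m_ablate = M(I(f); D)."""
    m_clean = population_metric(f, metric, dataset)
    m_ablate = population_metric(intervene(f), metric, dataset)
    # Oversimplified decision rule from the text: the hypothesis is "supported"
    # if ablating the component lowers the population-level metric.
    return m_clean, m_ablate, m_clean > m_ablate
```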
Dataset Choice Admits Illusions
Consider an alternative dataset D_alt where m_clean ≈ m_ablate, i.e. the intervention has essentially no effect on this data.
Now construct a new dataset D′ = D ∪ D_alt. If D and D_alt are the same size and M averages over examples, then m′ = 0.5·m + 0.5·m_alt, where m and m_alt are the metric values on D and D_alt respectively (for both the clean and ablated runs).
If we ran the same causal analysis on D′, we’d still find m′_clean > m′_ablate, since the gap from D survives the averaging; so we’d again conclude that the hypothesis was true, despite it being untrue for D_alt.
Therefore, even if the hypothesis seems true at the population level, it may actually hold only for a slice of the data that we test the model on.
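A toy numerical example makes this concrete. The numbers below are made up purely for illustration: ablation has a real effect on D and no effect on D_alt, yet the pooled comparison on D′ still comes out in favour of the hypothesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-example metric values (e.g. logit differences), purely illustrative.
# On D the ablation removes ~2.0 of metric; on D_alt it does nothing.
d_clean    = rng.normal(3.0, 0.5, size=500)
d_ablate   = d_clean - 2.0
alt_clean  = rng.normal(3.0, 0.5, size=500)
alt_ablate = alt_clean.copy()          # intervention has no effect on D_alt

# Pooled dataset D' = D ∪ D_alt: the population-level comparison still "supports"
# the hypothesis, even though it is false for half of the data.
m_clean_pooled  = np.concatenate([d_clean, alt_clean]).mean()
m_ablate_pooled = np.concatenate([d_ablate, alt_ablate]).mean()
print(m_clean_pooled > m_ablate_pooled)        # True

# Slicing by sub-dataset exposes the illusion.
print(d_clean.mean() - d_ablate.mean())        # ~2.0: real effect on D
print(alt_clean.mean() - alt_ablate.mean())    # 0.0: no effect on D_alt
```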
Case Studies
Attention-out SAE feature in IOI
In recent work with Jacob Drori, we found evidence for an SAE feature that seemed quite important for the IOI task.
Causal analysis of this feature on all IOI data supports the hypothesis that it’s involved.
However, looking at the activation patterns reveals that the feature is only involved in the BABA variant of the prompts.
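A cheap complementary check (not the activation-pattern analysis above, just a sketch of the general idea) is to re-run the same causal comparison separately on each slice of the data, e.g. each IOI prompt template:

```python
from typing import Callable, Mapping, Sequence

def sliced_causal_analysis(f: Callable, intervene: Callable, metric: Callable,
                           slices: Mapping[str, Sequence]) -> dict:
    """Run the clean-vs-ablate comparison separately per labelled slice
    (e.g. {"ABBA": abba_prompts, "BABA": baba_prompts} for IOI)."""
    results = {}
    for name, subset in slices.items():
        m_clean = sum(metric(f, x) for x in subset) / len(subset)
        m_ablate = sum(metric(intervene(f), x) for x in subset) / len(subset)
        results[name] = {"clean": m_clean, "ablate": m_ablate, "effect": m_clean - m_ablate}
    return results
```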
Steering Vectors
Previous work on steering vectors uses the average probability of the model giving the correct answer as a measure of how well the steering vector works.
In recent work on steering vectors, we found that the population average obscures a lot of sample-level variance.
As an example, here’s a comparison of population-level steering and sample-level steering on the believes-in-gun-rights dataset (from Model-written Evals). While the population-level statistics show a smooth increase, the sample-level statistics tell a more interesting story: different examples steer differently, and in particular there seems to be a significant fraction of examples where steering actually works in the opposite direction to how we’d like.
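Here is a minimal sketch of the kind of sample-level view that reveals this, using made-up per-example probabilities rather than the actual experimental data: the average goes up smoothly, while a nontrivial fraction of individual examples move in the wrong direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example P(correct answer) before and after steering;
# purely illustrative numbers, not results from the work referenced above.
p_before = rng.uniform(0.2, 0.8, size=200)
per_sample_effect = rng.normal(0.10, 0.15, size=200)   # heterogeneous effect
p_after = np.clip(p_before + per_sample_effect, 0.0, 1.0)

# Population-level view: a healthy-looking average increase.
print("mean change:", (p_after - p_before).mean())

# Sample-level view: a significant fraction of examples are steered the wrong way.
print("fraction steered backwards:", ((p_after - p_before) < 0).mean())
```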
Conclusion
There is an extensive and growing literature on interpretability illusions, but I don’t think I’ve heard other people talk about this particular one before. It’s also quite plausible that some previous mech interp work needs to be re-evaluated in light of this illusion.