An Interpretability Illusion from Population Statistics in Causal Analysis
This is an informal note on an interpretability illusion I’ve personally encountered twice, in two different settings.
Causal Analysis
Let’s say we have a function f (e.g. our model), an intervention I (e.g. causally ablating one component), a downstream metric M (e.g. the logit difference), and a dataset D (e.g. the IOI dataset).
We have a hypothesis that the intervention I will affect the metric in some predictable way; e.g. that ablating the component causes the metric to decrease.
A standard practice in interpretability is to do causal analysis. Concretely, we compute m_clean = M(f; D) and m_ablate = M(I(f); D). Oversimplifying a bit, if m_clean > m_ablate, then we conclude that the hypothesis was true.
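To make the setup concrete, here is a minimal Python sketch of this population-level causal analysis; the helper names and the decision rule are illustrative stand-ins, not code from any particular library.

```python
from typing import Callable, Sequence

def population_metric(f: Callable, metric: Callable, dataset: Sequence) -> float:
    """M(f; D): average a per-example metric (e.g. logit difference) over the dataset."""
    return sum(metric(f, x) for x in dataset) / len(dataset)

def causal_analysis(f: Callable, intervene: Callable, metric: Callable, dataset: Sequence):
    """Compare m_clean = M(f; D) against m_ablate = M(I(f); D)."""
    m_clean = population_metric(f, metric, dataset)
    m_ablate = population_metric(intervene(f), metric, dataset)
    # Oversimplified decision rule from the text: the hypothesis is "supported"
    # if ablating the component lowers the population-level metric.
    return m_clean, m_ablate, m_clean > m_ablate
```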
Dataset Choice Admits Illusions
Consider an alternative dataset D_alt where m_clean ≈ m_ablate, i.e. the intervention has essentially no effect on this data.
Now construct a new dataset D′ = D ∪ D_alt. If D and D_alt are the same size and M averages over examples, then m′ = 0.5·m + 0.5·m_alt, where m and m_alt are the metric values on D and D_alt respectively (for both the clean and ablated runs).
If we ran the same causal analysis on D′, we’d still find m′_clean > m′_ablate, since the gap from D survives the averaging; so we’d again conclude that the hypothesis was true, despite it being untrue for D_alt.
Therefore, even if the hypothesis seems true at the population level, it may actually hold only for a slice of the data that we test the model on.
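A toy numerical example makes this concrete. The numbers below are made up purely for illustration: ablation has a real effect on D and no effect on D_alt, yet the pooled comparison on D′ still comes out in favour of the hypothesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-example metric values (e.g. logit differences), purely illustrative.
# On D the ablation removes ~2.0 of metric; on D_alt it does nothing.
d_clean    = rng.normal(3.0, 0.5, size=500)
d_ablate   = d_clean - 2.0
alt_clean  = rng.normal(3.0, 0.5, size=500)
alt_ablate = alt_clean.copy()          # intervention has no effect on D_alt

# Pooled dataset D' = D ∪ D_alt: the population-level comparison still "supports"
# the hypothesis, even though it is false for half of the data.
m_clean_pooled  = np.concatenate([d_clean, alt_clean]).mean()
m_ablate_pooled = np.concatenate([d_ablate, alt_ablate]).mean()
print(m_clean_pooled > m_ablate_pooled)        # True

# Slicing by sub-dataset exposes the illusion.
print(d_clean.mean() - d_ablate.mean())        # ~2.0: real effect on D
print(alt_clean.mean() - alt_ablate.mean())    # 0.0: no effect on D_alt
```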
Case Studies
Attention-out SAE feature in IOI
In recent work with Jacob Drori, we found evidence for an SAE feature that seemed quite important for the IOI task.
Causal analysis of this feature on all IOI data supports the hypothesis that it’s involved.
However, looking at the activation patterns reveals that the feature is only involved in the BABA variant of the prompts.
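A cheap complementary check (not the activation-pattern analysis above, just a sketch of the general idea) is to re-run the same causal comparison separately on each slice of the data, e.g. each IOI prompt template:

```python
from typing import Callable, Mapping, Sequence

def sliced_causal_analysis(f: Callable, intervene: Callable, metric: Callable,
                           slices: Mapping[str, Sequence]) -> dict:
    """Run the clean-vs-ablate comparison separately per labelled slice
    (e.g. {"ABBA": abba_prompts, "BABA": baba_prompts} for IOI)."""
    results = {}
    for name, subset in slices.items():
        m_clean = sum(metric(f, x) for x in subset) / len(subset)
        m_ablate = sum(metric(intervene(f), x) for x in subset) / len(subset)
        results[name] = {"clean": m_clean, "ablate": m_ablate, "effect": m_clean - m_ablate}
    return results
```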
Steering Vectors
Previous work on steering vectors uses the average probability of the model giving the correct answer as a measure of how well the steering vector works.
In recent work on steering vectors, we found that the population average obscures a lot of sample-level variance.
As an example, here’s a comparison of population-level steering and sample-level steering on the believes-in-gun-rights dataset (from Model-written Evals). While the population-level statistics show a smooth increase, the sample-level statistics tell a more interesting story: different examples steer differently, and in particular there seems to be a significant fraction of examples where steering actually works in the opposite direction to how we’d like.
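Here is a minimal sketch of the kind of sample-level view that reveals this, using made-up per-example probabilities rather than the actual experimental data: the average goes up smoothly, while a nontrivial fraction of individual examples move in the wrong direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example P(correct answer) before and after steering;
# purely illustrative numbers, not results from the work referenced above.
p_before = rng.uniform(0.2, 0.8, size=200)
per_sample_effect = rng.normal(0.10, 0.15, size=200)   # heterogeneous effect
p_after = np.clip(p_before + per_sample_effect, 0.0, 1.0)

# Population-level view: a healthy-looking average increase.
print("mean change:", (p_after - p_before).mean())

# Sample-level view: a significant fraction of examples are steered the wrong way.
print("fraction steered backwards:", ((p_after - p_before) < 0).mean())
```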
Conclusion
There is an extensive and growing literature on interpretability illusions, but I don’t think I’ve heard other people talk about this particular one before. It’s also quite plausible that some previous mech interp work needs to be re-evaluated in light of this illusion.