An Interpretability Illusion from Population Statistics in Causal Analysis

This is an informal note on an interpretability illusion I’ve personally encountered twice, in two different settings.

Causal Analysis

  • Let’s say we have a function $f$ (e.g. our model), an intervention $f \mapsto f'$ (e.g. causally ablating one component), a downstream metric $m$ (e.g. the logit difference), and a dataset $D$ (e.g. the IOI dataset).

  • We have a hypothesis that the intervention will affect the metric in some predictable way; e.g. that it causes $m$ to increase.

  • A standard practice in interpretability is to do causal analysis. Concretely, we compute $\mathbb{E}_{x \sim D}[m(f(x))]$ and $\mathbb{E}_{x \sim D}[m(f'(x))]$. Oversimplifying a bit, if $\mathbb{E}_{x \sim D}[m(f'(x))] > \mathbb{E}_{x \sim D}[m(f(x))]$, then we conclude that the hypothesis was true.
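In code, the comparison above might look like the following minimal sketch. The `causal_effect` helper and the toy model, intervention, and metric are all hypothetical stand-ins for illustration, not anything from a real analysis:

```python
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def causal_effect(model, intervened_model, metric, dataset):
    # E_{x~D}[m(f'(x))] - E_{x~D}[m(f(x))]; positive supports the hypothesis
    base = mean(metric(model(x)) for x in dataset)
    ablated = mean(metric(intervened_model(x)) for x in dataset)
    return ablated - base

# Toy stand-ins: the "model" returns a logit, the intervention shifts it up.
model = lambda x: float(x)
intervened = lambda x: float(x) + 1.0
metric = lambda logit: logit

dataset = [0.0, 1.0, 2.0]
effect = causal_effect(model, intervened, metric, dataset)
print(effect)  # 1.0 > 0, so the population-level analysis "confirms" the hypothesis
```

Note that the entire conclusion rests on a single population-level number; nothing here looks at individual samples.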

Dataset Choice Admits Illusions

  • Consider an alternative dataset $D'$ on which the hypothesis is false, i.e. $\mathbb{E}_{x \sim D'}[m(f'(x))] \le \mathbb{E}_{x \sim D'}[m(f(x))]$.

  • Consider constructing a new dataset $D'' = D \cup D'$. Note that if the increase on $D$ outweighs the decrease on $D'$, we still have $\mathbb{E}_{x \sim D''}[m(f'(x))] > \mathbb{E}_{x \sim D''}[m(f(x))]$.

  • If we ran the same causal analysis as before on $D''$, we’d also conclude that the hypothesis was true, despite it being untrue for $D'$.

Therefore even if the hypothesis seems true at the population level, it may actually hold only for a slice of the data that we test the model on.
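The argument above can be sketched numerically. All the per-sample effects below are invented solely to illustrate how averaging over the union $D''$ can mask the failure on $D'$:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-sample intervention effects m(f'(x)) - m(f(x)).
effects_D  = [2.0, 3.0, 2.5]       # hypothesis holds on D (all positive)
effects_Dp = [-0.5, -0.4, -0.6]    # hypothesis fails on D' (all negative)

effects_union = effects_D + effects_Dp  # D'' = D ∪ D'
print(mean(effects_union))  # 1.0: still positive, hiding the failure on D'
```

The population-level statistic on $D''$ is positive even though the intervention moves the metric the wrong way on every sample in $D'$.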

Case Studies

Attention-out SAE feature in IOI

  • In recent work with Jacob Drori, we found evidence for an SAE feature that seemed quite important for the IOI task.

  • Causal analysis of this feature on all IOI data supports the hypothesis that it’s involved.

  • However, looking at the activation patterns reveals that the feature was only involved in the BABA variant of the task.
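One way to catch this failure mode is to break the causal analysis down by template variant rather than pooling everything. A minimal sketch, where the `effect_by_group` helper and all the per-sample numbers are hypothetical:

```python
from collections import defaultdict

def effect_by_group(samples):
    """Average per-sample intervention effect within each group."""
    groups = defaultdict(list)
    for variant, delta in samples:
        groups[variant].append(delta)
    return {v: sum(ds) / len(ds) for v, ds in groups.items()}

# Hypothetical per-sample ablation effects, labeled by IOI template variant.
samples = [("BABA", 1.8), ("BABA", 2.1), ("ABBA", 0.05), ("ABBA", -0.02)]

per_group = effect_by_group(samples)
pooled = sum(d for _, d in samples) / len(samples)
print(pooled, per_group)  # the pooled effect is positive, but only BABA drives it
```

Here the pooled statistic looks like a task-wide effect, while the per-group breakdown shows the ABBA effect is roughly zero.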

Steering Vectors

As an example, here’s a comparison of population-level steering and sample-level steering on the believes-in-gun-rights dataset (from Model-written Evals).

While the population-level statistics show a smooth increase, the sample-level statistics tell a more interesting story: different examples steer differently, and in particular there seems to be a significant fraction of examples where steering actually has the opposite of the intended effect.
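The population-level vs sample-level contrast can be sketched as follows; the per-sample steering deltas are invented for illustration, not taken from the believes-in-gun-rights results:

```python
# Hypothetical per-sample changes in the metric under steering.
deltas = [0.9, 1.2, 0.7, -0.6, 1.1, -0.8, 1.0, 0.8]

population_effect = sum(deltas) / len(deltas)
backfire_fraction = sum(1 for d in deltas if d < 0) / len(deltas)

print(population_effect, backfire_fraction)
```

The population mean is comfortably positive, yet a quarter of the (made-up) samples move in the opposite direction, which the single averaged number cannot reveal.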

Conclusion

There is an extensive and growing literature on interpretability illusions, but I don’t think I’ve heard other people talk about this particular one before. It’s also quite plausible that some previous mech interp work needs to be re-evaluated in light of this illusion.