This is a great, thought-provoking critique of SAEs.
That said, I think SAEs make more sense if we’re trying to explain an LLM (or any generative model of messy real-world data) than they do if we’re trying to explain the animal-drawing NN.
In the animal-drawing example:
There’s only one thing the NN does.
It’s always doing that thing, for every input.
The thing is simple enough that, at a particular point in the NN, you can write out all the variables the NN cares about in a fully compositional code and still use fewer coordinates (50) than the dictionary size of any reasonable SAE.
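For concreteness, a toy sketch of what a fully compositional code like that looks like (the attribute names and the 50-dim figure are placeholders, not the post's actual variables):

```python
import numpy as np

# Toy sketch of a fully compositional code for the animal-drawing NN.
# Attribute names and the 50-dim size are placeholders for illustration.
attributes = ["size", "furriness", "leg_length", "ear_pointiness", "snout_length"]  # ... up to 50
code = np.zeros(50)
code[: len(attributes)] = [0.7, 0.2, 0.5, 0.9, 0.3]
# Note: nothing here is sparse. Every coordinate means something on every input,
# because the NN is always doing this one task.
```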
With something like an LLM, we expect the situation to be more like:
The NN can do a huge number of “things” or “tasks.” (Equivalently, it can model many different parts of the data manifold with different structures.)
For any given input, it’s only doing roughly one of these “tasks.”
If you try to write out a fully compositional code for each task—akin to the size / furriness / etc. code, but with a separate one for every task—and then take the Cartesian product of them all to get a giant compositional code for everything at once, this code would have a vast number of coordinates: far more than the dimension of the activation vectors we’d be explaining with an SAE, and also far more than that SAE’s dictionary size.
The aforementioned code would also be super wasteful, because it uses most of its capacity expressing states where multiple tasks compose in an impossible or nonsensical fashion. (Like “The height of the animal currently being drawn is X, AND the current Latin sentence is in the subjunctive mood, AND we are partway through a Rust match expression, AND the author of this op-ed is very right-wing.”)
The NN doesn’t have enough coordinates to express this Cartesian product code, but it also doesn’t need to do so, because the code is wasteful. Instead, it expresses things in a way that’s less-than-fully-compositional (“superposed”) across tasks, no matter how compositional it is within tasks.
Even if every task is represented in a maximally compositional way, the per-task coordinates are still sparse, because we’re only doing ~1 task at once and there are many tasks. The compositional nature of the per-task features doesn’t prohibit them from being sparse, because tasks are sparse.
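To make the counting concrete, here's a back-of-the-envelope sketch (every number below is a made-up assumption, just to show the shape of the argument):

```python
# Back-of-the-envelope counting; every number here is a made-up assumption.
n_tasks = 2_000          # distinct "tasks" / regions of the data manifold
dims_per_task = 50       # a fully compositional per-task code, like the animal example
d_model = 4_096          # width of the activation vector we'd be explaining
sae_dict = 65_536        # a plausible SAE dictionary size

cartesian_code = n_tasks * dims_per_task   # 100,000 coordinates in the giant product code
active_per_input = 1 * dims_per_task       # ~1 task at a time -> only ~50 coordinates in use

print(cartesian_code > d_model, cartesian_code > sae_dict)  # True True: the product code doesn't fit
print(active_per_input / cartesian_code)                    # 0.0005: extreme sparsity across tasks
```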
The reason we’re turning to SAEs is that the NN doesn’t have enough capacity to write out the giant Cartesian product code, so instead it leverages the fact that tasks are sparse, and “re-uses” the same activation coordinates to express different things in different task-contexts.
If this weren’t the case, interpretability would be much simpler: we’d just hunt for a transformation that extracts the Cartesian product code from the NN activations, and then we’re done.
If it existed, this transformation would probably (?) be linear, b/c the information needs to be linearly retrievable within the NN; something in the animal-painter that cares about height needs to be able to look at the height variable, and ideally to do so without wasting a nonlinearity on reconstructing it.
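As a toy illustration of "linearly retrievable" (purely synthetic data, nothing measured from a real model): if a variable like height is embedded along a fixed direction in the activations, an ordinary least-squares probe pulls it back out.

```python
import numpy as np

# Synthetic activations that embed a scalar variable (say, height) linearly,
# alongside a handful of other linearly-embedded variables plus a little noise.
rng = np.random.default_rng(0)
d_model, n_samples = 512, 10_000
height = rng.uniform(0, 1, size=n_samples)
other = rng.uniform(0, 1, size=(n_samples, 7))
directions = rng.normal(size=(d_model, 8))            # one direction per variable
acts = np.column_stack([height, other]) @ directions.T + 0.01 * rng.normal(size=(n_samples, d_model))

# A linear probe (least squares) recovers the variable almost exactly.
w, *_ = np.linalg.lstsq(acts, height, rcond=None)
print(np.corrcoef(acts @ w, height)[0, 1])            # ~1.0
```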
Our goal in using the SAE is not to explain everything in a maximally sparse way; it’s to factor the problem into (sparse tasks) x (possibly dense within-task codes).
Why might that happen in practice? If we fit an SAE to the NN activations on the full data distribution, covering all the tasks, then there are two competing pressures:
On the one hand, the sparsity loss term discourages the SAE from representing any given task in a compositional way, even if the NN does so. All else being equal, this is indeed bad.
On the other hand, the finite dictionary size discourages the SAE from expanding the number of coordinates per task indefinitely, since all the other tasks have to fit somewhere too.
In other words, if your animal-drawing case is one of the many tasks, and the SAE is choosing whether to represent it as 50 features that all fire together or 1000 one-hot highly-specific-animal features, it may prefer the former because it doesn’t have room in its dictionary to give every task 1000 features.
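To make those two pressures concrete, here's a minimal sketch of the usual vanilla SAE objective (ReLU encoder plus L1 penalty; the dimensions and penalty weight are placeholders):

```python
import numpy as np

# Minimal sketch of a vanilla SAE objective (ReLU encoder, L1 sparsity penalty).
# Dimensions and the penalty weight lam are placeholder assumptions.
d_model, d_dict, lam = 512, 8_192, 1e-3
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.01, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.01, size=(d_model, d_dict))
b_dec = np.zeros(d_model)

def sae_loss(x):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # feature activations
    x_hat = W_dec @ f + b_dec                # reconstruction
    recon = np.sum((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(f))       # pressure 1: fire as few features as possible
    return recon + sparsity                  # pressure 2 is the fixed d_dict: every task's
                                             # reconstruction has to share these 8,192 features

print(sae_loss(rng.normal(size=d_model)))
```

The L1 term is what pushes toward one-hot-style features within a task; the fixed dictionary size is what pushes back once there are many tasks to cover.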
This tension only appears when there are multiple tasks. If you just have one compositionally-represented task and a big dictionary, the SAE does behave pathologically as you describe.
But this case is different from the ones that motivate SAEs: there isn’t actually any sparsity in the underlying problem at all!
Whereas with LLMs, we can be pretty sure (I would think?) that there’s extreme sparsity in the underlying problem, due to dimension-counting arguments, intuitions about the number of “tasks” in natural data and their level of overlap, observed behaviors where LLMs represent things that are irrelevant to the vast majority of inputs (like retrieving very obscure facts), etc.