Toy example of what I would consider pretty clear-cut cross-layer superposition:
We have a residual MLP network. The network implements a single UAND gate (universal AND, calculating the d²/2 pairwise ANDs of d sparse boolean input features using only d neurons), as described in Section 3 here.
However, instead of implementing this with a single MLP, the network does it using the MLPs of all the layers in combination. A simple construction that achieves this (sketched in code after the steps below):
- Cut the residual stream into two subspaces, reserving one subspace for the input features and one subspace for the d²/2 output features.
- Take the construction from the paper and assign each neuron in it to a random MLP layer in the residual network.
- Since the input and output spaces are orthogonal, there’s no possibility of one MLP’s outputs interfering with another MLP’s inputs. So this network will implement UAND, as if all the neurons lived in a single large MLP layer.
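Here’s a minimal numpy sketch of the construction (hypothetical code, not taken from the paper). For clarity it uses one ReLU neuron per pairwise AND, ReLU(x_i + x_j - 1), rather than the compressed ~d-neuron UAND from Section 3; the point being illustrated is the orthogonal input/output subspaces and the random assignment of neurons to layers, not the compression itself.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

d = 8                                    # number of boolean input features
pairs = list(combinations(range(d), 2))  # the d*(d-1)/2 ≈ d²/2 pairwise ANDs
d_out = len(pairs)
d_resid = d + d_out                      # residual stream = input subspace ⊕ output subspace
n_layers = 4

# One ReLU neuron per pair: AND(x_i, x_j) = ReLU(x_i + x_j - 1) for boolean inputs.
# Assign each neuron to a random MLP layer in the residual stack.
layer_of_neuron = rng.integers(0, n_layers, size=d_out)

# Per-layer weights: W_in reads only from the input subspace, W_out writes only
# to the output subspace, so no MLP's output can feed into another MLP's input.
W_in = np.zeros((n_layers, d_out, d_resid))
W_out = np.zeros((n_layers, d_resid, d_out))
bias = np.full(d_out, -1.0)
for k, (i, j) in enumerate(pairs):
    layer = layer_of_neuron[k]
    W_in[layer, k, i] = 1.0
    W_in[layer, k, j] = 1.0
    W_out[layer, d + k, k] = 1.0

def run(x_bool, n_layers_to_run=n_layers):
    """Run the residual MLP stack; return the output subspace of the residual stream."""
    resid = np.zeros(d_resid)
    resid[:d] = x_bool
    for layer in range(n_layers_to_run):
        pre = W_in[layer] @ resid + bias                 # neurons not in this layer stay below threshold
        resid = resid + W_out[layer] @ np.maximum(pre, 0.0)
    return resid[d:]

x = rng.integers(0, 2, size=d).astype(float)             # a boolean input vector
expected = np.array([x[i] * x[j] for i, j in pairs])     # ground-truth pairwise ANDs
assert np.allclose(run(x), expected)                     # the full stack computes UAND exactly
```

Exactly the same weights, concatenated into one layer, would give the single-MLP version; distributing them across layers changes nothing, because the writes and reads never overlap.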
Now we’ve made a network that computes boolean circuits in superposition, without the boolean gates living in any particular MLP. To read out the value of one of the circuit outputs before it shows up in the residual stream, you’ll need to look at a direction that’s a linear combination of neurons in all of the MLPs. And if you use an SAE to look at a single residual stream position in this network before the very final MLP layer, it’ll probably show you a bunch of half-computed nonsense.
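To make the “half-computed” point concrete with the sketch above (still hypothetical, and weaker than in the paper’s compressed construction, where each readout direction mixes neurons from every layer): if you stop partway through the stack, the output subspace only contains the ANDs whose neurons sit in the layers already run, and everything else is still zero.

```python
# Continuing the sketch above: inspect the output subspace partway through the stack.
partial = run(x, n_layers_to_run=2)
done = layer_of_neuron < 2                           # neurons that have already fired
assert np.allclose(partial[done], expected[done])    # these ANDs are finished
assert np.allclose(partial[~done], 0.0)              # the rest haven't been written anywhere yet
```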
In a real network, the most convincing evidence to me would be a circuit involving sparse-coded variables or operations that cannot be localized to any single MLP.