Nice summary! One small nitpick:
> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features
This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can “rewrite” our model into an equivalent form that better reflects the computation it’s performing. For example, if we claim that a certain direction in an MLP’s output is important, we could rewrite the single MLP node as the sum of the MLP output in the direction + the residual term. Then, we could make claims about the direction we pointed out and also claim that the residual term is unimportant.
The important point is that we are allowed to rewrite our model however we want as long as the rewrite is equivalent.
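To make the rewrite above concrete, here is a minimal sketch of splitting one MLP node into "output along a direction" plus "residual term". Everything in it is an assumption made for illustration: `toy_mlp`, `split_along_direction`, the random weights, and the chosen `direction` are hypothetical and not taken from any real model or from Redwood's code.

```python
import math
import random

def toy_mlp(x, w_in, w_out):
    """Hypothetical one-hidden-layer ReLU MLP on plain Python lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def split_along_direction(out, direction):
    """Split `out` into its component along `direction` plus the residual term."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(o * u for o, u in zip(out, unit))
    along = [coeff * u for u in unit]
    residual = [o - a for o, a in zip(out, along)]
    return along, residual

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 8, 3
w_in = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
w_out = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
direction = [1.0, -2.0, 0.5]  # hypothetical "important" direction in output space

x = [rng.gauss(0, 1) for _ in range(d_in)]
out = toy_mlp(x, w_in, w_out)
along, residual = split_along_direction(out, direction)

# The rewrite is equivalent: the two new nodes sum back to the original output.
recombined = [a + r for a, r in zip(along, residual)]
assert all(math.isclose(o, c, abs_tol=1e-9) for o, c in zip(out, recombined))
```

After the split, `along` and `residual` are separate nodes, so a hypothesis can say something specific about `along` while claiming `residual` is unimportant.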
Thanks for the clarification! If I’m understanding correctly, you’re saying that the important part is decomposing activations (linearly?) and that there’s nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features-as-directions frame, by just saying that “the activation component in that direction” is a feature?
Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:
MLP(x) = f(x) + (MLP(x) - f(x))
and then claim that the MLP(x) - f(x) term is unimportant. The parentheses balancer example includes a rewrite of this kind.
Thanks! Can you give a non-linear decomposition example?
I would typically call
MLP(x) = f(x) + (MLP(x) - f(x))
a non-linear decomposition, since f(x) is an arbitrary function.
Regardless, any decomposition into a computational graph is fine, as long as we can prove it is extensionally equal to the original (i.e., it gives the same output on every input). For instance, if MLP(x) = combine(h(x), g(x)) holds extensionally, then I can scrub h(x) and g(x) individually.
One example of this could be a product, e.g., suppose that MLP(x) = h(x) * g(x) (maybe like SwiGLU or something).
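As a rough sketch of the product case, the toy below writes a gated node as combine(h(x), g(x)) with an elementwise product, loosely in the spirit of SwiGLU. All names (`h`, `g`, `combine`, `gated_mlp`) and the random weights are hypothetical. The equality is only spot-checked on sampled inputs, which is weaker than proving extensional equality, and the final intervention only illustrates touching one branch at a time; it is not Redwood's causal scrubbing implementation.

```python
import math
import random

def silu(z):
    """Swish/SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def h(x, w_gate):
    """Hypothetical gate branch: SiLU of a linear map."""
    return [silu(sum(w * xi for w, xi in zip(row, x))) for row in w_gate]

def g(x, w_val):
    """Hypothetical value branch: a plain linear map."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_val]

def combine(h_out, g_out):
    """Non-linear combination: elementwise product of the two branches."""
    return [a * b for a, b in zip(h_out, g_out)]

def gated_mlp(x, w_gate, w_val):
    """The original single node, computed in one fused loop."""
    out = []
    for gate_row, val_row in zip(w_gate, w_val):
        gate = silu(sum(w * xi for w, xi in zip(gate_row, x)))
        out.append(gate * sum(w * xi for w, xi in zip(val_row, x)))
    return out

rng = random.Random(0)
d_in, d_out = 4, 3
w_gate = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
w_val = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Spot-check MLP(x) == combine(h(x), g(x)) on sampled inputs (evidence, not a proof).
for _ in range(100):
    x = [rng.gauss(0, 1) for _ in range(d_in)]
    fused = gated_mlp(x, w_gate, w_val)
    split = combine(h(x, w_gate), g(x, w_val))
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(fused, split))

# With h and g as separate nodes, each can be intervened on independently,
# e.g. feeding h a different input while g still sees x.
x_other = [rng.gauss(0, 1) for _ in range(d_in)]
scrubbed = combine(h(x_other, w_gate), g(x, w_val))
```

The point of the split is only that `h` and `g` become addressable nodes; whether a product decomposition is ever the right one for a real hypothesis is left open, as in the comment above.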
We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.