Thanks for the clarification! If I’m understanding correctly, you’re saying that the important part is decomposing activations (linearly?) and that there’s nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features as directions frame, by just saying that “the activation component in that direction” is a feature?
a non-linear decomposition as f(x) is an arbitrary function.
Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine.
For instance, if it’s the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.
One example of this could be a product, e.g, suppose that MLP(x) = h(x) * g(x) (maybe like swiglu or something).
We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.
Thanks for the clarification! If I’m understanding correctly, you’re saying that the important part is decomposing activations (linearly?) and that there’s nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features as directions frame, by just saying that “the activation component in that direction” is a feature?
Yes! The important part is decomposing activations (not neccessarily linearly). I can rewrite my MLP as:
MLP(x) = f(x) + (MLP(x) - f(x))
and then claim that the MLP(x) - f(x) term is unimportant. There is an example of this in the parentheses balancer example.
Thanks! Can you give a non-linear decomposition example?
I would typically call
MLP(x) = f(x) + (MLP(x) - f(x))
a non-linear decomposition as f(x) is an arbitrary function.
Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine. For instance, if it’s the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.
One example of this could be a product, e.g, suppose that MLP(x) = h(x) * g(x) (maybe like swiglu or something).
We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.