Kshitij Sachan comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Kshitij Sachan 3 Dec 2022 16:08 UTC
LW: 4 AF: 2
Nice summary! One small nitpick:
> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features
This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can “rewrite” our model into an equivalent form that better reflects the computation it’s performing. For example, if we claim that a certain direction in an MLP’s output is important, we could rewrite the single MLP node as the sum of two nodes: the component of the MLP output along that direction, plus a residual term. Then we could make claims about the direction we pointed out and also claim that the residual term is unimportant.
The important point is that we are allowed to rewrite our model however we want as long as the rewrite is equivalent.
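As a minimal sketch of the rewrite described above (not Redwood's actual code): the toy MLP, its random weights, and the chosen `DIRECTION` are all hypothetical stand-ins. The only point is that the two-node form computes the same output as the original single node, and that the residual node carries nothing along the chosen direction.

```python
import random

random.seed(0)
D_MODEL, D_HIDDEN = 4, 8

# Hypothetical toy weights standing in for a real model's MLP layer.
W_IN = [[random.gauss(0, 1) for _ in range(D_HIDDEN)] for _ in range(D_MODEL)]
W_OUT = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_HIDDEN)]


def matvec(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mlp(x):
    """The original single node: a one-hidden-layer ReLU MLP."""
    hidden = [max(h, 0.0) for h in matvec(x, W_IN)]
    return matvec(hidden, W_OUT)


# Hypothetical unit direction in the MLP's output space that we claim matters.
_raw = [random.gauss(0, 1) for _ in range(D_MODEL)]
_norm = dot(_raw, _raw) ** 0.5
DIRECTION = [r / _norm for r in _raw]


def along_direction(x):
    """Node 1: the component of the MLP output along DIRECTION."""
    coeff = dot(mlp(x), DIRECTION)
    return [coeff * d for d in DIRECTION]


def residual(x):
    """Node 2: everything in the MLP output that is not along DIRECTION."""
    return [o - a for o, a in zip(mlp(x), along_direction(x))]


def rewritten_mlp(x):
    """The rewritten model: the two nodes summed."""
    return [a + r for a, r in zip(along_direction(x), residual(x))]


x = [random.gauss(0, 1) for _ in range(D_MODEL)]
# Largest elementwise gap between the original and rewritten outputs.
max_rewrite_error = max(abs(a - b) for a, b in zip(mlp(x), rewritten_mlp(x)))
# How much of the residual node lies along DIRECTION (zero up to rounding).
residual_along_direction = dot(residual(x), DIRECTION)
```

Once the model is in this form, a hypothesis can treat `along_direction` and `residual` as separate nodes, for example by claiming that only the first one matters.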
- Neel Nanda 3 Dec 2022 16:14 UTC
  LW: 2 AF: 1
  Thanks for the clarification! If I’m understanding correctly, you’re saying that the important part is decomposing activations (linearly?) and that there’s nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features-as-directions frame, by just saying that “the activation component in that direction” is a feature?
  - Kshitij Sachan 4 Dec 2022 5:21 UTC
    LW: 2 AF: 2
    Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:
    MLP(x) = f(x) + (MLP(x) - f(x))
    and then claim that the MLP(x) - f(x) term is unimportant. The parentheses balancer example includes an instance of this.
    - Neel Nanda 4 Dec 2022 12:54 UTC
      LW: 2 AF: 1
      Thanks! Can you give a non-linear decomposition example?
      - ryan_greenblatt 5 Dec 2022 1:50 UTC
        LW: 2 AF: 2
        I would typically call
        
        MLP(x) = f(x) + (MLP(x) - f(x))
        
        a non-linear decomposition as f(x) is an arbitrary function.
        
        Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine. For instance, if it’s the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.
        
        One example of this could be a product, e.g., suppose that MLP(x) = h(x) * g(x) (maybe like SwiGLU or something).
      - Kshitij Sachan 5 Dec 2022 20:51 UTC
        1 point
        We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.
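
The two decompositions discussed in this thread, MLP(x) = f(x) + (MLP(x) - f(x)) and MLP(x) = h(x) * g(x), can be sketched roughly as follows. Everything here is a hypothetical toy: the scalar `mlp`, the choice of `f`, and the branches `h` and `g` are illustrative assumptions, and "scrubbing" is simplified to feeding one term a different, randomly chosen input. It is not the full causal scrubbing algorithm.

```python
import math
import random

random.seed(0)


def h(x):
    """Gate branch (SiLU), loosely in the style of a SwiGLU unit."""
    return x / (1.0 + math.exp(-x))


def g(x):
    """Value branch: a hypothetical affine map."""
    return 2.0 * x + 1.0


def mlp(x):
    """Hypothetical scalar stand-in for an MLP, defined as a product."""
    return h(x) * g(x)


def f(x):
    """Hypothetical simpler function that we claim captures the MLP."""
    return max(x, 0.0) * g(x)


def additive_rewrite(x, x_residual):
    """MLP rewritten as f(x) + residual, with a separate input per term."""
    return f(x) + (mlp(x_residual) - f(x_residual))


def product_rewrite(x_h, x_g):
    """MLP rewritten as h * g, with a separate input per branch."""
    return h(x_h) * g(x_g)


inputs = [random.uniform(-3.0, 3.0) for _ in range(200)]
others = random.sample(inputs, len(inputs))  # shuffled copy used for resampling

# With the same input everywhere, both rewrites reproduce the original MLP.
additive_error = max(abs(additive_rewrite(x, x) - mlp(x)) for x in inputs)
product_error = max(abs(product_rewrite(x, x) - mlp(x)) for x in inputs)

# Simplified scrubbing: give one term the input from a different example
# and measure how far the output moves on average.
residual_scrub_shift = sum(
    abs(additive_rewrite(x, o) - mlp(x)) for x, o in zip(inputs, others)
) / len(inputs)
gate_scrub_shift = sum(
    abs(product_rewrite(o, x) - mlp(x)) for x, o in zip(inputs, others)
) / len(inputs)
```

A small `residual_scrub_shift` would be weak evidence for the claim that the MLP(x) - f(x) term is unimportant in this toy setting. A large `gate_scrub_shift` would suggest that the `h` branch matters and cannot be resampled freely.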