Neel Nanda comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Neel Nanda 3 Dec 2022 16:14 UTC
LW: 2 AF: 1
1
AF
Thanks for the clarification! If I’m understanding correctly, you’re saying that the important part is decomposing activations (linearly?) and that nothing is really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features-as-directions frame, just by saying that “the activation component in that direction” is a feature?
- Kshitij Sachan 4 Dec 2022 5:21 UTC
  LW: 2 AF: 2
  0
  AF Parent
  Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:
  MLP(x) = f(x) + (MLP(x) - f(x))
  and then claim that the MLP(x) - f(x) term is unimportant. The parentheses balancer example contains an instance of this.
  - Neel Nanda 4 Dec 2022 12:54 UTC
    LW: 2 AF: 1
    0
    AF Parent
    Thanks! Can you give a non-linear decomposition example?
    - ryan_greenblatt 5 Dec 2022 1:50 UTC
      LW: 2 AF: 2
      1
      AF Parent
      I would typically call
      
      MLP(x) = f(x) + (MLP(x) - f(x))
      
      a non-linear decomposition as f(x) is an arbitrary function.
      
      Regardless, any decomposition into a computational graph (one that we can prove is extensionally equal to the original) is fine. For instance, if it’s the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.
      
      One example of this could be a product, e.g., suppose that MLP(x) = h(x) * g(x) (maybe like SwiGLU or something).
    - Kshitij Sachan 5 Dec 2022 20:51 UTC
      1 point
      0
      Parent
      We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.
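
The two decompositions discussed in this thread can be sketched in plain Python. This is only a minimal illustration, not Redwood's implementation: the scalar toy `mlp`, the particular `f`, `h`, and `g`, and the resampling step at the end are all assumptions made up for the example. It checks that each rewritten graph is extensionally equal to the original on sampled inputs, then feeds the residual term an input from a different datapoint as a crude stand-in for scrubbing it.

```python
import math
import random


def mlp(x):
    # Toy scalar stand-in for an MLP (assumed form, loosely SwiGLU-like).
    return (x / (1.0 + math.exp(-x))) * (2.0 * x + 1.0)


def f(x):
    # Hypothesised "important" part; any function is allowed here.
    return x * x


def residual(x):
    # Whatever f misses, so f(x) + residual(x) equals mlp(x) by construction.
    return mlp(x) - f(x)


def h(x):
    # Gate branch of the product decomposition: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def g(x):
    # Value branch of the product decomposition.
    return 2.0 * x + 1.0


def additive_graph(x_f, x_res):
    # MLP(x) = f(x) + (MLP(x) - f(x)), with a separate input per term.
    return f(x_f) + residual(x_res)


def product_graph(x_h, x_g):
    # MLP(x) = h(x) * g(x), with a separate input per branch.
    return h(x_h) * g(x_g)


random.seed(0)
xs = [random.uniform(-3.0, 3.0) for _ in range(100)]

# Extensional equality: with the same input on every branch, both
# rewritten graphs reproduce the original function.
assert all(math.isclose(additive_graph(x, x), mlp(x), abs_tol=1e-9) for x in xs)
assert all(math.isclose(product_graph(x, x), mlp(x), abs_tol=1e-9) for x in xs)

# Crude resampling check: give the residual term an input drawn from a
# different datapoint and measure how far the output moves on average.
shuffled = random.sample(xs, len(xs))
scrubbed_error = sum(
    abs(additive_graph(x, x_other) - mlp(x)) for x, x_other in zip(xs, shuffled)
) / len(xs)
```

A small `scrubbed_error` would be evidence for the claim that the `MLP(x) - f(x)` term is unimportant; with the arbitrary `f` chosen here there is no reason to expect it to be small. The same per-branch resampling could be applied to `product_graph` to scrub `h` and `g` individually.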