Kshitij Sachan comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Kshitij Sachan 4 Dec 2022 5:21 UTC
LW: 2 AF: 2
Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:
MLP(x) = f(x) + (MLP(x) - f(x))
and then claim that the MLP(x) - f(x) term is unimportant. The parentheses balancer example contains an instance of this.
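
A minimal sketch of this additive decomposition, assuming a toy one-hidden-layer ReLU MLP with random weights. The choice of f (keeping only the first half of the hidden neurons) and the helper scrubbed_mlp are hypothetical and only for illustration; they are not taken from the post. Treating the residual as unimportant is modelled here by evaluating it on a different input:

```python
import random

random.seed(0)

D_IN, D_HID = 3, 6
# Toy weights (hypothetical); the MLP has a scalar output for simplicity.
W_in = [[random.gauss(0, 1) for _ in range(D_HID)] for _ in range(D_IN)]
W_out = [random.gauss(0, 1) for _ in range(D_HID)]


def hidden(x):
    """ReLU hidden activations of the toy MLP."""
    return [
        max(0.0, sum(x[i] * W_in[i][j] for i in range(D_IN)))
        for j in range(D_HID)
    ]


def mlp(x):
    h = hidden(x)
    return sum(h[j] * W_out[j] for j in range(D_HID))


def f(x):
    """Hypothetical f: the contribution of the first half of the neurons."""
    h = hidden(x)
    return sum(h[j] * W_out[j] for j in range(D_HID // 2))


def residual(x):
    """The term claimed to be unimportant: MLP(x) - f(x)."""
    return mlp(x) - f(x)


def scrubbed_mlp(x, x_other):
    """Keep f on the real input, take the residual from another input."""
    return f(x) + residual(x_other)


x = [random.gauss(0, 1) for _ in range(D_IN)]
x_other = [random.gauss(0, 1) for _ in range(D_IN)]

# The rewrite is exact by construction, whatever f is.
assert abs((f(x) + residual(x)) - mlp(x)) < 1e-9

# How far the output moves when the residual comes from another input.
print(abs(scrubbed_mlp(x, x_other) - mlp(x)))
```

If the hypothesis that the residual is unimportant holds, the model's behaviour should change little when the residual is computed from other inputs; for this arbitrary f there is no reason to expect that.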
- Neel Nanda 4 Dec 2022 12:54 UTC
  LW: 2 AF: 1
  Thanks! Can you give a non-linear decomposition example?
  - ryan_greenblatt 5 Dec 2022 1:50 UTC
    LW: 2 AF: 2
    I would typically call
    
    MLP(x) = f(x) + (MLP(x) - f(x))
    
    a non-linear decomposition as f(x) is an arbitrary function.
    
    Regardless, any decomposition into a computational graph (that we can prove is extensionally equal) is fine. For instance, if it’s the case that MLP(x) = combine(h(x), g(x)) (via extensional equality), then I can scrub h(x) and g(x) individually.
    
    One example of this could be a product, e.g., suppose that MLP(x) = h(x) * g(x) (maybe like SwiGLU or something).
  - Kshitij Sachan 5 Dec 2022 20:51 UTC
    1 point
    We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.
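
For reference, the product decomposition mentioned in the thread, MLP(x) = combine(h(x), g(x)) with combine being multiplication, can be sketched as follows. The particular h (a sigmoid gate) and g (an affine map) are hypothetical stand-ins on scalar inputs, and scrubbed is an illustrative helper, not an API from the post:

```python
import math


def h(x):
    """Hypothetical gate branch (sigmoid)."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def g(x):
    """Hypothetical value branch (affine)."""
    return 3.0 * x + 1.0


def combine(a, b):
    """Non-linear way of recombining the two branches."""
    return a * b


def mlp(x):
    return combine(h(x), g(x))


def scrubbed(x_h, x_g):
    """Feed each branch its own input, so h and g can be scrubbed separately."""
    return combine(h(x_h), g(x_g))


# With the same input on both branches we recover the original function.
assert scrubbed(0.5, 0.5) == mlp(0.5)

# Scrubbing only h: the gate sees a different input than the value branch.
print(scrubbed(0.0, 1.0), mlp(1.0))
```

Because the rewritten graph is extensionally equal to the original, any change in behaviour under scrubbing comes from the resampled inputs rather than from the decomposition itself.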