This seems very similar to recent work out of the Stanford AI Lab, linked here.
It’s a pretty different algorithm, though obviously it’s trying to solve a related problem.
ETA: We’ve now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and
I still endorse the main takeaways from my original comment below, but the list of differences isn’t quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).
To me, the methods seem similar in much more than just the problem they’re tackling. In particular, the idea in both cases seems to be:
One format for explanations of a model is a causal/computational graph together with a description of how that graph maps onto the full computation.
Such an explanation makes predictions about what should happen under various interventions on the activations of the full model, by replacing them with activations on different inputs.
We can check the explanation by performing those activation replacements and seeing if the impact is what we predicted (a toy sketch of this kind of check follows below).
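To make this shared idea concrete, here is a minimal toy sketch of such a check. This is not the exact algorithm from either line of work; the model, the node `node_h`, and the hypothesis "h only encodes a" are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "full model": an intermediate activation h plus an output computed from it.
def node_h(a, b):
    return a + 0.0 * b          # by construction, h really does only depend on a

def model_output(a, b, h=None):
    if h is None:
        h = node_h(a, b)
    return h ** 2 + b

# Hypothesis: "the activation h only encodes a". This predicts that replacing h
# with its value on another input sharing the same a (but a freshly resampled b)
# should leave the output unchanged.
def scrubbed_output(a, b, a_other, b_other):
    return model_output(a, b, h=node_h(a_other, b_other))

# Check the prediction in expectation over resampled inputs.
errors = []
for _ in range(1000):
    a, b, b_other = rng.normal(), rng.normal(), rng.normal()
    errors.append(abs(scrubbed_output(a, b, a, b_other) - model_output(a, b)))
print("mean |scrubbed - original| =", np.mean(errors))   # ~0 iff the hypothesis holds
```

If `node_h` secretly used `b` as well, the replacement would change the output and the check would flag the hypothesis as imperfect.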
Here are all the differences I can see:
In the Stanford line of work, the output of the full model and the output of the explanation are of the same type, rather than the explanation having a simplified output. But as far as I can tell, we could always just add a final step to the full computation that simplifies the output, which would basically bridge this gap.
How the methods quantify the extent to which a hypothesis isn’t perfect: at least in this paper, the Stanford authors look at the size of the largest subset of the input distribution on which the hypothesis is perfect, instead of taking the expectation of the scrubbed output.
The “interchange interventions” in the Stanford papers are allowed to change the activations in the explanation. They then check whether the output after the intervention changes in the way the explanation would predict, as opposed to checking that the scrubbed output stays the same. (So along this axis, causal scrubbing just performs a subset of all the interchange interventions; see the sketch after this list.)
Apparently the Stanford authors only perform one intervention at a time, whereas causal scrubbing performs all possible interventions at once.
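To illustrate the interchange-vs-scrubbing point above, here is a toy contrast between the two styles of check (again with made-up functions rather than code from either paper): an interchange-style check compares the model’s intervened output against what the explanation predicts it should change to, while a scrubbing-style check only performs replacements the hypothesis claims are behavior-preserving and tests that the model’s output stays the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy setup as before: full model with intermediate activation h (made-up names).
def node_h(a, b):
    return a

def model_output(a, b, h=None):
    return (node_h(a, b) if h is None else h) ** 2 + b

# The explanation's own high-level computation of the output.
def explanation_output(a, b):
    return a ** 2 + b

# Interchange-style check: swap in h computed on a *different* a, then test whether
# the model's output moves to what the explanation predicts for that swapped value.
def interchange_error(a, b, a_src, b_src):
    swapped = model_output(a, b, h=node_h(a_src, b_src))
    predicted = explanation_output(a_src, b)
    return abs(swapped - predicted)

# Scrubbing-style check: only perform replacements the hypothesis says are
# behavior-preserving (same a, resampled b) and test that the output is unchanged.
def scrubbing_error(a, b, b_src):
    swapped = model_output(a, b, h=node_h(a, b_src))
    return abs(swapped - model_output(a, b))

ic = [interchange_error(*rng.normal(size=4)) for _ in range(1000)]
sc = [scrubbing_error(*rng.normal(size=3)) for _ in range(1000)]
print("interchange error:", np.mean(ic), "| scrubbing error:", np.mean(sc))
```

Note that the interchange check uses the explanation’s own output as the reference, whereas the scrubbing check only ever compares two outputs of the full model.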
These all strike me as differences in implementation of fundamentally the same idea.
Anyway, maybe we’re actually on the same page and those differences are what you meant by “pretty different algorithm”. But if not, I’d be very interested to hear what you think the key differences are. (I’m working on yet another approach and suspect more and more strongly that it’s very similar to both causal scrubbing and Stanford’s causal abstraction approach, so would be really good to know if I’m misunderstanding anything.)
FWIW, I would agree that the motivation of the Stanford authors seems somewhat different, i.e. they want to use this measurement of explanation quality in different ways. I’m less interested in that difference right now.
FWIW, it appears that of the four differences you cited here, only one (the relaxation of the restriction that the scrubbed output must stay the same) still holds as of this January paper from Geiger’s group: https://arxiv.org/abs/2301.04709. So the methods are even more similar than you thought.
Yeah, that seems to be the most important remaining difference now that Atticus is also using multiple interventions at once. Though I think the metrics are also still different? (Of course, that’s pretty orthogonal to the main methods.)
My sense now is that the types of interventions are a bigger difference than I thought when writing that comment. In particular, as far as I can tell, causal scrubbing shouldn’t be thought of as just doing a subset of the interventions; it also does some additional things (basically because the causal-abstraction interventions don’t treeify the model, so they are more limited in that regard; there’s a toy sketch of treeification below). And there’s a closely related difference in that causal scrubbing never compares to the output of the hypothesis, just different outputs of G.
But it also seems plausible that this still turns out not to matter too much in terms of which hypotheses are accepted/rejected. (There are definitely some examples of disagreements between the two methods, but I’m pretty unsure how severe and widespread they are.)
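To gesture at what treeification adds, here is a toy sketch (the model and function names are made up): when one activation feeds two downstream paths, treeification splits it into one copy per path, so a scrubbing-style intervention can recompute each copy on a different resampled input, which a single intervention on the untreeified node can’t express.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model in which a single activation h feeds two downstream paths.
def h(x):
    return x[0] + x[1]

def model_output(x, h_path1=None, h_path2=None):
    # Untreeified: both uses of h share one value. Treeified: each use is its
    # own node and can be given a different replacement.
    h1 = h(x) if h_path1 is None else h_path1
    h2 = h(x) if h_path2 is None else h_path2
    return h1 * x[2] + h2

# Treeified scrub: each copy of h is recomputed on its own resampled input.
def treeified_scrub(x, x_src1, x_src2):
    return model_output(x, h_path1=h(x_src1), h_path2=h(x_src2))

x, x_src1, x_src2 = (rng.normal(size=3) for _ in range(3))
print("original output:   ", model_output(x))
print("treeified scrubbed:", treeified_scrub(x, x_src1, x_src2))
```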
Strongly upvoted for a clear explanation!