Sonia Joseph comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Sonia Joseph 12 Jul 2023 20:22 UTC
4 points
0
Thank you for this write-up!
I am wondering how to relate causal scrubbing to @Arthur Conmy’s ACDC method.
It seems that causal scrubbing becomes relevant when one is testing a relatively specific hypothesis (e.g. induction heads), while ACDC can work with just a dataset, a metric, and a behavior? If so, would it be accurate to say that ACDC is a more general pass, part of an earlier stage of the workflow where you develop your hypothesis, and that causal scrubbing can then validate it? I am curious about the trade-offs in types of insight, resources, computational complexity, and positioning in one’s mech interp workflow, and about the circumstances in which one would use each.
- Adrià Garriga-alonso 10 Dec 2023 23:22 UTC
  3 points
  0
  Parent
  If you’re still interested in this, we have now added Appendix N to the paper, which explains our final take.
- Arthur Conmy 14 Jul 2023 2:02 UTC
  3 points
  2
  Parent
  (Personal opinion as an ACDC author) I would agree that causal scrubbing aims mostly to test hypotheses, while ACDC does a general pass over model components to hopefully develop a hypothesis (see also the causal scrubbing post’s related work and section 2 of the ACDC paper). These generally feel like different steps in a project to me (though some posts in the causal scrubbing sequence show that the method can generate hypotheses, and ACDC was inspired by causal scrubbing). On the practical side, the considerations in How to Think About Activation Patching may be a helpful field guide for projects where one of several tools might be best for a given use case. I think both causal scrubbing and ACDC are currently somewhat inefficient, for different reasons: causal scrubbing is slow because causal diagrams have lots of paths in them, and ACDC is slow because computational graphs have lots of edges in them (follow-up work to ACDC will do gradient descent over the edges).