Ansh Radhakrishnan comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Ansh Radhakrishnan 3 Dec 2022 18:42 UTC
LW: 8 AF: 4
“in particular how much causal scrubbing can be turned into an exploratory tool to find circuits rather than just to verify them”
I’d like to flag that this has been pretty easy to do. For instance, the process can look like this: resample-ablate different nodes of the computational graph (e.g. each attention head or MLP), find the nodes whose ablation most degrades the model’s performance and which are hence important, and then recursively search for nodes relevant to the current set of important nodes by ablating the nodes upstream of each important node.
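To make that loop concrete, here is a minimal sketch on a hypothetical toy graph. It is not Redwood's actual tooling: the node names, the `forward` helper, the squared-output-change metric standing in for "impact on performance", and the threshold are all assumptions made for illustration. Ablating a node means ablating all of its outgoing edges, and the recursive step ablates only the edge from an upstream node into a node already marked important.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy graph in topological order: node -> (parents, function).
# x0, x1, x2 are raw inputs; a real model would expose heads/MLPs via hooks.
GRAPH: Dict[str, Tuple[List[str], Callable[..., float]]] = {
    "head0": (["x0"], lambda a: 2.0 * a),
    "head1": (["x1"], lambda b: -b),
    "head2": (["x2"], lambda c: c),            # only feeds the dead mlp1
    "mlp0": (["head0", "head1"], lambda a, b: a + b),
    "mlp1": (["head2"], lambda c: 0.0 * c),    # contributes nothing
    "out": (["mlp0", "mlp1"], lambda m, n: m + n),
}


def forward(x, alt=None, ablated=frozenset()):
    """Run the graph on input x. For every (parent, child) edge in `ablated`,
    the child reads the parent's activation from `alt`, i.e. from a run on a
    different, resampled input."""
    vals = dict(x)
    for node, (parents, fn) in GRAPH.items():
        args = [alt[p] if (p, node) in ablated else vals[p] for p in parents]
        vals[node] = fn(*args)
    return vals


def ablation_effect(edges, data, rng):
    """Mean squared change in the output when `edges` are resample-ablated
    (an assumed stand-in for a drop in task performance)."""
    total = 0.0
    for x in data:
        alt = forward(rng.choice(data))  # activations on a resampled input
        total += (forward(x, alt, edges)["out"] - forward(x)["out"]) ** 2
    return total / len(data)


def node_effects(data, seed=0):
    """Step 1: resample-ablate each node (all its outgoing edges) in turn."""
    rng = random.Random(seed)
    effects = {}
    for node in GRAPH:
        if node == "out":
            continue
        out_edges = {(node, c) for c, (ps, _) in GRAPH.items() if node in ps}
        effects[node] = ablation_effect(out_edges, data, rng)
    return effects


def find_circuit(data, threshold=1e-6, seed=0) -> Set[str]:
    """Step 2: starting from the output, recursively keep any upstream node
    whose edge into an important node matters when ablated."""
    rng = random.Random(seed)
    important, frontier = {"out"}, ["out"]
    while frontier:
        child = frontier.pop()
        for parent in GRAPH[child][0]:
            if parent not in GRAPH or parent in important:
                continue  # skip raw inputs and nodes already found
            if ablation_effect({(parent, child)}, data, rng) > threshold:
                important.add(parent)
                frontier.append(parent)
    return important


data_rng = random.Random(0)
data = [{k: data_rng.gauss(0.0, 1.0) for k in ("x0", "x1", "x2")}
        for _ in range(50)]
circuit = find_circuit(data)
```

On this toy graph the search keeps `out`, `mlp0`, `head0`, and `head1`, and never visits `head2`, because the edge from `mlp1` into `out` has zero effect by construction. In a real model the threshold would need tuning, and the edge-level ablation would typically be done with activation hooks rather than a hand-written `forward`.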
- Neel Nanda 4 Dec 2022 12:55 UTC
  LW: 3 AF: 3
  Exciting! I look forward to the first “interesting circuit entirely derived by causal scrubbing” paper.