Idea: Speed up ACDC by batching edge-checks. The intuition is that most circuits will have short paths through the transformer, because residual networks behave like ensembles of relatively shallow networks (https://arxiv.org/pdf/1605.06431.pdf), and most edges won’t be in most circuits. Therefore, if you’re checking the KL divergence from resample-ablating edges e1 and e2, there’s only a small chance that e1 and e2 interact with each other in a way that matters for the circuit you’re trying to find. So, statistically, you can check e.g. e1, …, e100 in a single batch, and perhaps ablate 95 of those edges all at once (because their individual KL increases are low).
If true, this predicts that, given a threshold T, for most circuit subgraphs H of the full network graph G and for the vast majority of edge pairs e1, e2 in H:
KL(G || H \ {e2}) − KL(G || H) < T
iff
KL(G || H \ {e1, e2}) − KL(G || H \ {e1}) < T
(That is, e1's inclusion doesn't change your pruning decision on e2.)
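A minimal sketch of the batched pruning rule above (all names are invented for illustration). It stands in for KL(G || H) with a toy additive importance model, i.e. the regime where the iff holds exactly, so checking each edge's KL increase independently and then ablating every low-scoring edge in one go is safe:

```python
# Hypothetical sketch of batched edge-pruning (toy model, not real ACDC code).
# Each edge contributes an independent "importance" to the KL divergence,
# which is the assumption under which the batch decision matches the
# one-edge-at-a-time decision.

THRESHOLD = 0.05  # the pruning threshold T

# Toy per-edge importance scores standing in for KL increases.
edge_importance = {"e1": 0.01, "e2": 0.30, "e3": 0.02, "e4": 0.001}

def kl_vs_full(ablated: set[str]) -> float:
    """Toy KL(G || H with `ablated` removed): sum of removed edges' scores."""
    return sum(edge_importance[e] for e in ablated)

def batched_prune(edges: list[str]) -> set[str]:
    """Check each edge's individual KL increase (conceptually one batched
    forward pass), then ablate every edge whose increase is below THRESHOLD."""
    base = kl_vs_full(set())
    return {e for e in edges if kl_vs_full({e}) - base < THRESHOLD}

pruned = batched_prune(["e1", "e2", "e3", "e4"])
# Under additivity, ablating all of `pruned` at once changes KL by the sum
# of their individual deltas, so no per-edge sequential check is needed.
```

In a real transformer the additivity assumption fails exactly when edges interact, which is why the text above frames this as a statistical bet rather than a guarantee.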
Neel Nanda suggests that it’s particularly natural to batch edges in the same layer.
Related: “To what extent does Veit et al. still hold on transformer LMs?” feels to me like a super tractable and pretty helpful paper someone could work on (Anthropic do some experiments along these lines, but not many). People discuss this paper a lot with regard to NNs having a simplicity bias, as well as its implication that NNs are really stacks of many narrow heuristics rather than deep paths. Obviously, empirical work won’t provide crystal-clear answers to these questions, but that doesn’t mean working on this sort of thing isn’t valuable.
Yeah I agree with the intuition and hadn’t made the explicit connection to the shallow paths paper, thanks!
I would say that Edge Attribution Patching (https://arxiv.org/abs/2310.10348) is the extreme form of this: we just ignored almost all subgraphs H except for G \ {e1} (removing one edge only), still got reasonable results, and this agrees with some upcoming results.
I’m curious to know how much the code could be faster through using a faster programming language. For example, MOJO. @Arthur Conmy
Adria (the other coauthor on the paper) told me that Redwood ported ACDC to Rust early on, which did provide a useful speedup (idk how much, but my guess is 10-100x?) but made the code harder to maintain. I’m currently working in Python. As for Mojo, I wouldn’t believe those marketing numbers for anything but the Mandelbrot task they test on, which has particularly high interpreter overhead.
The bigger problem with ACDC is that it doesn’t use gradients. Attribution patching fixes this and gets >100x speedups, and there should be even better methods. I don’t expect circuit discovery to be usefully ported to a fast language again until it is used in production.
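The gradient trick behind attribution patching can be shown with a toy scalar example (my own illustration, not the actual ACDC/EAP implementation): the metric change from patching an activation from its clean value to a corrupted value is approximated to first order by gradient × (corrupted − clean), so one backward pass prices every candidate patch at once instead of one forward pass per patch:

```python
# Toy first-order sketch of attribution patching (invented example).
# True patch effect:      loss(a_corrupt) - loss(a_clean)
# Linear approximation:   dL/da at a_clean, times (a_corrupt - a_clean)

def loss(a: float) -> float:
    return a * a  # toy downstream metric as a function of one activation

def grad_loss(a: float) -> float:
    return 2.0 * a  # analytic gradient of the toy metric

a_clean, a_corrupt = 1.0, 1.1

exact = loss(a_corrupt) - loss(a_clean)              # requires a forward pass per patch
approx = grad_loss(a_clean) * (a_corrupt - a_clean)  # one gradient prices all patches

# For a small perturbation, the linear estimate is close to the exact effect.
```

The speedup comes from amortization: in a network, the gradient with respect to every edge's activation falls out of a single backward pass, whereas exact patching needs a separate forward pass per edge.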
I do not understand the appeal of MOJO. If you’re going to overcomplicate Python to the point of C++, just use C++. Or, you know, a saner systems language like Rust. CPython has C interop already.
For what it’s worth, I’ve had to drop from Python to C# on occasion for some bottlenecks. In one case, my C# implementation was 418,000 times faster than the Python version. That’s a comparison between a poor Python implementation and a vectorized C# implementation, but… yeah.