Yes. By design, the circuits discovered in this manner may miss how or when something is computed. However, we argue that identifying the important representations at bottlenecks, and how they change across layers, still provides useful information about the model.
One of our future directions, similar in spirit to crosscoders, is to train “Layer Output Buffer SAEs” that aim to capture the computation between bottlenecks.
You are correct that the current method yields only a set of features at each selected layer. The edges are intended to show the direction of attention flow within the architecture. We have updated the figure to make this clearer and to fix some small issues.
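For concreteness, here is a minimal sketch of the kind of output this produces (feature sets per selected layer plus architecture-level edges). All layer numbers, feature indices, and names below are hypothetical placeholders, not values from our results or code.

```python
# Hypothetical illustration only: a set of SAE features at each selected
# (bottleneck) layer; layer numbers and feature indices are made up.
from typing import Dict, Set, Tuple

circuit_features: Dict[int, Set[int]] = {
    4:  {113, 902, 1744},   # SAE feature indices kept at layer 4
    8:  {57, 3300},         # ... at layer 8
    12: {18, 419, 2051},    # ... at layer 12
}

# Edges indicate the direction in which attention can move information between
# selected layers in the architecture, not per-feature computational dependencies.
Node = Tuple[int, int]      # (layer, feature index)
edges: Set[Tuple[Node, Node]] = {
    ((4, 113), (8, 57)),
    ((8, 57), (12, 419)),
}
```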
We think there are a few reasons why the results of the ACDC paper do not transfer to our domain:
ACDC and EAP (Syed et al.) rely on overlap with a manually identified circuit as their metric, whereas we rely on faithfulness and completeness. Because the metrics differ, the comparison is not apples-to-apples.
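For reference, one common formulation of such a faithfulness score is sketched below; the exact definitions we use are the ones stated in the paper, and this snippet is only an illustrative assumption-laden sketch.

```python
def faithfulness(m_circuit: float, m_full: float, m_ablated: float) -> float:
    """Illustrative sketch of one common faithfulness formulation, not
    necessarily our exact definition.

    m_full:    task metric of the unmodified model
    m_circuit: task metric when only the circuit's features are kept
               (everything else ablated, e.g. mean-ablated)
    m_ablated: task metric with all candidate features ablated
    """
    return (m_circuit - m_ablated) / (m_full - m_ablated)

# Completeness can analogously be probed by ablating the circuit and checking
# how much of the model's performance is lost (again, see the paper for the
# definition we actually use).
```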
The major difference between the methods, as you mentioned, is that we find circuits in the SAE basis; this quite possibly accounts for most of the discrepancy.
The SAEs vs. neurons comparison is something we definitely want to test. However, the methods mentioned above (ACDC, EAP, etc.) use transformer components (MLP and attention blocks) as their units of circuit analysis, whereas our setup would have to use residual-stream neurons. We do not think residual-stream neurons are directly comparable to transformer components because they sit at different levels of granularity.
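To make the granularity point concrete, here is a rough sketch contrasting the three unit choices (component-level nodes vs. residual-stream neurons vs. SAE features). The dimensions and tensors are arbitrary placeholders, and the encoder is a generic ReLU SAE assumed for illustration, not our trained SAEs.

```python
import torch

torch.manual_seed(0)
d_model, d_sae = 768, 24576                         # placeholder dimensions

resid = torch.randn(d_model)                        # residual-stream activation at one layer/token

# (a) "Neuron" units: the d_model raw coordinates of the residual stream.
neuron_units = resid                                # 768 fine-grained, unlabelled units

# (b) SAE units: the same vector re-expressed as sparse feature activations
#     (generic ReLU SAE encoder with random placeholder weights).
W_enc = torch.randn(d_sae, d_model) / d_model**0.5
b_enc = torch.zeros(d_sae)
feature_units = torch.relu(W_enc @ resid + b_enc)   # ~24k sparse, more interpretable units

# (c) Component units (ACDC / EAP style): whole MLP or attention blocks are the
#     nodes, i.e. a handful of coarse units per layer rather than hundreds or
#     thousands of per-direction units.
```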