I’m currently excited about a “macro-interpretability” paradigm. To quote Joseph Bloom:
TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocated limited resources such as residual stream and weights between different learnable circuits, seems important.
The general topic I think we are getting at is something like “circuit economics”. The thing I’m trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, activating on distinct patterns), they share capacity in weights (see “Polysemanticity and Capacity in Neural Networks”) and I guess “bandwidth” (getting penalized for interfering signals in activations). There are a few reasons why I think this feels like economics, which include: scarce resources, value chains (features composed of other features) and competition (if a circuit is predicting something well with one heuristic, maybe there will be smaller gradient updates to encourage another circuit learning a different heuristic to emerge).
So to tie this back to your post and Alex’s comment “which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability.” I think that what interpretability has recently dealt with in elucidating specific circuits is something like “micro-interpretability” and is akin to microeconomics. However, this post seems to show a larger trend, i.e. “macro-interpretability”, which would possibly affect which of such circuits are possible/likely to be in the final model.
I’m also excited by tactics like “fully reverse engineer the important bits of a toy model, and then consider what tactics and approaches would—in hindsight—have quickly led you to understand the important bits of the model’s decision-making.”