(Note: This was a post, but in retrospect it would probably have been better posted as a shortform.)
(Epistemic status: about 20 minutes' worth of thinking. I haven't done any builder/breaker on this yet, although I plan to, and would welcome any attempts in the comments.)
Have an algorithmic task whose input/output pairs can (with reasonable algorithmic complexity) be generated by a highly specific combination of modular components (e.g., basic arithmetic operations, combinations of random NN module outputs, etc.).
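A minimal sketch of what such a task generator could look like, assuming the "modular components" are basic arithmetic functions composed into a fixed pipeline (all module names and sizes here are illustrative, not prescribed by the post):

```python
import random

# Hypothetical "modular components": each is a simple arithmetic function.
MODULES = {
    "add3": lambda x: x + 3,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "mod7": lambda x: x % 7,
}

def make_task(depth=3, seed=0):
    """Sample a specific composition of modules. The sampled sequence of
    module names is the ground-truth computational graph for this task."""
    rng = random.Random(seed)
    names = [rng.choice(sorted(MODULES)) for _ in range(depth)]

    def f(x):
        for name in names:
            x = MODULES[name](x)
        return x

    return names, f

# Generate input/output pairs for one task.
names, f = make_task(depth=3, seed=42)
pairs = [(x, f(x)) for x in range(10)]
```

Because the generator records which modules it composed, every task comes with a ground-truth graph label "for free", which is what makes the supervised setup in the later steps possible.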
Train a small transformer (or anything, really) on the input/output pairs.
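Since the post says "or anything, really", here is a hedged stand-in for this step using a tiny one-hidden-layer MLP trained with hand-written gradient descent on a toy linear task (all hyperparameters are arbitrary placeholders):

```python
import numpy as np

# Toy "algorithmic" target: y = 2x + 3, i.e., a known simple graph.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 1))
Y = 2 * X + 3

# One-hidden-layer MLP; sizes are arbitrary.
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)          # forward pass
    pred = H @ W2 + b2
    err = pred - Y
    # Backprop by hand for this tiny model (mean-squared-error loss).
    dW2 = H.T @ err / len(X); db2 = err.mean(0)
    dH = err @ W2.T * (1 - H ** 2)
    dW1 = X.T @ dH / len(X); db1 = dH.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean())
```

The point is only that, at the end of training, the weights `W1, b1, W2, b2` implicitly encode the task's ground-truth graph; the next steps are about extracting that graph back out.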
Take a large transformer that takes the small transformer's activations/weights as input and outputs a computational graph.
Train that large transformer on the small transformers' weights, across a diverse set of such algorithmic tasks (probably automatically generated) with varying complexity. Now you have a general tool that takes in a set of high-dimensional matrices and backs out a simple computational graph. Great! Let's call it the Inspector.
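A drastically simplified version of the Inspector pipeline, to make the supervised setup concrete. Here each "task" applies a hidden integer scale k, the "small model" is a one-parameter least-squares fit, and the "Inspector" is just a nearest-neighbor lookup from weights to graph labels (everything here is a toy assumption, not the proposed architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_small_model(k):
    """Fit a one-weight linear model to the task y = k * x.
    Its single learned weight should recover k."""
    X = rng.uniform(-1, 1, 64)
    Y = k * X
    w = (X @ Y) / (X @ X)  # closed-form 1-D least squares
    return np.array([w])

# Build the Inspector's training set: (model weights, ground-truth label).
train_set = [(train_small_model(k), k) for k in range(1, 6) for _ in range(20)]
weights = np.stack([w for w, _ in train_set])
labels = np.array([k for _, k in train_set])

def inspector(w):
    """Recover the task's graph label from model weights via 1-NN.
    (Stand-in for the large weights-to-graph transformer.)"""
    i = int(np.argmin(np.linalg.norm(weights - w, axis=1)))
    return int(labels[i])

recovered = inspector(train_small_model(3))
```

The real proposal replaces the nearest-neighbor lookup with a large transformer and the one-weight models with small transformers, but the training signal is the same: the task generator supplies ground-truth graphs as labels.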
Apply the Inspector to real models and see if it recovers anything we might expect (like induction heads).
To go a step further, apply the Inspector to itself. Maybe we can back out a human-implementable general solution for mechanistic interpretability! (Or at least build better intuitions toward one.)
(This probably won't work, or at least isn't as simple as described above. Again, I welcome any builder/breaker attempts!)