I really like your vision for interpretability!
I’ve been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I’d be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
Thanks!
I forgot about faithful CoT and definitely think that should be a “Step 0”. I’m also concerned that AGI labs just won’t do the reasonable things (e.g., training for brevity, which could make the CoT more steganographic).
For mech interp, yeah, we’re currently bottlenecked by:
Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
Computing Attention_in → Attention_out (Keith got the QK-circuit → attention-pattern step working a while ago, but it hasn’t been hooked up with the OV-circuit yet)
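For concreteness, the Attention_in → Attention_out pipeline above can be sketched with the QK/OV decomposition from the transformer-circuits framework: the QK circuit (W_Q W_K^T) fixes the attention pattern, and the OV circuit (W_V W_O) fixes what each attended-to position writes into the residual stream. This is a toy numpy sketch, not Keith’s actual code; all weight matrices, dimensions, and variable names here are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 4, 5

# Toy residual-stream inputs and per-head weights (all hypothetical).
x = rng.standard_normal((seq, d_model))
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

# QK circuit: the bilinear form W_Q @ W_K.T determines attention scores
# directly from residual-stream vectors, without materializing Q and K.
qk_circuit = W_Q @ W_K.T                       # (d_model, d_model)
scores = (x @ qk_circuit @ x.T) / np.sqrt(d_head)

# Causal mask + softmax turns scores into the attention pattern.
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
scores[mask] = -np.inf
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

# OV circuit: once the pattern is fixed, W_V @ W_O determines what the
# head writes back to the residual stream (the Attention_out step).
ov_circuit = W_V @ W_O                         # (d_model, d_model)
head_out = pattern @ x @ ov_circuit

# Sanity check: matches the standard per-matrix attention computation.
reference = pattern @ (x @ W_V) @ W_O
assert np.allclose(head_out, reference)
```

The point of the decomposition is that the two circuits can be analyzed independently: hooking them up, as the bullet above describes, just means composing the frozen attention pattern from the QK circuit with the OV circuit’s write.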