Thanks!
I forgot about faithful CoT and definitely think that should be a “Step 0”. I’m also concerned here that AGI labs just won’t do the reasonable things (e.g. they’ll train for brevity, making the CoT more steganographic).
For mech-interp, yeah, we’re currently bottlenecked by:
Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
Computing Attention_in → Attention_out (Keith got the QK-circuit → attention-pattern part working a while ago, but it hasn’t been hooked up with the OV-circuit yet); see the sketch after this list.
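For concreteness, here’s a minimal sketch (plain numpy, not Keith’s actual code; all names and shapes are illustrative assumptions) of the standard decomposition we’re trying to hook up: the QK circuit fixes the attention pattern, the OV circuit fixes what gets written into the residual stream, and Attention_in → Attention_out is the pattern applied to the OV outputs.

```python
import numpy as np

def attention_head(x, W_Q, W_K, W_V, W_O):
    """x: [seq, d_model]; W_Q/W_K/W_V: [d_model, d_head]; W_O: [d_head, d_model].
    Hypothetical single-head example for illustration only."""
    d_head = W_Q.shape[1]

    # QK circuit: (x W_Q)(x W_K)^T decides *where* each position attends.
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
    scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)  # causal mask
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)  # the attention pattern

    # OV circuit: x W_V W_O decides *what* each attended-to position writes back.
    ov_out = x @ W_V @ W_O  # [seq, d_model]

    # Attention_in -> Attention_out: attention pattern (QK) applied to OV outputs.
    return pattern @ ov_out
```

The point of splitting it this way is that the pattern and the moved information can be analyzed separately, which is what makes wiring the QK result into the OV-circuit the remaining step.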