I primarily see mechanistic interpretability as a potential path toward understanding how models develop capabilities and internal processes, especially those that may represent misalignment. Hence, I view it as a means to monitor and align systems, not so much as a way to directly improve them (unless, of course, we are able to include interpretability in the training loop).
Did I understand your question correctly? Are you viewing interpretability work as a means to improve AI systems and their capabilities?