I primarily see mechanistic interpretability as a potential path toward understanding how models develop capabilities and internal processes—especially those that may reflect misalignment. Hence, I view it as a means to monitor and align models, not so much as a way to directly improve them (unless, of course, we are able to incorporate interpretability into the training loop).