Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, for all the interpretability work that exists, a significant gap remains between the research and engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, shouldn't we be seeing tools that are more helpful on real-world problems?
This 12-post sequence argues for taking an engineering approach to interpretability research. Through this lens, it analyzes existing work and proposes directions for moving forward.