The Engineer’s Interpretability Sequence

Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, for all the interpretability work that exists, there is a significant gap between the research and practical engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, shouldn't we be seeing tools that are more helpful on real-world problems?

This 14-post sequence argues for taking an engineering approach to interpretability research. Through this lens, it analyzes existing work and proposes directions for moving forward.

The Engineer’s Interpretability Sequence (EIS) I: Intro

EIS II: What is “Interpretability”?

EIS III: Broad Critiques of Interpretability Research

EIS IV: A Spotlight on Feature Attribution/Saliency

EIS V: Blind Spots in AI Safety Interpretability Research

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

EIS VII: A Challenge for Mechanists

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

EIS IX: Interpretability and Adversaries

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

EIS XI: Moving Forward

EIS XII: Summary

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

EIS XIV: Is mechanistic interpretability about to be practically useful?