Developmental interpretability is a new alignment research agenda that studies how structure forms in neural networks.
It is grounded in Singular Learning Theory (SLT) and aims to build tools for detecting, locating, interpreting, and preventing the sudden onset of (dangerous) capabilities.