While it’s fun, and it managed to find meaningful correlations for 1-2 neurons out of 50, the strongest takeaway for me was the inadequacy of the paradigm ‘what concept does neuron X correspond to?’. It’s clear (no surprise, but I’d never had it shoved in my face before) that we need a lot of improved theory before we can automate. Maybe AI will automate that theoretical progress, but that feels harder, and further from automation, than learning how to hand off solidly paradigmatic interpretability approaches to AI. ManualMechInterp combined with mathematical theory and toy examples seems like the right mix of strategies to me, though ManualMechInterp shouldn’t be the largest component imo.
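For concreteness, here’s roughly the kind of neuron-by-neuron check I mean; this is my own toy sketch on synthetic data (the function, threshold, and planted “concept neuron” are illustrative, not the actual setup from the experiment above):

```python
# Toy sketch: correlate each neuron's activation with a binary concept label
# and flag the few neurons, if any, that clear a threshold.
import numpy as np

def concept_correlations(activations: np.ndarray, concept_labels: np.ndarray) -> np.ndarray:
    """activations: (n_examples, n_neurons); concept_labels: (n_examples,) in {0, 1}."""
    acts = (activations - activations.mean(0)) / (activations.std(0) + 1e-8)
    labels = (concept_labels - concept_labels.mean()) / (concept_labels.std() + 1e-8)
    return acts.T @ labels / len(labels)  # Pearson r per neuron

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 50))        # pretend activations for 50 neurons
labels = rng.integers(0, 2, size=1000)    # pretend concept labels
acts[:, 3] += 0.8 * labels                # plant one "concept neuron"

r = concept_correlations(acts, labels)
print(np.flatnonzero(np.abs(r) > 0.3))    # typically only the planted neuron survives
```

The point being: even when this works, it only tells you which neurons co-vary with a concept you already chose to look for, which is exactly the paradigm I’m calling inadequate.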
Yes, I agree that automated interpretability should be based on scientific theories of DNNs, of which there are many already, and which should be woven together with existing mech interp (proto-)theories and empirical observations.
Thanks for the pointers!