I think strategically, only automated and black-box approaches to interpretability make practical sense to develop now.
Just on this, I (not part of SERI MATS but working from their office) had a go at a basic ‘make ChatGPT interpret this neuron’ system for the interpretability hackathon over the weekend. (GitHub)
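For a sense of what 'make ChatGPT interpret this neuron' amounted to, here's a rough sketch of the kind of loop involved. This is not the actual hackathon code: the model name, the prompt wording, and the `get_top_activating_snippets` helper are placeholders I'm using for illustration.

```python
# Minimal sketch: collect the text snippets that most strongly activate a neuron,
# then ask a chat model to guess what concept (if any) the neuron tracks.
# The snippet-gathering helper and the prompt are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def get_top_activating_snippets(layer: int, neuron: int, k: int = 20) -> list[str]:
    """Hypothetical helper: run a corpus through the model under study and
    return the k snippets with the highest activation at (layer, neuron)."""
    raise NotImplementedError


def explain_neuron(layer: int, neuron: int) -> str:
    snippets = get_top_activating_snippets(layer, neuron)
    prompt = (
        "Below are text snippets that strongly activate a single neuron in a "
        "language model. Suggest, in one sentence, what concept the neuron "
        "might correspond to, or say 'no clear concept' if none fits.\n\n"
        + "\n".join(f"- {s}" for s in snippets)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# e.g. explain_neuron(layer=9, neuron=1423)
```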
While it was fun, and the system managed to find meaningful correlations for 1-2 neurons out of 50, the strongest takeaway for me was the inadequacy of the paradigm 'what concept does neuron X correspond to?'. It's clear (no surprise, but I'd never had it shoved in my face) that we need a lot of improved theory before we can automate. Maybe AI will automate that theoretical progress, but that feels harder, and further from automation, than learning how to hand off solidly paradigmatic interpretability approaches to AI. ManualMechInterp combined with mathematical theory and toy examples seems like the right mix of strategies to me, tho ManualMechInterp shouldn't be the largest component imo.
FWIW, I agree that learning history/philosophy of science is a good source of models and healthy experimental thought patterns. I was recommended Hasok Chang's books (Inventing Temperature, Is Water H2O?) by folks at Conjecture, and I'd heartily recommend them in turn.
I know the SERI MATS technical lead @Joe_Collman spends a lot of his time thinking about how they can improve feedback loops; he might be interested in a chat.
You also might be interested in Mike Webb's project to set up programs that pass on quality decision-making from top researchers to students, which is being tested on SERI MATS people at the moment.
Yes, I agree that automated interpretability should be based on scientific theories of DNNs, of which there are already many, and which should be woven together with existing mech interp (proto-)theories and empirical observations.
Thanks for the pointers!