I’m a new OpenPhil fellow for a mid-career transition—from other spaces in AI/ML—into AI safety, with an interest in interpretability. Given my experience, I bias towards intuitively optimistic about mechanical interpretability in the sense of discovering representations and circuits and trying to make sense of them. But I’ve started my deep dive into the literature. I’d be really interested to hear from @Buck and @ryan_greenblatt and those who share their skepticism about what directions they prefer to invest for their own and their team’s research efforts!
Out of the convo and the comments I got relying more on probes rather than dictionaries and circuits alone. But I feel pretty certain that’s not the complete picture! I came to this convo from the Causal Scrubbing thread which felt exciting to me and like a potential source of inspiration for a mini research project for my fellowship (6 months, including ramp up/learning). I was a bit bummed to learn that the authors found the main benefit of that project to be informing them to abandon mech interp :-D
On a related note, one of the other papers that put me on a path to this thread was this one on Causal Mediation. Fairly long ago at this point I had a phase of interest in Pearl’s causal theory and thought that paper was a nice example of thinking about what’s essentially ablation and activation patching from that point of view. Are there any folks who’ve taken a deeper stab at leveraging some of the more recent theoretical advances in graphical causal theory to do mech interp? Would super appreciate any pointers!
I’m a new OpenPhil fellow for a mid-career transition—from other spaces in AI/ML—into AI safety, with an interest in interpretability. Given my experience, I bias towards intuitively optimistic about mechanical interpretability in the sense of discovering representations and circuits and trying to make sense of them. But I’ve started my deep dive into the literature. I’d be really interested to hear from @Buck and @ryan_greenblatt and those who share their skepticism about what directions they prefer to invest for their own and their team’s research efforts!
Out of the convo and the comments I got relying more on probes rather than dictionaries and circuits alone. But I feel pretty certain that’s not the complete picture! I came to this convo from the Causal Scrubbing thread which felt exciting to me and like a potential source of inspiration for a mini research project for my fellowship (6 months, including ramp up/learning). I was a bit bummed to learn that the authors found the main benefit of that project to be informing them to abandon mech interp :-D
On a related note, one of the other papers that put me on a path to this thread was this one on Causal Mediation. Fairly long ago at this point I had a phase of interest in Pearl’s causal theory and thought that paper was a nice example of thinking about what’s essentially ablation and activation patching from that point of view. Are there any folks who’ve taken a deeper stab at leveraging some of the more recent theoretical advances in graphical causal theory to do mech interp? Would super appreciate any pointers!