This is pretty similar to my thinking on the subject, although I haven’t put nearly the same level of time and effort that you have into doing actual mechanistic interpretability work. Mech interp has the considerable advantage of being a principled way to try to deeply understand neural networks, and if I thought timelines were long it would seem to me like clearly the best approach.
But it seems pretty likely to me that timelines are short, and my best guess is that mech interp can’t scale fast enough to provide the kind of deep and comprehensive understanding of the largest systems that we’d ideally want for aligning AI, and so my main bet is on less-principled higher-level approaches that can keep us safe or at least buy time for more-principled approaches to mature.
That could certainly be wrong! It seems plausible that automated mech interp scales rapidly over the next few years and we can make strong claims about larger and larger feature circuits, and ultimately about NNs as a whole. I very much hope that that’s how things work out; I’m just not convinced it’s probable enough to be the best thing to work on.
I’m hoping to see responses to your post from folks like @Neel Nanda and @Joseph Bloom who are more bullish on mech interp.
Great points! I agree re: short timelines being the crux.
I chatted to Logan Riggs today, and he argued that improvements in capabilities will make ambitious mech interp possible in time to let us develop solutions to align / monitor powerful AI. This seems very optimistic, to say the least, and I remain unconvinced that mech interp will ‘somehow’ buck its historical trend of disappointing results.