Anthropic’s approach doesn’t seem to have panned out
Please don’t take that tweet as evidence that mech interp is doomed! Much attention is on sparse autoencoders nowadays, which seem like a cool and promising approach
Tweet link removed.
Thanks! I will separately say that I disagree with the statement regardless of whether you’re treating my tweet as evidence
In what sense do you consider the mechinterp paradigm that originated with Olah to be working?
We are finding a bunch of insights about the internal features and circuits inside models that I believe to be true, and developing useful techniques like sparse autoencoders and activation patching that expand the space of what we can do. We’re starting to see signs of life of actually doing things with mech interp, though it’s early days. I think skepticism is reasonable, and we’re still far from actually mattering for alignment, but I feel like the field is making real progress and is far from failed