We are finding a bunch of insights about the internal features and circuits inside models that I believe to be true, and developing useful techniques like sparse autoencoders and activation patching that expand the space of what we can do. We’re starting to see signs of life in actually applying mech interp, though it’s early days. I think skepticism is reasonable, and we’re still far from actually mattering for alignment, but I feel like the field is making real progress and is far from having failed.