Great post! I think a really cool research direction in mech interp would be looking for alignment-relevant circuits in a misaligned model. It seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and it would teach us a ton about what to look for when, e.g., auditing a potentially misaligned model. I'd love to hear about any progress you make, and about possible room for collaboration.