Neel Nanda comments on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Neel Nanda 10 Aug 2023 19:45 UTC
LW: 21 AF: 15
Great post! I think a really cool research direction in mech interp would be looking for alignment-relevant circuits in a misaligned model. It seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and it would teach us a ton about what to look for when, e.g., auditing a potentially misaligned model. I'd love to hear about any progress you make, and about possible room for collaboration.