Really excited to see this come out! This feels like one of my favourite interpretability papers in a while. The part of this paper I found most surprising/compelling was just seeing the repeated trend of “you do a natural thing, form a plausible hypothesis with some evidence. Then dig further, discover your hypothesis was flawed in a subtle way, but then patch things to be better”. Eg, the backup name movers are fucking wild.
Really excited to see this come out! This feels like one of my favourite interpretability papers in a while. The part of this paper I found most surprising/compelling was just seeing the repeated trend of “you do a natural thing, form a plausible hypothesis with some evidence. Then dig further, discover your hypothesis was flawed in a subtle way, but then patch things to be better”. Eg, the backup name movers are fucking wild.