I have been thinking seriously about interpretability for neural networks since mid-2023. The biggest early influences on me that I recall were Olah’s writings and a podcast that Nanda did. The third most important is perhaps this post, which I valued as an opposing opinion that helped sharpen my views.
I’m not sure it has aged well, in the sense that it’s no longer clear to me that I would direct someone to read it in 2025, and I disagree with many of the object-level claims. That said, at a time when some of the core mechanistic interpretability work is not being subjected to peer review, on balance I wish there were more sceptical writing like this.