Mechanistic Interpretability for the MLP Layers (rough early thoughts)
This post is a link to a video I just made in response to the new work coming out of Anthropic described here and here and discussed on LessWrong here.
In the video, I try to puzzle through how best to think about the MLP layers of a transformer, in the same spirit in which Anthropic is thinking through the self-attention layers.