Mechanistic Interpretability for the MLP Layers (rough early thoughts)

MadHatter, 24 Dec 2021 7:24 UTC

12 points

Anthropic (org), AI Interpretability (ML & AI)

This post links to a video I just made in response to the new work coming out of Anthropic, described here and here and discussed on LessWrong here.

In the video I try to puzzle through how best to think about the MLP layers of a transformer, in the same spirit in which Anthropic is thinking through the self-attention layers.
