This post is a link to a video I just made in response to the new work coming out of Anthropic described here and here and discussed on LessWrong here.
In the video I try to puzzle through how best to think about the MLP layers of a transformer in the same spirit as Anthropic is thinking through the self-attention layers.
Well, goodness, it’s really impressive (and touching) that someone absorbed the content of our paper and made a video with thoughts building on it so quickly! It took me a lot longer to understand these ideas.
I’m trying to not work over the holidays, so I’ll restrict myself to a few very quick remarks:
There’s a bunch of stuff buried in the paper’s appendix which you might find interesting, especially the “additional intuition” notes on MLP layers, convolutional-like structure, and bottleneck activations. A lot of it is quite closely related to the things you talked about in your video.
You might be interested in work in the original circuits thread, which focused on reverse engineering convolutional networks with ReLU neurons. Curve Detectors and Curve Circuits are a deep treatment of one case and might shed light on some of the ideas you were thinking about. (For example, you discussed what we call “dataset examples” for a bit.)
LayerNorm in transformers is slightly different from what you describe: there are no interactions between tokens. This is actually the reason LayerNorm is preferred. In autoregressive transformers, one needs to be paranoid about avoiding information leakage from future tokens, and normalizing across tokens makes that very complicated, so per-token normalization is strongly preferred. In any case, this has some minor flow-through effects on other things you say.
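(To make the per-token point concrete, here's a minimal sketch in plain NumPy, with made-up shapes: each token's feature vector is normalized over the feature dimension independently, so nothing flows between positions.)

```python
import numpy as np

def layer_norm_per_token(x, gamma, beta, eps=1e-5):
    """LayerNorm as used in transformers: normalize each token's feature
    vector independently, so there is no mixing across the sequence
    (token) dimension.

    x:     (seq_len, d_model) activations for one sequence
    gamma: (d_model,) learned scale
    beta:  (d_model,) learned shift
    """
    mean = x.mean(axis=-1, keepdims=True)  # per-token mean over features
    var = x.var(axis=-1, keepdims=True)    # per-token variance over features
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Changing token t only changes row t of the output -- future tokens
# cannot leak information into earlier positions through this op.
```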
Most transformers prefer GeLU neurons to ReLU neurons.
In general, I’d recommend pushing linearization back until you hit a privileged basis (either a previous MLP layer or the input tokens) rather than the residual stream. My guess is that’s the most interpretable formulation of things. It turns out you can always do this.
I think there’s another important idea that you’re getting close to and I wanted to remark on:
Just as we can linearize the attention layers by freezing the attention patterns, we can linearize a ReLU MLP layer by freezing the “mask” (what you call the referendum).
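(A rough sketch of what I mean, in NumPy with made-up shapes: once the mask of which neurons fire is frozen at the values it takes on some reference input, the MLP block becomes an affine function of its input, analogous to an attention layer with frozen attention patterns.)

```python
import numpy as np

def mlp_relu(x, W_in, b_in, W_out, b_out):
    """Standard ReLU MLP block: W_out @ relu(W_in @ x + b_in) + b_out."""
    pre = W_in @ x + b_in
    return W_out @ np.maximum(pre, 0.0) + b_out

def mlp_frozen_mask(x, W_in, b_in, W_out, b_out, mask):
    """Same MLP, but with the ReLU "mask" (which neurons fire) frozen.
    With mask held fixed, this is an affine function of x."""
    pre = W_in @ x + b_in
    return W_out @ (mask * pre) + b_out

# Freeze the mask on a reference input x0; on x0 the two versions agree,
# and the frozen version is linear (affine) in its input.
rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32
W_in, b_in = rng.normal(size=(d_mlp, d_model)), rng.normal(size=d_mlp)
W_out, b_out = rng.normal(size=(d_model, d_mlp)), rng.normal(size=d_model)
x0 = rng.normal(size=d_model)
mask = (W_in @ x0 + b_in > 0).astype(float)
assert np.allclose(mlp_relu(x0, W_in, b_in, W_out, b_out),
                   mlp_frozen_mask(x0, W_in, b_in, W_out, b_out, mask))
```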
The issue is that to really leverage this for understanding, one probably needs to understand the information that was frozen. For example, for attention layers one still needs to look at the attention patterns and figure out why they attend where they do (the QK circuit), or at least have an empirical theory (e.g. “induction heads attend to previous copies shifted one forward”).
Linearizing MLP layers requires you to freeze way more information than attention layers, which makes it harder to “hold in your head.” Additionally, once you understand the information you’ve frozen, you have a theory of the neurons and could proceed via the original circuit-style approach to MLP neurons. In any case, it’s exciting to see other people thinking about this stuff. Happy holidays, and good luck if you’re thinking about this more!
Thanks! Enjoy your holidays!
Well, now I feel kind of dumb (for misremembering how LayerNorm works). I’ve actually spent the past day since making the video wondering why information leakage of the form you describe doesn’t occur in most transformers, so it’s honestly kind of a relief to realize this.
It seems to me that ReLU is a reasonable approximation of GELU, even for networks that actually use GELU. So one can think of GELU(x) = xΦ(x) as just having a slightly messy mask function (Φ(x)) that is sort-of-well approximated by ReLU’s binary mask function.
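(A quick sketch of what I mean, in NumPy/SciPy: GELU applies the soft mask Φ(x) to x, while ReLU applies the hard mask 1[x > 0]; away from zero the two masks nearly coincide.)

```python
import numpy as np
from scipy.stats import norm

def relu(x):
    # hard binary mask 1[x > 0], applied multiplicatively to x
    return np.where(x > 0, 1.0, 0.0) * x

def gelu(x):
    # soft mask Phi(x) (standard normal CDF), applied to x
    return norm.cdf(x) * x

xs = np.linspace(-4, 4, 9)
print(np.column_stack([xs, relu(xs), gelu(xs)]))
# Away from x ~ 0 the soft mask Phi(x) saturates to 0 or 1,
# so GELU closely tracks ReLU; they differ mainly near zero.
```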
Is this video still available somewhere?