An excellent question. I know those were hypotheses in one or more of the mechanistic interpretability papers I’ve read over the last year or so, or something I pieced together from a combination of several of them, but I’m afraid I don’t recall the exact source, nor was I able to find it when I was writing this, which is why I didn’t add a link. I think the first-half-encoding/second-half-decoding part is fairly widespread and I’ve seen it in several places. However, searching for it on Google, the closest I could find was from the paper Softmax Linear Units (back in 2022):
In summary, the general pattern of observations across layers suggests a rough layout where early layers “de-tokenize,” mapping tokens to fairly concrete concepts (phrases like “machine learning” or words when used in a specific language), the middle of the network deals in more abstract concepts such as “any clause that describes music,” and the later portions of the network “re-tokenize,” converting concrete concepts back into literal tokens to be output. All of this is very preliminary and requires much more detailed study to draw solid conclusions. However, our experience in vision was that having a sense of what kinds of features tend to exist at different layers was very helpful as high-level orientation for understanding models. It seems promising that we may be developing something similar here.
which is not quite the same thing, though there is some resemblance. There’s also a relation to the encoding and decoding concepts of sections 2 and 3 of the recent, more theoretical paper White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?, though that paper doesn’t make it clear that equal numbers of layers are required. (That would also explain why the behaviors of so-called “decoder-only” and “encoder-decoder” transformer models are so similar.)
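If you want to poke at the early-layers-encode / late-layers-decode picture yourself, a simple hands-on illustration is a “logit lens”-style probe: project each layer’s residual stream through the final LayerNorm and the unembedding, and see what the model “would predict” if it stopped at that layer. Here’s a minimal sketch using Neel Nanda’s TransformerLens library (mentioned below); the model and prompt are just placeholders, and this is my own illustration rather than code from any of the papers above:

```python
# Rough "logit lens" probe: decode each layer's residual stream through the
# final LayerNorm + unembedding to see what the model "would predict" there.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small model, CPU is fine

prompt = "The Eiffel Tower is in the city of"
tokens = model.to_tokens(prompt)
with torch.no_grad():
    _, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]               # [batch, pos, d_model]
    layer_logits = model.unembed(model.ln_final(resid))
    top_token = layer_logits[0, -1].argmax().item()  # top prediction at the last position
    print(f"layer {layer:2d}: {model.tokenizer.decode([top_token])!r}")
```

If the rough layout described in the quote is right, the intermediate predictions should only start to resemble the final output token in the later, “re-tokenizing” layers, while the earlier layers mostly track the input tokens themselves.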
The “baseline before applying bias” part was, I think, from one of the papers on lie detection, latent knowledge extraction, and/or bias, of which there has been a whole series this year, some from Paul Christiano’s team and some from others.
On where to read more, I’d suggest starting with the Anthropic research blog, where they discuss their research papers from the last year or so: roughly 40% of those are on mechanistic interpretability, and there’s always a blog post summary aimed at a science-interested lay reader, with a link to the actual paper. There’s also some excellent work coming from other places, such as Neel Nanda, who has a similar blog, and the ELK work under Paul Christiano. Overall we’ve made quite a bit of progress on interpretability in the last 18 months or so, though there’s still a long way to go.