[Question] Transformer Mech Interp: Any visualizations?
After getting to the part of one of Neel Nanda's interp demos that introduces the Logit Lens and Layer Attribution, I'm having trouble visualizing them the way I could for simpler concepts (e.g. the residual stream, which I understood mainly by drawing it out). Does anybody have good resources with illustrations? (I know Nanda has a great, colorful run-through, but it was only at a high level and for encoder-decoder models, so it doesn't generalize.)