Thanks for the correction! What I meant was that figure 7 is better modeled as "these neurons are not monosemantic", since their co-activation has a consistent effect (upweighting 9) that isn't captured by any individual component, and (I predict) these neurons would do different things on different prompts.
But I think I see where you're coming from now, so the above is tangential. You're just decomposing the logits using the previous layers' components. So even though the intermediate layers' logit contributions won't make any sense on their own (as in the tuned lens), that's fine.
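To make sure I'm picturing the same decomposition, here's a minimal sketch of what I mean: each component (embedding, each layer's output) writes into the residual stream, and if you freeze the final LayerNorm's scale on the full residual stream, the map to logits becomes linear, so the logit for a token splits exactly into per-component contributions. All names, shapes, and the lumping of attention+MLP into one write per layer are made up for illustration.

```python
import numpy as np

# Hypothetical sizes; a real model would have much larger d_model / d_vocab.
d_model, d_vocab, n_layers = 8, 50, 4
rng = np.random.default_rng(0)

# Per-component writes to the residual stream at the final position:
# the embedding plus one output per layer (attention + MLP lumped together).
components = {"embed": rng.normal(size=d_model)}
for l in range(n_layers):
    components[f"layer_{l}"] = rng.normal(size=d_model)

W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

resid_final = sum(components.values())

# Freeze the final LayerNorm scale computed on the full residual stream,
# so centering + scaling is a fixed linear map and the decomposition is exact.
scale = np.sqrt(resid_final.var() + 1e-5)

def to_logits(x):
    return ((x - x.mean()) / scale) @ W_U

token = 9  # e.g. the token being upweighted in the figure-7 discussion
contribs = {name: to_logits(vec)[token] for name, vec in components.items()}

print("total logit for token:", to_logits(resid_final)[token])
print("sum of per-component contributions:", sum(contribs.values()))
print(contribs)  # shows which components up- vs down-weight the token
```

Under this framing, an individual layer's contribution isn't meant to be a sensible "prediction" by itself (unlike the tuned lens); it's just one additive term in the final logit.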
Your example of the first two layers counteracting each other is interesting. Surely this isn't true in general, but it could be a common theme of later layers counteracting bigram statistics (what the embedding is doing?) based on context.