To give my speculation (though I upvoted):
I believe this work makes sense overall, e.g. let's do logit lens but for individual model components, but it does not compare to baselines or mention SAEs.
Specifically, what would this method be useful for?
The post claims that with logit prisms, we can closely examine how the input embeddings, attention heads, and MLP neurons each contribute to the final output.
If it's for isolating which model components are causally responsible for a task (e.g. addition or Q&A), then does it improve on patching in different activations for those components (or on the linear approximation, attribution patching)? In what way?
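(For reference, by attribution patching I mean the standard first-order approximation to activation patching: with a metric computed on the corrupted prompt, the effect of patching in a component's clean activation $a$ is estimated as

$$\Delta\,\mathrm{metric} \;\approx\; \big(a_{\mathrm{clean}} - a_{\mathrm{corrupt}}\big)\cdot \frac{\partial\,\mathrm{metric}}{\partial a}\Big|_{\mathrm{corrupt}},$$

which you get for every component at once from one clean forward pass, one corrupted forward pass, and one backward pass.)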
Additionally, this post did assume MLP neurons are monosemantic, which isn't true. This is why we use sparse autoencoders to deal with superposition.
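(Roughly, an SAE swaps the neuron basis for an overcomplete dictionary of features, reconstructing an activation $x$ as

$$x \;\approx\; b_{\mathrm{dec}} + \sum_i f_i(x)\, d_i,$$

where the $d_i$ are learned feature directions and the feature activations $f_i(x)$ are non-negative and mostly zero, so each active feature can be interpretable even when individual neurons are not.)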
A final problem is that logit attribution with the logit lens doesn't always work out, as shown by the cited Tuned Lens paper (e.g. directly unembedding early layers usually produces nonsense about which logits are upweighted).
I did upvote, however, because I think the standards for a blog post on LW should be lower. Thank you Raemon also for asking for details, because it sucks to get downvoted and not be told why.
Thank you for the upvote! My main frustration with the logit lens and tuned lens is that these methods are somewhat ad hoc and do not reflect component contributions in a mathematically sound way. We should be able to rewrite the output as a sum of individual terms, I told myself.
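Concretely, the kind of rewrite I have in mind (sketching in standard pre-LN transformer notation): with $e$ the embedding, $a_{\ell j}$ the output of head $j$ in layer $\ell$, $m_\ell$ the output of layer $\ell$'s MLP, and $W_U, b_U$ the unembedding, the final residual stream and logits are

$$h \;=\; e + \sum_{\ell, j} a_{\ell j} + \sum_{\ell} m_{\ell}, \qquad \mathrm{logits} \;=\; W_U\,\mathrm{LN}(h) + b_U.$$

Treating the final layer norm's scale $\sigma$ as a constant for the given input makes $\mathrm{LN}$ affine, so the logits split into a sum of per-component contributions:

$$\mathrm{logits} \;\approx\; \sum_{c \,\in\, \{e,\; a_{\ell j},\; m_\ell\}} W_U\,\frac{\gamma \odot (c - \bar{c})}{\sigma} \;+\; W_U \beta + b_U,$$

where $\bar{c}$ is the mean of $c$ over the hidden dimension and $\gamma, \beta$ are the layer-norm parameters.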
For the record, I did not assume MLP neurons are monosemantic or polysemantic, and this is why I did not mention SAEs.
Thanks for the correction! What I meant was that figure 7 is better modeled as "these neurons are not monosemantic", since their co-activation has a consistent effect (upweighting 9) which isn't captured by any individual component, and (I predict) these neurons would do different things on different prompts.
But I think I see where you're coming from now, so the above is tangential. You're just decomposing the logits using previous layers' components. So even though intermediate layers' logit contributions won't make sense on their own (as the tuned lens work shows), that's fine.
Your example of the first two layers counteracting each other is interesting. Surely this isn't true in general, but it could be a common theme of later layers counteracting bigrams (which is what the embedding is doing?) based on context.