Hi Lee, if I may ask, when you say “geometric analysis” of the router, do you mean analysis of the parameters or of the activations? Are there any papers that perform the sort of analysis you’d like to see done? I’m asking as someone who understands NNs thoroughly but is new to mech interp.
phenomanon
Composition Circuits in Vision Transformers (Hypothesis)
Thank you for the answer, that makes more sense.
For a batch with activations, we first compute vectors and . f represents what proportion of activations are sent to each expert.
Hi, I’m not exactly sure where f fits in here. In Figure 1/section 2.2, it seems like x is fed into the router layer, which produces a distribution over the N experts, from which the “best expert” is chosen. I’m not sure where the “proportion of activations” is in that process. To me that sounds like it’s describing something that would be multiplied by x before it’s fed into an expert, but I don’t see that reflected in the diagram or described in section 2.2.
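One possible reading of the quoted sentence, sketched here under the assumption of top-1 routing as described above (the router yields a distribution over the N experts and the best one is chosen): f would be a statistic of the whole batch, counting what fraction of the activations ended up at each expert, rather than something multiplied by x. The function name `routing_fractions` and the example logits are hypothetical, made up only for illustration.

```python
import math


def softmax(logits):
    """Turn one activation's router logits into a distribution over experts."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_fractions(router_logits):
    """Return f, where f[i] is the fraction of activations routed to expert i.

    router_logits: one list of logits per activation, one logit per expert.
    Assumes top-1 routing: each activation goes to its highest-probability expert.
    """
    num_experts = len(router_logits[0])
    counts = [0] * num_experts
    for logits in router_logits:
        probs = softmax(logits)
        best = max(range(num_experts), key=lambda i: probs[i])
        counts[best] += 1
    return [c / len(router_logits) for c in counts]


# Hypothetical batch: 4 activations, 3 experts.
batch_logits = [
    [2.0, 0.1, -1.0],   # routed to expert 0
    [0.3, 1.5, 0.0],    # routed to expert 1
    [1.2, 0.2, 0.1],    # routed to expert 0
    [-0.5, 0.0, 2.2],   # routed to expert 2
]
f = routing_fractions(batch_logits)
print(f)  # [0.5, 0.25, 0.25]
```

Under this reading, f only exists once a whole batch has been routed, which may be why it does not show up in a per-input diagram like Figure 1.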
Thank you very much for your reply. I appreciate the commentary and direction.