For a batch B with T activations, we first compute vectors f ∈ ℝ^N and P ∈ ℝ^N. f represents what proportion of activations are sent to each expert
Hi, I’m not exactly sure where f fits in here. In Figure 1/section 2.2, it seems like x is fed into the router layer, which produces a distribution over the N experts, from which the “best expert” is chosen. I’m not sure where the “proportion of activations” is in that process. To me that sounds like it’s describing something that would be multiplied by x before it’s fed into an expert, but I don’t see that reflected in the diagram or described in section 2.2.
Thanks for the question -- f is calculated over an entire batch of inputs, not a single x. Figure 1 shows how the Switch SAE processes a single residual stream activation x.
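To make the batch-level computation concrete, here is a minimal NumPy sketch of how f could be computed. The function name `expert_fractions` and the toy `router_probs` values are made up for illustration, and it assumes the "best expert" is simply the argmax of the router distribution for each activation; it is not the actual training code.

```python
import numpy as np

def expert_fractions(router_probs):
    """Compute f for one batch.

    router_probs: array of shape (T, N), one router distribution per
    activation (hypothetical input; rows are assumed to sum to 1).
    Returns f of shape (N,): the fraction of the T activations that
    are routed to each of the N experts.
    """
    T, N = router_probs.shape
    chosen = router_probs.argmax(axis=1)        # best expert for each activation, shape (T,)
    counts = np.bincount(chosen, minlength=N)   # how many activations each expert receives
    return counts / T                           # proportions over the batch, sums to 1

# Toy batch: T = 4 activations, N = 3 experts.
router_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])
f = expert_fractions(router_probs)
print(f)
```

In this toy batch three of the four activations pick expert 0 and one picks expert 1, so f is (0.75, 0.25, 0). Nothing here is multiplied into x; f is only a per-batch statistic about where the activations went.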
Thank you for the answer, that makes more sense.