phenomanon comments on Efficient Dictionary Learning with Switch Sparse Autoencoders

phenomanon 23 Jul 2024 19:19 UTC
2 points
0
"For a batch $B$ with $T$ activations, we first compute vectors $f \in \mathbb{R}^{N}$ and $P \in \mathbb{R}^{N}$. $f$ represents what proportion of activations are sent to each expert"
Hi, I’m not exactly sure where $f$ fits in here. In Figure 1 and Section 2.2, it seems that $x$ is fed into the router layer, which produces a distribution over the $N$ experts, from which the “best expert” is chosen. I don’t see where the “proportion of activations” enters that process. To me it sounds like something that would be multiplied by $x$ before it is fed into an expert, but I don’t see that reflected in the diagram or described in Section 2.2.
- Anish Mudide 23 Jul 2024 20:44 UTC
  4 points
  0
  Thanks for the question: $f$ is calculated over an entire batch of inputs, not a single $x$. Figure 1 shows how the Switch SAE processes a single residual stream activation $x$.
  - phenomanon 24 Jul 2024 18:49 UTC
    2 points
    0
    Thank you for the answer, that makes more sense.
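
To make the distinction in this thread concrete, here is a minimal sketch of how a batch-level $f$ could be computed, assuming top-1 routing in which each activation goes to its highest-probability expert. The function name `expert_fractions`, the variable names, and the toy router probabilities are all hypothetical and are not taken from the post's code. The point is only that $f$ is a length-$N$ statistic over all $T$ activations in the batch, while the per-$x$ path in Figure 1 picks a single expert.

```python
from collections import Counter


def expert_fractions(router_probs):
    """Return f, where f[i] is the share of activations routed to expert i.

    router_probs: list of T rows, one per activation in the batch; each row
    is a length-N probability distribution over experts (hypothetical input).
    """
    T = len(router_probs)
    N = len(router_probs[0])
    # Assumed top-1 routing: each activation is sent to its argmax expert.
    chosen = [max(range(N), key=lambda i: row[i]) for row in router_probs]
    counts = Counter(chosen)
    # f is a batch statistic: one entry per expert, summing to 1.
    return [counts[i] / T for i in range(N)]


# Toy batch with T = 4 activations and N = 3 experts (made-up numbers).
batch = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
]
f = expert_fractions(batch)
print(f)
```

In this toy batch, three of the four activations are routed to expert 0 and one to expert 1, so $f$ describes how the batch as a whole is spread across experts. It is not multiplied into any individual $x$.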