Thanks for the comment. I trained TopK SAEs with various widths (all fitting within a single GPU) and observed that wider SAEs take substantially longer to train, which leads me to believe that the encoder forward pass is a major bottleneck for wall-clock time. The Switch SAE also improves memory efficiency, because we do not need to store all M latents.
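For concreteness, here is a minimal PyTorch sketch of the comparison (not our actual training code; the module and parameter names such as `DenseTopKEncoder`, `SwitchTopKEncoder`, `m_latents`, and `n_experts` are illustrative): the dense TopK encoder computes and stores all M pre-activations per token, while a Switch-style encoder routes each token to a single expert and only touches M / E of them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTopKEncoder(nn.Module):
    """Baseline TopK SAE encoder: every token touches all M latents."""
    def __init__(self, d_model: int, m_latents: int, k: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, m_latents) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(m_latents))
        self.k = k

    def forward(self, x):                      # x: [batch, d_model]
        pre = x @ self.W_enc + self.b_enc      # [batch, M] -> cost scales with M
        vals, idx = pre.topk(self.k, dim=-1)   # keep only the top-k activations
        return F.relu(vals), idx

class SwitchTopKEncoder(nn.Module):
    """Switch-style encoder: route each token to one expert of width M / E,
    so only M / E pre-activations are computed and stored per token."""
    def __init__(self, d_model: int, m_latents: int, k: int, n_experts: int):
        super().__init__()
        assert m_latents % n_experts == 0
        self.experts = nn.ModuleList(
            DenseTopKEncoder(d_model, m_latents // n_experts, k)
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                   # x: [batch, d_model]
        expert_id = self.router(x).argmax(dim=-1)           # top-1 routing
        k = self.experts[0].k
        out_vals = torch.zeros(x.shape[0], k, device=x.device)
        out_idx = torch.zeros(x.shape[0], k, dtype=torch.long, device=x.device)
        for e, expert in enumerate(self.experts):
            mask = expert_id == e
            if mask.any():
                vals, idx = expert(x[mask])                 # only M / E latents touched
                out_vals[mask], out_idx[mask] = vals, idx
        return out_vals, out_idx
```

In the dense case both the encoder matmul and the stored pre-activations scale with M, whereas the routed version scales with M / E, which is where the wall-clock and memory savings come from.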
I’m currently working on implementing expert parallelism, which I hope will lead to substantial improvements in wall-clock time.