Thanks for the comment. I trained TopK SAEs with various widths (all fitting within a single GPU) and observed that wider SAEs take substantially longer to train, which leads me to believe that the encoder forward pass is a major bottleneck for wall-clock time. The Switch SAE also improves memory efficiency, because we do not need to store all M latents.
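For concreteness, here is a minimal PyTorch sketch of the comparison (not our actual training code; the module and parameter names such as `DenseTopKEncoder`, `SwitchTopKEncoder`, `m_latents`, and `n_experts` are illustrative): the dense TopK encoder computes and stores all M pre-activations per token, while a Switch-style encoder routes each token to a single expert and only touches M / E of them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTopKEncoder(nn.Module):
    """Baseline TopK SAE encoder: every token touches all M latents."""
    def __init__(self, d_model: int, m_latents: int, k: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, m_latents) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(m_latents))
        self.k = k

    def forward(self, x):                      # x: [batch, d_model]
        pre = x @ self.W_enc + self.b_enc      # [batch, M] -> cost scales with M
        vals, idx = pre.topk(self.k, dim=-1)   # keep only the top-k activations
        return F.relu(vals), idx

class SwitchTopKEncoder(nn.Module):
    """Switch-style encoder: route each token to one expert of width M / E,
    so only M / E pre-activations are computed and stored per token."""
    def __init__(self, d_model: int, m_latents: int, k: int, n_experts: int):
        super().__init__()
        assert m_latents % n_experts == 0
        self.experts = nn.ModuleList(
            DenseTopKEncoder(d_model, m_latents // n_experts, k)
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                   # x: [batch, d_model]
        expert_id = self.router(x).argmax(dim=-1)           # top-1 routing
        k = self.experts[0].k
        out_vals = torch.zeros(x.shape[0], k, device=x.device)
        out_idx = torch.zeros(x.shape[0], k, dtype=torch.long, device=x.device)
        for e, expert in enumerate(self.experts):
            mask = expert_id == e
            if mask.any():
                vals, idx = expert(x[mask])                 # only M / E latents touched
                out_vals[mask], out_idx[mask] = vals, idx
        return out_vals, out_idx
```

In the dense case both the encoder matmul and the stored pre-activations scale with M, whereas the routed version scales with M / E, which is where the wall-clock and memory savings come from.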
I’m currently working on implementing expert parallelism, which I hope will lead to substantial improvements in wall-clock time.