Nice work, these seem like interesting and useful results!
High-level question/comment which might be totally off: one benefit of having a single, large SAE neuron space that each token gets projected into is that features don't get in each other's way, except insofar as you're imposing sparsity. For example, your "I'm inside a parenthetical" feature and your "I'm attempting a coup" feature will both activate in the SAE hidden layer, as long as they're both among the top-k features (for whatever sparsity you choose). But introducing Switch SAEs breaks that: if these two features live in different experts, only one of them can activate in the SAE hidden layer (depending on what your gating learned).
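To make the worry concrete, here's a minimal sketch of the two encoders in PyTorch. All names, shapes, and the hard top-1 routing are my own assumptions for illustration, not code from the post:

```python
import torch
import torch.nn.functional as F

d_model, n_feats, k = 512, 4096, 32
n_experts = 8
feats_per_expert = n_feats // n_experts

x = torch.randn(1, d_model)  # one token's residual-stream activation

# --- plain top-k SAE: all n_feats features compete for the k active slots ---
W_enc = torch.randn(d_model, n_feats)
pre = F.relu(x @ W_enc)                                   # (1, n_feats)
topk = torch.topk(pre, k, dim=-1)
acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)

# --- Switch-style SAE: a router picks one expert; only that expert's features exist for this token ---
W_router = torch.randn(d_model, n_experts)
expert = torch.argmax(x @ W_router, dim=-1)               # hard top-1 routing
W_enc_experts = torch.randn(n_experts, d_model, feats_per_expert)
pre_e = F.relu(x @ W_enc_experts[expert.item()])          # (1, feats_per_expert)
topk_e = torch.topk(pre_e, k, dim=-1)
acts_e = torch.zeros_like(pre_e).scatter_(-1, topk_e.indices, topk_e.values)
# Features owned by the other n_experts - 1 experts are implicitly zero here,
# which is exactly the "features in different experts can't co-fire" concern.
```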
The obvious reply is "but look at the empirical results, you fool! The Switch SAEs are pretty good!" And that's fair. I weakly expect that what's happening in your experiment is that similar but slightly specialized features are being learned by each expert (a testable hypothesis), and maybe you get enough of this redundancy that it's fine, e.g., the expert with "I'm inside a parenthetical" also has a "words relevant to coups" feature, and that's enough signal for coup detection in that expert.
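One cheap way to probe that hypothesis might be to compare decoder directions across experts and look for near-duplicates, something like the sketch below (tensor names, shapes, and the 0.9 threshold are placeholder assumptions, not from the post):

```python
import torch
import torch.nn.functional as F

d_model, feats_per_expert = 512, 512
# Unit-normalized decoder directions for two experts (random stand-ins here;
# in practice you'd load the trained Switch SAE's expert decoder weights).
W_dec_a = F.normalize(torch.randn(feats_per_expert, d_model), dim=-1)
W_dec_b = F.normalize(torch.randn(feats_per_expert, d_model), dim=-1)

cos = W_dec_a @ W_dec_b.T              # (feats_A, feats_B) cosine similarities
best_match = cos.max(dim=-1).values    # best cross-expert match per expert-A feature
print(f"fraction of expert-A features with a >0.9 match in expert B: "
      f"{(best_match > 0.9).float().mean():.3f}")
```

A heavy right tail of near-1 similarities would suggest the experts are relearning overlapping features rather than cleanly partitioning the feature space.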
Again, maybe this worry is totally off or I’m misunderstanding something.
Thanks for your comment! I believe your concern was echoed by Lee and Arthur in their comments and is completely valid. This work is primarily a proof-of-concept that we can successfully scale SAEs by directly applying MoE, but I suspect that we will need to make tweaks to the architecture.