[Question] Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers?
Intuitively, I would expect Mixture-of-Experts (MoE) models (e.g. https://arxiv.org/abs/2101.03961) to be a lot more interpretable than dense transformers:
The complexity of an interconnected system grows much faster than linearly with the number of connected units; it is probably at least quadratic. Thus, studying one system with n units is a priori much harder than studying 5 systems with n/5 units each. In practice, MoE transformers seem to require at least an order of magnitude more parameters than dense transformers for similar capabilities, but I still expect the summed complexity of the experts to be much lower than the complexity of a single dense transformer.
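As a rough sketch of the arithmetic behind this argument, the snippet below compares the summed "complexity" of k experts with that of one dense model. Everything in it is an assumption made for illustration: complexity is modeled as units ** exponent (quadratic by default), the experts are equal-sized and studied independently, and routing and shared layers are ignored. The function names and the `param_overhead` parameter are hypothetical.

```python
def interaction_cost(units: float, exponent: float = 2.0) -> float:
    """Hypothetical complexity proxy: cost grows as units ** exponent."""
    return units ** exponent


def moe_to_dense_cost_ratio(num_experts: int, param_overhead: float = 1.0,
                            exponent: float = 2.0) -> float:
    """Ratio of summed per-expert cost to the cost of one dense model.

    Assumes the MoE model has `param_overhead` times as many units as the
    dense model, split evenly across `num_experts` experts that are studied
    independently (routing and shared layers are ignored).
    """
    dense_units = 1.0  # normalised; the ratio does not depend on n
    expert_units = param_overhead * dense_units / num_experts
    moe_cost = num_experts * interaction_cost(expert_units, exponent)
    return moe_cost / interaction_cost(dense_units, exponent)


# Same total size, 5 experts, quadratic cost: 5 * (1/5)^2, roughly 0.2
print(moe_to_dense_cost_ratio(num_experts=5))
# 10x more parameters, 5 experts: 5 * (10/5)^2 = 20
print(moe_to_dense_cost_ratio(num_experts=5, param_overhead=10.0))
# 10x more parameters, 128 experts: 128 * (10/128)^2, roughly 0.78
print(moe_to_dense_cost_ratio(num_experts=128, param_overhead=10.0))
```

Under this toy quadratic model the ratio is c^2 / k for a parameter overhead c and k experts, so with a 10x overhead the summed expert complexity only drops below the dense model's once there are more than about 100 experts. Whether the real scaling is quadratic, and whether experts can really be studied in isolation, is exactly what the question is asking about.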
MoE forces specialization and thus gives a strong prior on what a given set of neurons is doing. Having such a prior is probably very helpful for making faster progress in mechanistic interpretability.
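To make the "prior" concrete, here is a minimal sketch of top-1 routing, the scheme used in the Switch Transformer paper linked above, written in plain Python. The router weights and token vectors are random placeholders, the helper names are hypothetical, and details such as the load-balancing loss and expert capacity are left out. The point is only that each token is sent to exactly one expert, so the routing log itself records which inputs each expert handles.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector.

    router_weights[e] is the (hypothetical) router row for expert e.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def expert_assignments(tokens, router_weights):
    """Group token indices by the expert that top-1 routing selects."""
    groups = {e: [] for e in range(len(router_weights))}
    for i, token in enumerate(tokens):
        expert, _ = route_top1(token, router_weights)
        groups[expert].append(i)
    return groups


# Random placeholder data: 8 tokens of dimension 4, routed across 3 experts.
rng = random.Random(0)
dim, num_experts, num_tokens = 4, 3, 8
router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
print(expert_assignments(tokens, router))
```

In a trained model, one could inspect the tokens collected in each group to form a first hypothesis about what that expert does. Whether those groups turn out to be human-interpretable is an empirical matter that this sketch does not settle.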
So my question is: do you think MoEs are more interpretable than dense transformers, and is there any evidence for or against this (e.g. papers or past LW posts)?
I think this question matters because it doesn’t seem implausible to me that MoE models could be on par with dense models in terms of capabilities. If so, and if we had strong evidence that they are a lot more interpretable, they could be an avenue worth pursuing or promoting. You can see more tentative thoughts on this here: https://twitter.com/Simeon_Cps/status/1609139209914257408?s=20