If I’m not mistaken, MoE models don’t change the architecture that much, because the number of experts is low (10-100), while the number of neurons per expert is still high (100-10k).
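To make the shape concrete, here is a minimal sketch (in PyTorch, with illustrative dimensions and a simple top-1 router that I made up for the example, not the details of any particular model) of what an MoE feed-forward layer looks like: a handful of experts, each of which is still a full MLP with thousands of hidden neurons.

```python
# Minimal illustrative sketch of a top-1 routed MoE feed-forward layer.
# Dimensions are hypothetical, chosen to match the orders of magnitude above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, n_experts=16):
        super().__init__()
        # Few experts (~10-100), each still a full MLP with many hidden neurons (~100-10k).
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                 # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        chosen = gate.argmax(dim=-1)                      # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = chosen == i
            if mask.any():                                # only the selected expert runs on each token
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out, chosen
```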
This is why I don’t think your first argument is powerful: the current bottleneck is interpreting any “small” model well (e.g. GPT-2-small), and dividing the number of neurons of GPT-3 by 100 won’t help, because nobody can interpret even the models that are already 100 times smaller.
That said, I think your second argument is valid: it might make interp easier for some tasks, especially if the breakdown of work across experts matches our intuitive human understanding, which could make interpreting some behaviors of a large MoE easier than interpreting the same behaviors in a small Transformer.
But I don’t expect these kinds of understanding to transfer well to understanding Transformers in general, so I’m not sure it’s high priority.
> But I don’t expect these kinds of understanding to transfer well to understanding Transformers in general, so I’m not sure it’s high priority.
The point is not necessarily to improve our understanding of Transformers in general, but that if we’re pessimistic about interpretability on dense transformers (like markets are, see below), we might be better off speeding up capabilities on architectures we think are a lot more interpretable.
I’m not saying that MoEs are more interpretable in general. I’m saying that for some tasks, the high-level view of “which expert is active when and where” may be enough to get a good sense of what is going on.
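As a toy illustration of that high-level view (reusing the hypothetical MoEFeedForward sketch from my earlier comment, with made-up tokens and random activations), it is just a table of router decisions:

```python
# Which expert fires on which token: tabulate the router's top-1 choice per position.
import torch

tokens = ["The", "derivative", "of", "x^2", "is", "2x"]
layer = MoEFeedForward(d_model=1024, d_hidden=4096, n_experts=16)
x = torch.randn(len(tokens), 1024)        # stand-in for real residual-stream activations
with torch.no_grad():
    _, chosen = layer(x)
for tok, e in zip(tokens, chosen.tolist()):
    print(f"{tok:>12} -> expert {e}")     # the hope is that e.g. math-y tokens share an expert
```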
In particular, I’m almost as pessimistic about finding “search”, “reward functions”, “world models”, or “the idea of lying to a human for instrumental reasons” in MoEs as in regular Transformers. The intuition is that, for interp, the expert structure is about as useful as the fact that each attention layer has multiple heads doing “different discrete things” (though the heads do their things in parallel). The fact that there are multiple heads helps you a bit, but not that much.
This is why I care about the transferability of what you learn when it comes to MoEs.
Maybe MoE + something else could add some safeguards though (in particular, it might be easier to do targeted ablations on an MoE than on a regular Transformer), but I would be surprised if any safety benefit came from “interp on MoE goes brr”.
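For instance, a targeted expert ablation on the hypothetical sketch layer from above might look like this (again just an illustrative sketch, not a claim about how any real MoE is implemented): mask one expert out of the routing and renormalize, which is more surgical than ablating an arbitrary set of neurons in a dense MLP.

```python
# Ablate a single expert by removing it from the routing distribution.
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_with_expert_ablated(layer, x, ablated_expert):
    gate = F.softmax(layer.router(x), dim=-1)     # (n_tokens, n_experts)
    gate[:, ablated_expert] = 0.0                 # remove the target expert from the routing
    gate = gate / gate.sum(dim=-1, keepdim=True)  # renormalize over the remaining experts
    chosen = gate.argmax(dim=-1)
    out = torch.zeros_like(x)
    for i, expert in enumerate(layer.experts):
        mask = chosen == i
        if mask.any():
            out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
    return out
```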