But I don’t expect these kinds of understanding to transfer well to understanding Transformers in general, so I’m not sure it’s high priority.
The point is not necessarily to improve our understanding of Transformers in general, but that if we’re pessimistic about interpretability on dense Transformers (like markets are; see below), we might be better off speeding up capabilities on architectures we think are a lot more interpretable.
I’m not saying that MoEs are more interpretable in general. I’m saying that for some tasks, the high-level view of “which expert is active when and where” may be enough to get a good sense of what is going on (a toy sketch of what that view looks like is below).
In particular, I’m almost as pessimistic about finding “search”, or “reward functions”, or “world models”, or “the idea of lying to a human for instrumental reasons” in MoEs as in regular Transformers. The intuition is that, for interpretability, the expert structure of an MoE helps about as much as the fact that each attention layer has multiple heads doing “different discrete things” (though the heads act in parallel). Having multiple heads helps you a bit, but not that much.
This is why I care about the transferability of what you learn from MoEs.
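To make the “which expert is active when and where” view concrete, here is a minimal sketch of a toy top-1 MoE layer that exposes its routing decisions. Everything here (the ToyMoELayer class, its sizes, the top-1 routing) is made up for illustration and is not taken from any particular MoE implementation.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy MoE feed-forward layer with top-1 routing, purely for illustration."""
    def __init__(self, d_model=16, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (seq_len, d_model). Top-1 routing: each token goes to its highest-scoring expert.
        expert_ids = self.router(x).argmax(dim=-1)  # (seq_len,)
        out = torch.stack([self.experts[int(e)](tok) for tok, e in zip(x, expert_ids)])
        return out, expert_ids  # expert_ids is the high-level routing trace

layer = ToyMoELayer()
tokens = torch.randn(8, 16)
_, routing = layer(tokens)
print("expert chosen at each token position:", routing.tolist())
```

The routing trace is the kind of coarse, discrete signal that is cheap to read off an MoE but has no obvious analogue in a dense MLP; whether it tells you anything useful about a given task is exactly the open question.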
Maybe MoE + something else could add some safeguards though (in particular, it might be easier to do targeted ablations on MoEs than on regular Transformers), but I would be surprised if any safety benefit came from “interp on MoE goes brr”.
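For the “targeted ablations” point, here is an equally rough sketch, reusing the hypothetical ToyMoELayer from the previous snippet: zero out one expert’s contribution and see which tokens are affected. This is just one crude notion of ablation under those toy assumptions, not a claim about how it would be done on a real model.

```python
import torch

def ablate_expert(layer, x, expert_to_ablate):
    """Run the toy MoE layer but zero the output of one expert (a crude targeted ablation)."""
    expert_ids = layer.router(x).argmax(dim=-1)
    outputs = []
    for tok, e in zip(x, expert_ids):
        if int(e) == expert_to_ablate:
            outputs.append(torch.zeros_like(tok))  # drop this expert's contribution entirely
        else:
            outputs.append(layer.experts[int(e)](tok))
    return torch.stack(outputs)

layer = ToyMoELayer()
tokens = torch.randn(8, 16)
baseline, routing = layer(tokens)
ablated = ablate_expert(layer, tokens, expert_to_ablate=0)
print("tokens routed to expert 0:", (routing == 0).nonzero(as_tuple=True)[0].tolist())
print("max change from ablating expert 0:", (baseline - ablated).abs().max().item())
```

The ablation only touches tokens the router sent to that expert, which is what makes it “targeted” compared to ablating neurons or directions inside a dense feed-forward block.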