I’m not saying that MoEs are more interpretable in general. I’m saying that for some tasks, the high-level view of “which expert is active when and where” may be enough to get a good sense of what is going on.
In particular, I’m almost as pessimistic about finding “search”, or “reward functions”, or “world models”, or “the idea of lying to a human for instrumental reasons” in MoEs as in regular Transformers. The intuition is that, for interp, having separate experts is about as useful as having multiple attention heads per attention layer doing “different discrete things” (though heads do their things in parallel). Having multiple heads helps you a bit, but not that much.
This is why I care about transferability of what you learn when it comes to MoEs.
Maybe MoE + something else could add some safeguards though (in particular, it might be easier to do targeted ablations on MoEs than on regular Transformers), but I would be surprised if any safety benefit came from “interp on MoE goes brr”.
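
To make the “which expert is active when and where” view and the targeted-ablation idea a bit more concrete, here is a toy sketch. It is not a real MoE: it assumes a single layer with top-1 routing and a random linear router, and every name in it (`make_router`, `route`, `routing_trace`) is made up for illustration. “Ablating” an expert here just means forbidding the router from picking it.

```python
import random

def make_router(num_experts, dim, seed=0):
    """Hypothetical router: one random weight vector per expert."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

def route(router, token, ablated=frozenset()):
    """Top-1 routing: index of the highest-scoring expert that is not ablated."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router]
    candidates = [e for e in range(len(router)) if e not in ablated]
    return max(candidates, key=lambda e: scores[e])

def routing_trace(router, tokens, ablated=frozenset()):
    """The high-level view: which expert handles each token position."""
    return [route(router, t, ablated) for t in tokens]

# Toy data: 32 random "token embeddings" of dimension 8, routed over 4 experts.
rng = random.Random(1)
router = make_router(num_experts=4, dim=8)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]

baseline = routing_trace(router, tokens)
# Targeted ablation (assumption: implemented as masking expert 0 in the router).
ablated_trace = routing_trace(router, tokens, ablated={0})

# Positions whose routing changed are exactly those that used expert 0.
changed = [i for i, (b, a) in enumerate(zip(baseline, ablated_trace)) if b != a]
print("baseline:", baseline)
print("ablated: ", ablated_trace)
print("positions rerouted:", changed)
```

The nice property is that the intervention is local: tokens that never used the ablated expert are routed exactly as before. That is the sense in which ablations might be easier to target than in a dense model, but it tells you nothing by itself about what the expert was computing.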