I think that the answer is no, and that this reflects a common mental barrier when dealing with gradient descent. You would like different experts to specialize in different things in a human-interpretable way, but Adam doesn’t care what you say you want. Adam only cares about what you actually write down in the loss function.
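To make that concrete, here is a minimal sketch of what a typical MoE training objective actually contains, assuming a Switch-Transformer-style load-balancing auxiliary loss (the function and argument names below are my own, purely illustrative). The only terms are next-token prediction and a penalty for routing tokens unevenly; nothing rewards experts for specializing along human-interpretable lines.

```python
import torch
import torch.nn.functional as F

def moe_training_loss(logits, targets, router_probs, expert_assignments,
                      num_experts, aux_weight=0.01):
    """Sketch of a typical MoE objective (Switch-Transformer-style balancing).

    logits, targets:     language-modeling outputs and labels
    router_probs:        [tokens, num_experts] softmax outputs of the router
    expert_assignments:  [tokens] index of the expert each token was sent to
    """
    # The term Adam actually optimizes: next-token prediction.
    lm_loss = F.cross_entropy(logits, targets)

    # Auxiliary load-balancing term: pushes the router toward spreading
    # tokens roughly evenly across experts. It says nothing about *what*
    # each expert should specialize in, only that traffic is balanced.
    fraction_routed = torch.bincount(expert_assignments, minlength=num_experts).float()
    fraction_routed = fraction_routed / expert_assignments.numel()
    mean_router_prob = router_probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(fraction_routed * mean_router_prob)

    return lm_loss + aux_weight * aux_loss
```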
Generally, a useful check when evaluating a line of thought like this is to ask yourself whether your justification for why something should happen would also justify something that is known not to happen. If so, it’s probably flawed.
In this case it does: as far as I can tell, your justification applies equally well to multi-headed attention (as an improvement over single-headed attention). While there have been some attempts to treat MHA as an interpretability-magnifying technique, in practice there hasn’t really been much success. Whatever story you tell about why this should work for MoE needs to distinguish MoE from MHA.
I think this question matters because it doesn’t seem implausible to me that MoE models could be on par with dense models in terms of capabilities.
There are two regimes when talking about scaling LLMs, and I think it’s very important to keep them separate when talking about things like this. The literature on scaling laws was written by researchers at a very small number of companies in a very important and non-standard situation: their analyses are predicated on the assumption that using twice as many GPUs for half as long doesn’t impact costs. It’s hard to overstate how few people fall into this regime.
I run EleutherAI, the non-profit org that has trained more and larger multi-billion parameter LLMs than any other non-profit in the world, and have worked on three different models that held the title “largest publicly available GPT-3-like LLM in the world.” I have access to thousands of A100 GPUs to train models if I really want to, and recently won a USG grant for 6 million V100 hours. I generally do not operate in this regime.
The regime that almost everyone finds themselves in is one where one day the VRAM runs out. Maybe it’s at a pair of 3090 Tis, maybe it’s at a v3-8 TPU, maybe it’s at a DGX machine. But one day you lose the ability to halve your runtime by doubling the amount of VRAM you are using without impacting costs.
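To make the distinction concrete, here is a toy cost model (all prices and numbers are hypothetical, purely for illustration): in the first regime, cost is GPUs × hours × price, so doubling the GPU count while halving wall-clock time is cost-neutral; in the second, a hard per-device memory ceiling binds before that trade can help you.

```python
# Toy cost model for the two regimes (illustrative numbers only).

GPU_HOUR_PRICE = 2.0      # hypothetical $/GPU-hour
VRAM_PER_GPU_GB = 24      # e.g. a 3090 Ti
MAX_GPUS = 2              # e.g. the pair of 3090 Tis in your workstation

def cost(num_gpus: int, hours: float) -> float:
    return num_gpus * hours * GPU_HOUR_PRICE

# Regime 1: double the GPUs, halve the wall-clock time, pay the same.
# This is the assumption baked into the scaling-law papers.
assert cost(num_gpus=64, hours=100) == cost(num_gpus=128, hours=50)

# Regime 2: the trade stops working the moment the model no longer fits
# in the memory you can actually attach to the job.
def fits(model_gb: float) -> bool:
    return model_gb <= MAX_GPUS * VRAM_PER_GPU_GB

print(fits(40))   # True: a ~40 GB model fits across two 24 GB cards
print(fits(200))  # False: no amount of extra wall-clock time fixes this
```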
In this “VRAM-constrained regime,” MoE models (trained from scratch) are nowhere near competitive with dense LLMs. While there has been some success at turning dense models into MoE models with relatively little performance loss, that work isn’t really relevant to your hypothesis without a substantial amount of additional intellectual work. MoE models are egregiously inefficient in terms of performance-per-VRAM, but compensate by being more efficient in terms of performance-per-FLOP.
How egregious exactly? Well, the first MoE paper I grabbed claims that their 1.1T parameter MoE model performs similarly to a 6.7B parameter dense model and that their 207B parameter MoE model performs similarly to a 1.3B parameter dense model. To put these numbers in perspective: the (currently unverified) claims NVIDIA is making about quantization on their H100 GPUs would enable you to fit a 640B parameter model on an 8xH100 (80GB) device. So you can use an entire 8xH100 machine to fit a MoE model, or you can use a single 3090 Ti and get better performance (using LLM.int8).
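For a rough sense of the gap, here is the back-of-the-envelope weight-memory arithmetic behind those numbers (my own arithmetic, not the paper’s; weights only, ignoring activations, optimizer state, and KV cache):

```python
# Back-of-the-envelope weight-memory arithmetic at 8-bit precision
# (1 byte/param), as with LLM.int8 or the claimed H100 FP8 support.

GiB = 1024**3

def weight_gib(n_params: float, bytes_per_param: float = 1.0) -> float:
    """Memory needed just to store the parameters."""
    return n_params * bytes_per_param / GiB

print(f"640B-param model:  {weight_gib(640e9):6.0f} GiB")  # ~596 GiB: the aggregate memory of an 8x80GB H100 node
print(f"1.1T-param MoE:    {weight_gib(1.1e12):6.0f} GiB") # ~1024 GiB: more than an entire 8x80GB node at 1 byte/param
print(f"207B-param MoE:    {weight_gib(207e9):6.0f} GiB")  # ~193 GiB: still a multi-GPU node
print(f"6.7B-param dense:  {weight_gib(6.7e9):6.0f} GiB")  # ~6 GiB: fits on a single 3090 Ti (24 GB)
print(f"1.3B-param dense:  {weight_gib(1.3e9):6.0f} GiB")  # ~1 GiB: fits on almost anything
```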
Edit: in a reply to the other answer, you say:
I’m not saying that MoE are more interpretable in general. I’m saying that for some tasks, the high level view of “which expert is active when and where” may be enough to get a good sense of what is going on.
I had misread your claim, but I think the intent of my response is still valid. Even with this more specific claim, people have hoped that the same is true for MHA and have come up (largely, albeit not entirely) empty. There’s still a significant burden on you to show why your position is better than the same position with the word “MoE” replaced with “MHA.”