I think the answer is no.
In this “VRAM-constrained regime,” MoE models (trained from scratch) are nowhere near competitive with dense LLMs.
Curious whether your high-level thoughts on these topics still hold or have changed.