I think the answer is no.
In this “VRAM-constrained regime,” MoE models (trained from scratch) are nowhere near competitive with dense LLMs.
Curious whether your high-level thoughts on these topics still hold or have changed.