I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps), and put it on my reading-list. (https://www.gwern.net/docs/ai/scaling/moe/2022-01-26-eyeonai-tangjiewudaointerview.pdf suggests they’re serious about using supercomputer-scale computers, but it looks like they haven’t yet done so or invested as much compute as Baidu with ERNIE Titan.) So it is not a major priority compared to trying to read all the papers on trained models...* (One reason I am skeptical of MoEs is that for all the Chinese investment into them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models will stumble over something like inner-monologues. Do their users show a terminal lack of imagination, is none of it getting translated or included in the papers, or are MoEs just not that great?)
* Even before Chinchilla, it was obvious that training a 1t, much less 100t, dense model to converged/compute-optimal performance is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal performance, my suspicion is that you probably shouldn’t’ve bothered in the first place, because if a 100t MoE trainable with a contemporary amount of FLOPs is the answer, then the question must be a bad one.