If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It’ll work awful compared to NNs, but you’ll have the most parameters!)
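As a rough illustration of the n-gram quip: a naive dense n-gram count table over a vocabulary of size V has V^n cells, so the "parameter count" balloons almost for free. This is only a sketch. The vocabulary size is an assumed round figure in the range of common BPE vocabularies, not something from this thread, and real n-gram models store only the n-grams they actually observe.

```python
def ngram_table_size(vocab_size: int, n: int) -> int:
    """Cells in a dense n-gram count table: one per possible n-gram."""
    return vocab_size ** n


VOCAB = 50_000  # assumed round vocabulary size, for illustration only

for n in (1, 2, 3, 4):
    # Each cell is one count/probability, i.e. one "parameter".
    print(f"{n}-gram table: {ngram_table_size(VOCAB, n):.2e} entries")
```

With this assumed vocabulary, a dense trigram table already has 1.25e14 (125 trillion) cells, which says nothing about how well it models text.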
A 280b model is nice, but I would definitely trade it for 3 175bs, assuming something interesting was done with them. For example, I would happily trade a fully-trained text Gopher for a Github 175b, a multilingual text 175b, and a DALL-E/Cogview/Make-A-Scene 175b (trained on text+VAE-tokens), say. (Or a Decision Transformer trained on All The Game/Robot/RL Logged Data™, or...)
On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?
I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps), and put it on my reading-list. https://www.gwern.net/docs/ai/scaling/moe/2022-01-26-eyeonai-tangjiewudaointerview.pdf suggests they’re serious about using supercomputer-scale computers (though they haven’t done so, or invested as much compute as Baidu with ERNIE Titan), but it looks like not yet, and so it’s not a major priority compared to trying to read all the papers on trained models...* (One reason I am skeptical of MoEs is that for all the Chinese investment into them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models will stumble over something like inner-monologues. Do their users show a terminal lack of imagination, is none of the interesting output getting translated or included in the papers, or are MoEs just not that great?)
* Even before Chinchilla, it was obvious that training a 1t, much less 100t, dense model to converged/compute-optimal performance is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal performance, my suspicion is that you probably shouldn’t’ve bothered in the first place, because if a 100t MoE trainable with a contemporary amount of FLOPS is the answer, then the question must be a bad one.
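To put rough numbers on the gap between a few-step demo and a real training run, here is a back-of-the-envelope sketch. It leans on two rules of thumb that are not from this thread: training compute of roughly 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla-style heuristic of about 20 training tokens per parameter for a dense model. The function names and the demo batch size are hypothetical. The estimate does not carry over directly to MoEs, where only a fraction of the parameters is active per token.

```python
# Back-of-the-envelope only: both constants are rule-of-thumb assumptions.
TOKENS_PER_PARAM = 20      # assumed Chinchilla-style ratio for dense models
FLOPS_PER_PARAM_TOKEN = 6  # assumed forward+backward cost per parameter per token


def compute_optimal_flops(n_params: float) -> float:
    """Approximate FLOPs to train a dense model of n_params compute-optimally."""
    n_tokens = TOKENS_PER_PARAM * n_params
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def demo_flops(n_params: float, steps: int, tokens_per_step: float) -> float:
    """Approximate FLOPs for a 'tech demo' run of only a few optimizer steps."""
    return FLOPS_PER_PARAM_TOKEN * n_params * steps * tokens_per_step


if __name__ == "__main__":
    for label, n in [("175b", 175e9), ("1t", 1e12), ("100t", 100e12)]:
        full = compute_optimal_flops(n)
        # Hypothetical demo: 2 steps at 1M tokens per step.
        demo = demo_flops(n, steps=2, tokens_per_step=1e6)
        print(f"{label}: full ~{full:.1e} FLOPs, "
              f"2-step demo ~{demo:.1e} FLOPs, ratio ~{full / demo:.0e}")
```

Under these assumptions the full-run cost grows quadratically with parameter count while the demo cost grows only linearly, so the gap between "trained for a step or two" and "trained to compute-optimal" widens as the model gets bigger.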