gwern comments on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

gwern 17 Oct 2021 1:13 UTC
8 points

I think that this is notable because it’s the first time we’ve really seen powerful AI research orgs sharing infra like this. Typically everyone wants to do everything bespoke and make their work all on their own. This is good for branding but obviously a lot more work.

It may just be the incentives. “Commoditize your complement”.

Nvidia wants to sell GPUs, and that’s pretty much it: any services it sells are tightly coupled to the GPUs, and it doesn’t sell smartphones or banner ads. And Microsoft wants to sell MS Azure, and to a lesser extent, business SaaS, and while it has many fingers in many pies, those tails do not wag the dog. NV/MS releasing tooling like DeepSpeed, and being pragmatic about using The Pile since it exists (instead of spending scarce engineer time on building their own dataset just to have one), is consistent with that.

In contrast, Facebook, Google, Apple, Alibaba, and Baidu all sell different things, typically far more integrated into a service/website/platform, like a stack vertically integrated from the web advertising down to the NN ASICs in their in-house smartphones. Google may be unusually open in terms of releasing research, but they still won’t release the actual models trained on JFT-300M/B or web scrapes like their ALIGN, or models touching on the core business vitals like advertising, or their best models like LaMDA* or MUM or Pathways. Even academics ‘sell’ very different things than happy end users on Nvidia GPUs / MS cloud VMs: prestige, citations, novelty, secret sauces, moral high grounds. Not necessarily open data and working code.

* The split incentives lead to some strange behavior, like the current situation where there are already something like 6 notable Google-authored papers on LaMDA revealing fascinating capabilities like general text style transfer… none of which will use its name, referring to it only as “a large language model” or something. (Sometimes they’ll generously specify that the model in question is O(100b) parameters.)