Deep learning/backprop has way more people devoted to improving its efficiency than Hebbian learning.
Those 100x slowdown results were for a Hebbian learner trying to imitate backprop, not learn as efficiently as possible.
Why would training GPT-3 on its own output improve it at all? Scaling laws indicate there’s only so much that more training data can do for you, and artificial data generated by GPT-3 would have worse long-term coherence than real data.
Why would training GPT-3 on its own output improve it at all?
Self-distillation is a thing, even outside a DRL setting. (“Best-of” sampling is similar to self-distillation in being a way to get better output out of GPT-3 using just GPT-3.) There’s also an issue of “sampling can prove the presence of knowledge but not the absence” in terms of unlocking abilities that you haven’t prompted—in a very timely paper yesterday, OA demonstrates that GPT-3 models have much better translation abilities than anyone realized, and you can train on its own output to improve its zero-shot translation to English & make that power accessible: “Unsupervised Neural Machine Translation with Generative Language Models Only”, Han et al 2021.
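In code, best-of sampling is just: draw several candidates from the same model and keep the one the model itself scores highest. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for an LM's sampling and scoring calls (the demo at the bottom only uses toy random strings):

```python
import random


def best_of_n(prompt, generate, score, n=8):
    """Sample n continuations of `prompt` and return the one the model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


if __name__ == "__main__":
    # Toy stand-ins: "generation" appends random words, "scoring" prefers longer words.
    words = ["the", "translation", "model", "output", "quality"]
    toy_generate = lambda p: p + " " + " ".join(random.choices(words, k=3))
    toy_score = lambda p, c: sum(len(w) for w in c.split())
    print(best_of_n("Translate:", toy_generate, toy_score))
```

Self-distillation then amounts to fine-tuning the model on the kept outputs, which is roughly how the Han et al 2021 setup makes the latent translation ability accessible.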
You generate better datasets for playing chess by making a promising move (which is hard to get right without already having trained on a good dataset) and then checking whether the outcome looks more like winning than the outcomes of the other promising moves (which is easier to verify, using blitz games played by the same model). The blitz games start out chaotic as well, not predicting the actual worth of a move very well, but with each pass of this process the dataset improves, as does the model’s ability to generate even better datasets by playing better blitz.
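Schematically, one pass of that dataset-improvement loop looks something like the sketch below; `propose_moves`, `rollout_value`, and `retrain` are hypothetical placeholders for the model's interfaces, not any particular engine's API:

```python
def improve_dataset(positions, propose_moves, rollout_value, rollouts=16):
    """One pass: label each position with the candidate move whose blitz rollouts look best."""
    dataset = []
    for pos in positions:
        candidates = propose_moves(pos)          # hard to get right without a good dataset

        def estimate(move):                      # cheap to check with blitz by the same model
            return sum(rollout_value(pos, move) for _ in range(rollouts)) / rollouts

        dataset.append((pos, max(candidates, key=estimate)))
    return dataset


def self_improve(model, positions, passes=3):
    """Alternate dataset generation and retraining; each pass sharpens both steps."""
    for _ in range(passes):
        data = improve_dataset(positions, model.propose_moves, model.rollout_value)
        model = model.retrain(data)              # hypothetical retraining step
    return model
```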
For language, this could be something like using prompts to set up additional context, generating perhaps a single token continuing some sequence, and evaluating it by continuing it to a full sentence/paragraph and then asking the system what it thinks about the result in some respect. Nobody knows how to do this well for language, that is, how to actually get better rather than just fine-tune for some aspect of what’s already there; hence the missing algorithms. (This is implicit in a lot of alignment talk; see, for example, amplification and debate.) The point for timelines is that this doesn’t incur enormous overhead.
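The language analogue would have the same loop shape, though, as said above, nobody knows how to make it actually improve the model rather than merely reshuffle it; `generate`, `continue_text`, and `self_judge` are hypothetical stand-ins for an LM API:

```python
def make_language_example(prompt, generate, continue_text, self_judge, n=8):
    """Propose n short continuations, expand each, keep the one the model itself rates best."""
    candidates = [generate(prompt) for _ in range(n)]            # e.g. a single token each
    expanded = [(c, continue_text(prompt + c)) for c in candidates]
    # Ask the same model to evaluate the expanded text in some respect (coherence, etc.)
    best_token, _ = max(expanded, key=lambda pair: self_judge(prompt, pair[1]))
    return (prompt, best_token)                                  # a candidate training pair
```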
These are all terrible arguments: the 100x slowdown for a vaguely relevant algorithm, the power of evolution, the power of more people working on backprop, the estimates of brain compute themselves. The point is that the nonlocality of backprop makes the presumed compute parity with evolved learning another terrible argument, and the 100x figure is an anchor for this aspect, which is usually not taken into account when applying estimates of brain compute to machine learning.