I don’t think that’s likely. Rates of increase for model size are slowing. Also, the scaling power-laws for performance and parameter count we’ve seen so far suggest future progress is likely ~linear and fairly slow.
That only covers the possibilities where even more compute than is currently available is crucial. In the context of ridiculing GPT-n, that can’t even matter: with the same dataset and algorithm, more compute won’t make it an AGI. And whatever missing algorithms are necessary to do so might turn out not to make the process infeasibly more expensive.
“…whatever missing algorithms are necessary to do so might turn out to not make the process infeasibly more expensive.”
I think this is very unlikely because the human brain uses WAY more compute than GPT-3 (something like at least 1000x more on low end estimates). If the brain, optimized for efficiency by millions of years of evolution, is using that much compute, then lots of compute is probably required.
Backpropagation is nonlocal and might be a couple of orders of magnitude more efficient at learning than whatever it is the brain does. Evolution wouldn’t have been able to take advantage of that, because learning in the brain is local. If that holds, it eats up most of the difference.
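To make the locality point concrete, here is a minimal sketch (my own illustration, not from the thread) contrasting the two update rules for a tiny two-layer network: a Hebbian update for a hidden weight uses only the activity of the two units that weight connects, while the backprop update needs an error signal propagated back from the output through the downstream weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input activity
W1 = rng.normal(size=(4, 3))    # input -> hidden weights
W2 = rng.normal(size=(1, 4))    # hidden -> output weights
target = np.array([1.0])
lr = 0.01

h = np.tanh(W1 @ x)             # hidden activity
y = W2 @ h                      # output

# Hebbian update for W1: purely local, "cells that fire together wire together".
dW1_hebb = lr * np.outer(h, x)

# Backprop update for W1: needs the output error and the downstream weights W2,
# information that is not available locally at the W1 synapses.
err = y - target                        # global error at the output
delta_h = (W2.T @ err) * (1 - h ** 2)   # error propagated back through W2 and tanh
dW1_backprop = -lr * np.outer(delta_h, x)
```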
But also the output of GPT-3 is pretty impressive. I think it’s on its own a strong argument that, given the right datasets (the kind that are only feasible to generate automatically with a similar system) and slightly richer modalities (multiple snippets of text rather than one), it would be able to learn to maintain coherent thought. That’s the kind of missing algorithm I’m thinking about: generating datasets and retraining on them, which might require as little as hundreds of cycles of the training needed for vanilla GPT-3 to make it stop wandering wildly off-topic, get the point more robustly, and distinguish fiction from reality. Or whatever other crucial faculties are currently in disarray but could be trained with an appropriate dataset that can be generated with what it already has.
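A bare-bones sketch of the loop being gestured at, nothing more: the model generates a dataset, gets retrained on it, and the process repeats. `generate_dataset` and `retrain` are stand-ins the reader would have to supply, and the hundreds-of-cycles figure is the comment’s guess, not an established number.

```python
def bootstrap(model, generate_dataset, retrain, cycles=300):
    """Repeatedly let the model curate its own training data and retrain on it."""
    for _ in range(cycles):
        dataset = generate_dataset(model)   # model generates/labels its own data
        model = retrain(model, dataset)     # roughly one vanilla-GPT-3-scale run per cycle
    return model
```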
Deep learning/backprop has way more people devoted to improving its efficiency than Hebbian learning.
Those 100x slowdown results were for a Hebbian learner trying to imitate backprop, not learn as efficiently as possible.
Why would training GPT-3 on its own output improve it at all? Scaling laws indicate there’s only so much that more training data can do for you, and artificial data generated by GPT-3 would have worse long term coherence than real data.
Why would training GPT-3 on its own output improve it at all?
Self-distillation is a thing, even outside a DRL setting. (“Best-of” sampling is similar to self-distillation in being a way to get better output out of GPT-3 using just GPT-3.) There’s also an issue of “sampling can prove the presence of knowledge but not the absence” in terms of unlocking abilities that you haven’t prompted—in a very timely paper yesterday, OA demonstrates that GPT-3 models have much better translation abilities than anyone realized, and you can train on its own output to improve its zero-shot translation to English & make that power accessible: “Unsupervised Neural Machine Translation with Generative Language Models Only”, Han et al 2021.
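A minimal sketch of best-of-n sampling followed by self-distillation, assuming generic `sample`, `score`, and `finetune` callables supplied by the reader; this is not the recipe from Han et al 2021, just the general shape of “get better output out of GPT-3 using just GPT-3, then train on it.”

```python
def best_of_n(model, prompt, sample, score, n=16):
    """Draw n samples and keep the one the model itself scores highest."""
    candidates = [sample(model, prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(model, prompt, c))

def self_distill(model, prompts, sample, score, finetune, n=16):
    """Fine-tune the model on its own best-of-n outputs."""
    dataset = [(p, best_of_n(model, p, sample, score, n)) for p in prompts]
    return finetune(model, dataset)
```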
You generate better datasets for playing chess by making a promising move (which is hard to get right without already having trained on a good dataset) and then checking whether the outcome looks more like winning than it does for the other promising moves (which is easier to check, using blitz games played by the same model). The blitz games start out chaotic as well, not predicting the actual worth of a move very well, but with each pass of this process the dataset improves, as does the model’s ability to generate even better datasets by playing better blitz.
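A rough sketch of how one position gets labeled in this scheme (my construction, not anyone’s actual pipeline): propose candidate moves with the current model, evaluate each with blitz rollouts played by the same model, and keep the move whose rollouts look most like winning. `propose_moves` and `play_blitz` are reader-supplied stand-ins; `play_blitz` is assumed to return 1 for a won rollout and 0 otherwise.

```python
def label_position(model, position, propose_moves, play_blitz,
                   n_candidates=4, n_blitz=8):
    """Label one position with the candidate move whose blitz rollouts win most often."""
    best_move, best_winrate = None, -1.0
    for move in propose_moves(model, position, n_candidates):
        wins = sum(play_blitz(model, position, move) for _ in range(n_blitz))
        winrate = wins / n_blitz
        if winrate > best_winrate:
            best_move, best_winrate = move, winrate
    return (position, best_move)    # one entry of the improved dataset
```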
For language, this could be something like using prompts to set up additional context, generating perhaps a single token continuing some sequence, and evaluating it by continuing it to a full sentence/paragraph and then asking the system what it thinks about the result in some respect. Nobody knows how to do this well for language, to actually get better and not just finetune for some aspect of what’s already there, hence the missing algorithms. (This is implicit in a lot of alignment talk, see for example amplification and debate.) The point for timelines is that this doesn’t incur enormous overhead.
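The language analogue, sketched under the same caveat that nobody knows how to make this loop genuinely improve the model rather than just reinforce what is already there: generate a short continuation, expand it, have the same model judge the result, and keep only what it endorses. `generate`, `judge`, and `finetune` are reader-supplied stand-ins.

```python
def self_labeled_example(model, context, generate, judge):
    """Produce one (context, continuation) pair if the model endorses its own expansion."""
    continuation = generate(model, context, max_tokens=1)               # the candidate "move"
    paragraph = generate(model, context + continuation, max_tokens=200) # expand to a full passage
    return (context, continuation) if judge(model, paragraph) else None

def one_cycle(model, contexts, generate, judge, finetune):
    """One pass of self-labeling followed by fine-tuning on the kept examples."""
    examples = (self_labeled_example(model, c, generate, judge) for c in contexts)
    return finetune(model, [e for e in examples if e is not None])
```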
These are all terrible arguments: the 100x slowdown for a vaguely relevant algorithm, the power of evolution, the power of more people working on backprop, the estimates of brain compute themselves. The point is that the nonlocality of backprop makes assuming its compute parity with evolved learning another terrible argument, and the 100x figure is an anchor for this aspect, which is usually not taken into account when applying estimates of brain compute to machine learning.