Backpropagation is nonlocal and might be a couple of orders of magnitude more efficient at learning than whatever it is that the brain does. Evolution wouldn’t have been able to take advantage of that, because brains are local. If that holds, it eats up most of the difference implied by estimates of brain compute.
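To make the locality point concrete, here is a toy two-layer example in numpy (just an illustration, not a claim about how the brain learns): the backprop update for the first layer needs the downstream weights and the output error, signals that are not physically present at that synapse, whereas a Hebbian update uses only the activity on its own two ends.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))         # input activity
W1 = rng.normal(size=(3, 4))        # first-layer weights
W2 = rng.normal(size=(2, 3))        # second-layer weights
h = np.tanh(W1 @ x)                 # hidden activity
y = W2 @ h                          # output
err = y - rng.normal(size=(2, 1))   # output error vs. some target

# Backprop: the first-layer update depends on W2 and err, signals that live
# elsewhere in the network (nonlocal).
dW1_backprop = ((W2.T @ err) * (1 - h**2)) @ x.T

# Hebbian: the update uses only pre- and post-synaptic activity at that layer
# (purely local).
dW1_hebbian = h @ x.T
```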
But also, the output of GPT-3 is pretty impressive; I think it’s on its own a strong argument that if it got the right datasets (the kind that are only feasible to generate automatically with a similar system) and slightly richer modalities (multiple snippets of text rather than one), it would be able to learn to maintain coherent thought. That’s the kind of missing algorithm I’m thinking about: generation of datasets and retraining on them, which might take as little as hundreds of cycles of the training needed for vanilla GPT-3 to make it stop wandering wildly off-topic, get the point more robustly, and distinguish fiction from reality. Or whatever other crucial faculties are currently in disarray but could be trained with an appropriate dataset generated with what it already has.
Deep learning/backprop has way more people devoted to improving its efficiency than Hebbian learning.
Those 100x slowdown results were for a Hebbian learner trying to imitate backprop, not learn as efficiently as possible.
Why would training GPT-3 on its own output improve it at all? Scaling laws indicate there’s only so much that more training data can do for you, and artificial data generated by GPT-3 would have worse long-term coherence than real data.
Self-distillation is a thing, even outside a DRL setting. (“Best-of” sampling is similar to self-distillation in being a way to get better output out of GPT-3 using just GPT-3.) There’s also the issue of “sampling can prove the presence of knowledge but not the absence” in terms of unlocking abilities that you haven’t prompted. In a very timely paper yesterday, OA demonstrates that GPT-3 models have much better translation abilities than anyone realized, and you can train a model on its own output to improve its zero-shot translation to English & make that power accessible: “Unsupervised Neural Machine Translation with Generative Language Models Only”, Han et al 2021.
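For concreteness, a minimal sketch of the best-of sampling / self-distillation loop, assuming three hypothetical black-box callables (`sample`, `score`, `finetune`) rather than any real API:

```python
# A sketch only. Assumed interface (all names hypothetical):
#   sample(prompt) -> str         draw one completion from the model
#   score(prompt, text) -> float  rank a completion, e.g. by the model's own judgment
#   finetune(pairs)               train the same model on (prompt, completion) pairs

def best_of_n(sample, score, prompt, n=16):
    """Keep the highest-scoring of n samples: better output from the same model."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda text: score(prompt, text))

def self_distill(sample, score, finetune, prompts, rounds=3):
    """Repeatedly finetune the model on its own selected outputs."""
    for _ in range(rounds):
        dataset = [(p, best_of_n(sample, score, p)) for p in prompts]
        finetune(dataset)
```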
You generate better datasets for playing chess by making a promising move (which is hard to get right without already having trained on a good dataset) and then seeing whether the outcome looks more like winning than it does for other promising moves (which is easier to check, with blitz games played by the same model). The blitz games start out chaotic as well, not predicting the actual worth of a move very well, but with each pass of this process the dataset improves, as does the model’s ability to generate even better datasets by playing better blitz.
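A rough sketch of that loop, with a hypothetical `policy`/`position` interface standing in for an actual chess engine (this is the shape of the idea, not any particular published algorithm):

```python
# Assumed, hypothetical interface: position.apply(move), position.is_terminal(),
# position.winner(), position.side_to_move(); policy.pick_move(pos),
# policy.top_moves(pos, k), policy.train(dataset).

def rollout_winrate(policy, position, move, games=20, max_plies=200):
    """Estimate a candidate move's worth with fast self-play ('blitz') by the same model."""
    wins = 0
    for _ in range(games):
        pos = position.apply(move)
        plies = 0
        while not pos.is_terminal() and plies < max_plies:
            pos = pos.apply(policy.pick_move(pos))
            plies += 1
        wins += pos.winner() == position.side_to_move()
    return wins / games

def improve_once(policy, positions, k=4):
    """One pass: build a better move dataset, then train the policy on it."""
    dataset = []
    for pos in positions:
        candidates = policy.top_moves(pos, k)  # the 'promising moves'
        best = max(candidates, key=lambda m: rollout_winrate(policy, pos, m))
        dataset.append((pos, best))
    policy.train(dataset)  # the next pass proposes and judges moves with a better model
```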
For language, this could be something like using prompts to set up additional context, generating perhaps a single token continuing some sequence, and evaluating it by continuing it to a full sentence/paragraph and then asking the system what it thinks of the result in some respect. Nobody knows how to do this well for language, so that the model actually gets better rather than just being finetuned for some aspect of what’s already there; hence the missing algorithms. (This is implicit in a lot of alignment talk; see for example amplification and debate.) The point for timelines is that this doesn’t incur enormous overhead.
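And the language analogue, again only a sketch under assumed interfaces (a hypothetical `lm` object with `generate`, `prob_of_yes`, and `finetune`); nobody has shown this particular recipe works, which is exactly the missing-algorithms point:

```python
# A hypothetical judgment prompt; the model scores its own extended continuation.
JUDGE = "\n\nQuestion: Does the passage above stay on topic and make sense? Answer:"

def make_example(lm, context, n_candidates=8):
    """Generate short continuations, extend each, keep the one the model itself rates highest."""
    scored = []
    for _ in range(n_candidates):
        start = lm.generate(context, max_tokens=4)          # a few tokens only
        rest = lm.generate(context + start, max_tokens=80)  # extend to a paragraph
        verdict = lm.prob_of_yes(context + start + rest + JUDGE)
        scored.append((verdict, start + rest))
    return max(scored)                                      # (score, continuation)

def one_cycle(lm, contexts, keep_fraction=0.5):
    """One of the 'hundreds of cycles': retrain on the better half of the model's own judged output."""
    judged = sorted((make_example(lm, c) + (c,) for c in contexts), reverse=True)
    keep = judged[: int(len(judged) * keep_fraction)]
    lm.finetune([(context, continuation) for _, continuation, context in keep])
```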
These are all terrible arguments: the 100x slowdown for a vaguely relevant algorithm, the power of evolution, the power of more people working on backprop, the estimates of brain compute themselves. The point is that the nonlocality of backprop makes assuming compute parity between backprop and evolved learning another terrible argument, and the 100x figure is an anchor for an aspect that usually isn’t taken into account when applying estimates of brain compute to machine learning.