Outside view #1: How biomimetics has always worked
It seems like ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms. None of the other domains have this property. So it wouldn’t be too surprising if the only domain in which all the early successes have this property is also the only domain in which the later successes have this property.
Outside view #2: How learning algorithms have always been developed
I don’t think this one is right. If your definition of “learning algorithm” is the kind of thing that is “able to read a book and then have a better understanding of the subject matter,” then it seems like you would be classifying the model learned by GPT-3 as a learning algorithm, since it can read a 1000-word article and then have a better understanding of the subject matter that it can use to e.g. answer questions or write related text.
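(For concreteness, here is a toy sketch of what “learning inside a single forward pass” looks like in practice. It uses GPT-2 via the Hugging Face pipeline purely as a stand-in, since GPT-3 itself isn’t an open model, and the article text and question are made up for the example.)

```python
# Toy illustration of in-context "learning": no weights are updated, yet the
# model can make use of a passage it has just been handed.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for GPT-3

article = (
    "The aye-aye is a nocturnal lemur from Madagascar. It taps on tree bark "
    "and listens for hollow sounds, then uses its long middle finger to pull "
    "out grubs."
)
prompt = article + "\n\nQ: How does the aye-aye find its food?\nA:"

# Whatever "understanding" of the article the model gains lives only in its
# activations for this one context window; nothing is written back to the weights.
print(generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"])
```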
It seems like your definition of “learning algorithm” is “an algorithm that humans understand,” and then it’s kind of unsurprising that those are the ones designed by humans. Or maybe it’s something about the context size over which the algorithm operates (in which case it’s worth engaging with the obvious trend extrapolation of learned transformers operating competently over longer and longer contexts) or the quality of the learning it performs?
Overall I think I agree that progress in meta-learning over the last few years has been weak enough, and evidence that models like GPT-3 perform competent learning on the inside, that it’s been a modest update towards longer timelines for this kind of fully end-to-end approach. But I think it’s pretty modest, and as far as I can tell the update is more like “would take more like 10^33 operations to produce using foreseeable algorithms rather than 10^28 operations” than “it’s not going to happen.”
3. Computational efficiency: the inner algorithm can run efficiently only to the extent that humans (and the compiler toolchain) generally understand what it’s doing
I don’t think the slowdown is necessarily very large, though I’m not sure exactly what you are claiming. In particular, you can pick a neural network architecture that maps well onto the most efficient hardware that you can build, and then learn how to use the operations that can be efficiently carried out in that architecture. You can still lose something but I don’t think it’s a lot.
You could ask the question formally in specific computational models, e.g. what’s the best fixed homogeneous circuit layout we can find for doing both FFT and quicksort, and how large is the overhead relative to doing one or the other? (Obviously for any two algorithms that you want to simulate, the overhead will be at most 2x, since you can always just lay the two dedicated circuits out side by side; so after finding something clean that can do both of them you’d want to look at a third algorithm. I expect that you’re going to be able to do basically any algorithm that anyone cares about with <<10x overhead.)
Thanks!

ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms.
Yeah, sure, maybe. Outside views only go so far :-)
I concede that even if an evolution-like approach was objectively the best way to build wing-flapping robots, probably those roboticists would not think to actually do that, whereas it probably would occur to ML researchers.
(For what it’s worth—and I don’t think you were disagreeing with this—I would like to emphasize that there have been important changes in outer algorithms too, like the invention of Transformers, BatchNorm, ResNets, and so on, over the past decade, and I expect there to be more such developments in the future. This is in parallel with the ongoing work of scaling-up-the-algorithms-we’ve-already-got, of course.)
you would be classifying the model learned by GPT-3 as a learning algorithm, since it can read a 1000-word article and then have a better understanding of the subject matter that it can use to e.g. answer questions or write related text.
I agree that there’s a sense in which, as GPT-3 goes through its 96 layers, you could say it’s sorta “learning things” in earlier layers and “applying that knowledge” in later layers. I actually had a little discussion of that in an earlier version, but I cut it out because the article was already very long, and I figured I was already talking in detail about my thoughts on GPT-3 in the subsequent section. The upshot is that I don’t see the GPT-3 trained model as belonging to the category of “the right type of learning algorithm to constitute an AGI by itself” (i.e. without some kind of fine-tuning-as-you-go system) (see “Case 5”). I put a little caveat back in. :-D

(See also: my discussion comment on this page about GPT-3)
I don’t think the slowdown is necessarily very large, though I’m not sure exactly what you are claiming. In particular, you can pick a neural network architecture that maps well onto the most efficient hardware that you can build, and then learn how to use the operations that can be efficiently carried out in that architecture. … You could ask the question formally in specific computational models
Suppose that the idea of tree search had never occurred to any human, and someone programs a learning algorithm with nothing remotely like tree search in it, and then the black box has to “invent tree search”. Or the black box has to “invent TD learning”, or “invent off-policy replay learning”, and so on. I have a hard time imagining this working well.
Like, for tree search, you need to go through this procedure where you keep querying the model, keep track of where you’re at, play through some portion of an imaginary game, then go back and update the model at the end. Can a plain LSTM be trained in such a way that it will start internally doing something equivalent to tree search? If so, how inefficient will it be? That’s where I’m assuming “orders of magnitude”. It seems to me that a plain LSTM isn’t doing the right type of operations to run a tree search algorithm, except in the extreme case that looks something like “a plain LSTM emulating a Turing machine that’s doing tree search”.
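For concreteness, here’s a deliberately tiny sketch of the outer-loop structure I have in mind: an ordinary program that repeatedly queries a value model, keeps track of where it is in an imagined game, and backs the values up. It’s a made-up one-player toy (the state encoding, the three-successor “game”, and the random stand-in for a learned evaluator are all just for illustration), and the “go back and update the model at the end” step is only indicated in a comment:

```python
import random

def value_model(state):
    # Stand-in for a learned position evaluator; a real system would call a
    # trained network here. Seeding by state just makes the toy deterministic.
    return random.Random(state).uniform(-1.0, 1.0)

def legal_moves(state):
    # Stand-in game/environment model: every position has three successors.
    return [state * 3 + i for i in (1, 2, 3)]

def search(state, depth):
    # Depth-limited lookahead: play through part of an imaginary game,
    # querying the value model at the leaves and backing the values up.
    if depth == 0:
        return value_model(state)
    return max(search(s, depth - 1) for s in legal_moves(state))

# The wrapper keeps track of where we are and picks the move whose subtree
# looks best. A real system would then also feed the search results back in
# as training targets to update the model (omitted here).
best_move = max(legal_moves(0), key=lambda s: search(s, depth=2))
print("move chosen by the search wrapper:", best_move)
```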
Likewise with replay learning—you need to store an unstructured database with a bunch of play-throughs, and then go back and replay them and learn from them when appropriate. Can a plain LSTM do that? Sure, it’s Turing-complete, it can do anything. But a plain LSTM is not the right kind of computation to be storing a big unstructured database of play-throughs and then replaying them when appropriate and learning from the replays.
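Again, a minimal illustrative sketch of the outer-algorithm shape I mean, with the actual learning update elided and all the particulars (buffer size, the fake interaction loop) made up:

```python
import random
from collections import deque

# The "unstructured database of play-throughs": a plain buffer of transitions.
replay_buffer = deque(maxlen=10_000)

def record(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

def learn_from_replay(batch_size=32):
    # Go back over old experience, sampled off-policy, and learn from it.
    if len(replay_buffer) < batch_size:
        return
    for state, action, reward, next_state in random.sample(list(replay_buffer), batch_size):
        pass  # a real system would do a TD-style parameter update here

# Fake interaction loop, just to show where the pieces sit in the outer loop.
for step in range(100):
    record(state=step, action=0, reward=0.0, next_state=step + 1)
    if step % 10 == 0:
        learn_from_replay()
```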
I agree that this could be investigated in more detail, for example by asking how badly a plain LSTM architecture would struggle to implement something equivalent to tree search, or off-policy replay learning, or TD learning, or whatever.
Then someone might object: “Well, this is an irrelevant example; we’re not going to be using a plain LSTM as our learning algorithm. We haven’t been using plain LSTMs for years! We will use new and improved architectures.” At least that’s what I would say! And that leads me to the idea that we’ll get AGI via people making better learning algorithms, just like people have been making better learning algorithms for years.
This efficiency problem would be solved by doing an automated search directly over assembly code rather than over the weights of a fixed neural network architecture, but I don’t think that’s feasible.