ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms.
Yeah, sure, maybe. Outside views only go so far :-)
I concede that even if an evolution-like approach were objectively the best way to build wing-flapping robots, those roboticists probably would not think to actually do that, whereas it probably would occur to ML researchers.
(For what it’s worth—and I don’t think you were disagreeing with this—I would like to emphasize that there have been important changes in outer algorithms too, like the invention of Transformers, BatchNorm, ResNets, and so on, over the past decade, and I expect there to be more such developments in the future. This is in parallel with the ongoing work of scaling-up-the-algorithms-we’ve-already-got, of course.)
you would be classifying the model learned by GPT-3 as a learning algorithm, since it can read a 1000 word article and then have a better understanding of the subject matter that it can use to e.g. answer questions or write related text.
I agree that there’s a sense in which, as GPT-3 goes through its 96 layers, you could say it’s sorta “learning things” in earlier layers and “applying that knowledge” in later layers. I actually had a little discussion of that in an earlier version, but I cut it out because the article was already very long, and I figured I was already talking in detail about my thoughts on GPT-3 in the subsequent section. The upshot there is that I don’t see the GPT-3 trained model as belonging to the category of “the right type of learning algorithm to constitute an AGI by itself” (i.e., without some kind of fine-tuning-as-you-go system) (see “Case 5”). I put a little caveat back in. :-D
I don’t think the slowdown is necessarily very large, though I’m not sure exactly what you are claiming. In particular, you can pick a neural network architecture that maps well onto the most efficient hardware that you can build, and then learn how to use the operations that can be efficiently carried out in that architecture. … You could ask the question formally in specific computational models
Suppose that the idea of tree search had never occurred to any human, and someone programs a learning algorithm with nothing remotely like tree search in it, and then the black box has to “invent tree search”. Or the black box has to “invent TD learning”, or “invent off-policy replay learning”, and so on. I have a hard time imagining this working well.
Like, for tree search, you need to go through this procedure where you keep querying the model, keep track of where you are, play through some portion of an imaginary game, and then go back and update the model at the end. Can a plain LSTM be trained in such a way that it will start internally doing something equivalent to tree search? If so, how inefficient will it be? That’s where I’m assuming “orders of magnitude”. It seems to me that a plain LSTM isn’t doing the right type of operations to run a tree search algorithm, except in the extreme case that looks something like “a plain LSTM emulating a Turing machine that’s doing tree search”.
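(To gesture at the kind of bookkeeping I have in mind, here’s a toy sketch in ordinary Python, rather than something encoded in a trained model’s weights. It’s not any particular published algorithm, and the little “count to exactly 10” game is made up purely for illustration; the point is just the shape of the loop: query the model on imagined successor states, keep track of where you are, finish an imaginary playthrough, and only then go back and update the model.)

```python
# Toy sketch of the tree-search-ish loop described above (illustrative only).
from collections import defaultdict

value = defaultdict(float)                        # the "model": state -> estimated value
legal_moves = lambda s: [1, 2] if s < 10 else []  # toy game: add 1 or 2 per move
apply_move = lambda s, m: s + m
outcome = lambda s: 1.0 if s == 10 else 0.0       # landing on exactly 10 wins

def imagined_playthrough(state):
    """Greedy lookahead: query the model on each imagined successor, take the best-looking one."""
    trajectory = [state]
    while legal_moves(state):
        successors = [apply_move(state, m) for m in legal_moves(state)]
        state = max(successors, key=lambda s: value[s])   # keep querying the model
        trajectory.append(state)                          # keep track of where we are
    return trajectory

def search_and_update(start_state, lr=0.2):
    """Play through part of an imaginary game, then go back and update the model at the end."""
    trajectory = imagined_playthrough(start_state)
    result = outcome(trajectory[-1])
    for s in trajectory:
        value[s] += lr * (result - value[s])   # nudge each visited state toward the result

for _ in range(200):   # repeat; the value estimates settle toward the imagined outcomes
    search_and_update(0)
```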
Likewise with replay learning—you need to store an unstructured database of a bunch of play-throughs, and then go back and replay them and learn from them when appropriate. Can a plain LSTM do that? Sure, it’s Turing-complete; it can do anything. But a plain LSTM is not the right kind of computation for storing a big unstructured database of play-throughs and then replaying them when appropriate and learning from the replays.
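(Same caveat as above: this is a toy sketch of the shape of the computation, not any particular published method, and `train_step` below is a hypothetical stand-in for whatever model update you’d actually use.)

```python
# Toy sketch of off-policy experience replay (illustrative only).
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # a big unstructured pile of stored experience

def store_playthrough(transitions):
    """Dump every (state, action, reward, next_state) tuple from one play-through."""
    replay_buffer.extend(transitions)

def replay_and_learn(train_step, batch_size=32, n_batches=10):
    """Later on (and off-policy), pull old experience back out and learn from it."""
    if len(replay_buffer) < batch_size:
        return
    for _ in range(n_batches):
        batch = random.sample(replay_buffer, batch_size)   # replay old play-through data
        train_step(batch)                                  # e.g. one gradient update
```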
I agree that this could be investigated in more detail, for example by asking how badly a plain LSTM architecture would struggle to implement something equivalent to tree search, or off-policy replay learning, or TD learning, or whatever.
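(And for concreteness, since I keep mentioning TD learning: the thing the black box would have to stumble into is an update rule like the one below. This is tabular TD(0) on the standard little random-walk toy problem, included purely for illustration.)

```python
# Minimal sketch of tabular TD(0) on a 5-state random walk (illustrative only).
import random
from collections import defaultdict

V = defaultdict(float)        # state -> value estimate
alpha, gamma = 0.1, 1.0       # learning rate, discount factor

def td0_episode(n_states=5):
    """Random walk from the middle; reward 1 for exiting on the right, 0 on the left."""
    s = n_states // 2
    while 0 <= s < n_states:
        s_next = s + random.choice([-1, 1])
        done = not (0 <= s_next < n_states)
        reward = 1.0 if (done and s_next == n_states) else 0.0
        target = reward if done else reward + gamma * V[s_next]
        V[s] += alpha * (target - V[s])     # the TD(0) update itself
        s = s_next

for _ in range(1000):         # V[s] converges toward the true right-exit probabilities
    td0_episode()
```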
Then someone might object: “Well, this is an irrelevant example; we’re not going to be using a plain LSTM as our learning algorithm. We haven’t been using plain LSTMs for years! We will use new and improved architectures.” At least, that’s what I would say! And that leads me to the idea that we’ll get AGI via people making better learning algorithms, just as people have been making better learning algorithms for years.
The problem would be solved by doing an automated search over arbitrary assembly code, since then any algorithm would be directly expressible, but I don’t think that’s feasible.
Thanks!
(See also: my discussion comment on this page about GPT-3)