The parallelization discussion seems off-base to me. While it is of course important that any individual instance doesn’t run absurdly slowly, how much faster than realtime it runs isn’t that important, because you would be running many of them in parallel, no? AlphaZero trained in a few wallclock hours not by blazing through games in mere nanoseconds, but by having hundreds or thousands of actors playing through games in parallel at a reasonable speed like 0.05s per turn. Or OA5 used minibatches of millions of experiences, and GPT-3 had minibatches of millions of tokens, IIRC.
If we look at the gradient noise scale, the more complicated the ‘task’ (i.e. set of tasks), the larger the batch size you need/can use before you are just wasting compute by overly-precisely estimating the gradient for the next update. Presumably any AGI would be training on a lot of tasks as complicated as Go or English text or DoTA2 or more complicated: generative and discriminative multimodal training on text, video, and photos, DRL training on a bazillion games and procedurally-generated tasks, and so on, so the optimal minibatch size would be quite large… Unless the hardware overhang is vastly more extreme than anyone anticipates (in which case the debate would be moot for other reasons), it seems like the most plausible answer to “how much parallel hardware can my seed AGI use?” is going to be “how much ya got?”.
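For concreteness, the batch-size intuition here is the gradient noise scale of McCandlish et al 2018: the ‘simple’ noise scale is B_simple = tr(Σ)/|G|², the ratio of gradient noise to gradient signal, which roughly marks the batch size beyond which extra data per update is wasted. A minimal numpy sketch of estimating it, assuming you already have per-example gradients from somewhere (the toy arrays below are synthetic stand-ins, not real gradients):

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Rough estimate of B_simple = tr(Sigma) / |G|^2 from a matrix of
    per-example gradients with shape (n_examples, n_params).
    (Ignores the small-sample bias corrections used in the actual paper.)"""
    mean_grad = per_example_grads.mean(axis=0)                 # estimate of the true gradient G
    trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()  # summed per-parameter variance
    return trace_sigma / (np.dot(mean_grad, mean_grad) + 1e-12)

# Toy illustration: a noisier / more heterogeneous task mixture has a larger
# noise scale, i.e. it can productively absorb a larger batch per update.
rng = np.random.default_rng(0)
g_single_task = rng.normal(loc=1.0, scale=0.5, size=(256, 1000))
g_task_mixture = rng.normal(loc=0.1, scale=2.0, size=(256, 1000))
print(simple_noise_scale(g_single_task), simple_noise_scale(g_task_mixture))
```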
Grabbing all the hardware you’ve got doesn’t guarantee a fast wallclock, of course, but it’s worth noting that in the limit of (full-batch, not stochastic minibatching) gradient descent, you can generally take large steps and converge in relatively few serial iterations compared to SGD. (There are a bunch of papers on scaling CNN training to thousands of GPUs simultaneously so it converges in minutes to seconds rather than days or weeks on smaller but more efficient clusters; yesterday I saw Geiping et al 2021, whose CNN requires 3,000 serial full-batch iterations vs SGD’s 117,000 serial minibatch iterations, so hypothetically you could finish in 39x less wallclock if you had ~unlimited compute.)
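The implied arithmetic, as a quick sanity check (iteration counts from the Geiping et al 2021 comparison above; the hedge is the assumption that with enough data-parallel hardware a full-batch step costs no more wallclock than a minibatch step):

```python
sgd_serial_steps = 117_000       # serial minibatch iterations (SGD baseline)
fullbatch_serial_steps = 3_000   # serial full-batch iterations (Geiping et al 2021)

# If each full-batch step can be data-parallelized down to the wallclock of a
# single minibatch step ("how much ya got?" hardware), the serial-depth speedup is:
print(sgd_serial_steps / fullbatch_serial_steps)  # 39.0
```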
So even for an incredibly complicated family of tasks, as long as the individual instances can be run at all, the wallclock is potentially quite low because you have model parallelism out the wazoo within and across all of the tasks & modalities & problems, and only need to take relatively few serial updates.
Thanks, that’s really helpful. I’m going to re-frame what you’re saying in the form of a question:
The parallel-experiences question:
Take a model which is akin to an 8-year-old’s brain. (Assume we deeply understand how the learning algorithm works, but not how the trained model works.) Now we make 10 identical copies of that model. For the next hour, we tell one copy to read a book about trucks, and we tell another copy to watch a TV show about baking, and we tell a third copy to build a sandcastle in a VR environment, etc. etc., all in parallel.
At the end of the hour, is it possible to take ALL the things that ALL ten copies learned, and combine them into one model—one model that now has new memories/skills pertaining to trucks AND baking AND sandcastles etc.—and it’s no worse than if the model had done those 10 things in series?
What’s the answer to this question?
Here are three possibilities:
How an ML practitioner would probably answer this question: I think they would say “Yeah, duh, we’ve been doing that in ML since forever.” (See the sketch below these three possibilities for the kind of thing they’d mean.) For my part, I do see this as some evidence, but I don’t see it as definitive evidence, because the premise of this post (see Section 1) is that the learning algorithms used by ML practitioners today are substantially different from the within-lifetime learning algorithm used in the brain.
How a biologist would probably answer this question: I think they would say the exact opposite: “No way!! That’s not something brains evolved to do, there’s no reason to expect it to be possible and every reason to think it isn’t. You’re just talking sci-fi nonsense.”
(Well, they would acknowledge that humans working on a group project could go off and study different topics, and then talk to each other and hence teach each other what they’ve learned. But that’s kind of a different thing from what we’re talking about here. In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don’t see teams-of-AIs-who-talk-to-each-other being all that helpful in getting to superhuman faster.)
How I would answer this question: Well I hadn’t thought about it until now, but I think I’m in between. On the one hand, I do think there are some things that need to be learned serially in the human brain learning algorithm. For example, there’s a good reason that people learn multiplication before exponentiation, and exponentiation before nonabelian cohomology, etc. But if the domains are sufficiently different, and if we merge-and-re-split frequently enough, then I’m cautiously optimistic that we could do parallel experiences to some extent, in order to squeeze 30 subjective years of experience into <30 serial subjective years of experience. How much less than 30, I don’t know.
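For concreteness on the first (“ML practitioner”) possibility: the standard move is to let each copy compute updates on its own data and then merge, either by averaging gradients every step (data-parallel SGD) or by averaging weights after a round of local training (FedAvg-style). A minimal weight-averaging sketch; everything here (the toy “model”, the placeholder local-training rule) is made up for illustration, and nothing about it claims that the brain-like within-lifetime algorithm in the post supports this kind of merge:

```python
import numpy as np

def local_training(params, experience, lr=0.01):
    """Placeholder for 'one copy spends an hour on trucks / baking / sandcastles':
    nudges each parameter toward a statistic of that copy's own data."""
    return {k: v + lr * (experience.mean() - v) for k, v in params.items()}

def merge(copies):
    """FedAvg-style merge: average each parameter across the trained copies."""
    return {k: np.mean([c[k] for c in copies], axis=0) for k in copies[0]}

base = {"w": np.zeros(4)}                                        # the shared starting model
experiences = [np.random.default_rng(i).normal(size=100) for i in range(10)]
trained_copies = [local_training(base, e) for e in experiences]  # 10 copies, 10 experiences
one_model = merge(trained_copies)                                # back to a single model
```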
Anyway, in the article I used the biologist answer: “the human brain within-lifetime learning algorithm is not compatible with parallel experiences”. So that would be the most conservative / worst-case assumption.
I am editing the article to note that this is another reason to suspect that training might be faster than the worst-case. Thanks again for pointing that out.
The biologist answer there seems to be question-begging. What reason is there to think it isn’t? Animals can’t split and merge themselves or afford the costs or store datasets for exact replay etc, so they would be unable to do that whether or not it was possible, and so they provide zero evidence about whether their internal algorithms would be able to do it. You might argue that there might be multiple ‘families’ of algorithms all delivering animal-level intelligence, some of which are parallelizable and some not, and for lack of any incentive animals happened to evolve a non-parallelizable one, but this is pure speculation and can’t establish that the non-parallelizable one is superior to the others (much less is the only such family).
From the ML or statistics view, it seems hard for parallelization in learning to not be useful. It’s a pretty broad principle that more data is better than less data. Your neurons are always estimating local gradients with whatever local learning rule they have, and these gradients are (extremely) noisy, and can be improved by more datapoints or rollouts to better estimate the update that jointly optimizes all of the tasks; almost by definition, this seems superior to getting less data one point at a time and doing noisy updates neglecting most of the tasks.
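The statistical core of that point is just variance reduction of the gradient estimate: averaging n per-datapoint (or per-rollout) estimates shrinks the noise around the true joint-task gradient by a factor of n, assuming the estimates are unbiased and roughly independent:

$$\hat{g}_n = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad \mathbb{E}[\hat{g}_n] = g, \qquad \operatorname{Var}[\hat{g}_n] = \frac{\sigma^2}{n}$$

where each $g_i$ is one noisy local estimate of the true gradient $g$ with variance $\sigma^2$. Ten copies gathering experience in parallel give you ten times as many $g_i$ per unit of wallclock.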
If I am a DRL agent and I have n hypotheses about the current environment, why am I harmed by exploring all n in parallel with n copied agents, observing the updates, and updating my central actor with them all? Even if they don’t produce direct gradients (let’s handwave an architecture where somehow it’d be bad to feed them all in directly, maybe it’s very fragile to off-policyness), they are still producing observations I can use to update my environment model for planning, and I can go through them and do learning before I take any more actions. (If you were in front of a death maze and were watching fellow humans run through it and get hit by the swinging blades or acid mists or ironically-named boulders, you’d surely appreciate being able to watch as many runs as possible by your fellow humans rather than yourself running it.)
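A minimal sketch of the “n copied agents, one central learner” pattern described here; this is just the generic shared-replay setup (à la distributed actor/learner systems), with `env`, `policy`, and `learner` left as hypothetical placeholder interfaces rather than any real library’s API:

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared pool: every copy's observations become usable for the central update."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)
    def add(self, transition):
        self.buf.append(transition)
    def sample(self, k):
        return random.sample(list(self.buf), min(k, len(self.buf)))

def rollout(env, policy, hypothesis, steps=100):
    """One copied agent explores under one hypothesis about the environment."""
    obs, transitions = env.reset(), []
    for _ in range(steps):
        action = policy.act(obs, hypothesis)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return transitions

def explore_in_parallel_then_update(make_env, policy, hypotheses, learner, buffer):
    """Run one copy per hypothesis, pool their experience, then do the central
    update before taking any further 'real' actions."""
    for h in hypotheses:                        # in practice these rollouts run concurrently
        for tr in rollout(make_env(), policy, h):
            buffer.add(tr)
    learner.update(buffer.sample(4096))         # one central update from the pooled data
```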
In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don’t see teams-of-AIs-who-talk-to-each-other being all that helpful in getting to superhuman faster.
If we look at some of these algorithms, it’s even less compelling to argue that there’s some deep intrinsic reason to lock learning into small serial steps: look at expert iteration in AlphaZero, where the improved estimates that the NN is repeatedly retrained on don’t even come from the NN itself, but from an ‘expert’ (e.g. the NN + tree search); what would we gain by ignoring the expert’s provably superior board-position evaluations (which would beat the NN if they played) and forcing serial learning? At the very least, given how good MuZero/AlphaZero are, this serial biological learning process, whatever it may be, has failed to produce results superior to parallelized learning, which raises the question of exactly what circumstances yield these benefits that supposedly require serial learning…
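The expert-iteration loop being referred to, schematically (the callables here are placeholders, not a real AlphaZero implementation; the point is just that the retraining targets come from the search “expert”, and the expert games are embarrassingly parallel):

```python
def expert_iteration(net, self_play_game, mcts_search, train, n_iters=100, games_per_iter=1_000):
    """Schematic AlphaZero-style expert iteration.
    net: current policy/value network.
    mcts_search: the 'expert' = net + tree search, which produces better move
        probabilities and outcomes than the raw net.
    self_play_game: plays one game with the expert, returning a list of
        (state, search_policy, outcome) tuples.
    train: gradient updates of net toward the expert's targets."""
    for _ in range(n_iters):
        # Self-play games are independent: run them on as many actors as you have.
        games = [self_play_game(net, mcts_search) for _ in range(games_per_iter)]
        examples = [ex for game in games for ex in game]
        net = train(net, examples)      # retrain on the expert-improved targets
    return net
```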
The biologist answer there seems to be question-begging
Yeah, I didn’t bother trying to steelman the imaginary biologist. I don’t agree with them anyway, and neither would you.
(I guess I was imagining the biologist belonging to the school of thought (which again I strongly disagree with) that says that intelligence doesn’t work by a few legible algorithmic principles, but is rather a complex intricate Rube Goldberg machine, full of interrelated state variables and so on. So we can’t just barge in and make some major change in how the step-by-step operations work, without everything crashing down. Again, I don’t agree, but I think something like that is a common belief in neuroscience/CogSci/etc.)
it seems hard for parallelization in learning to not be useful … why am I harmed …
I agree with “useful” and “not harmful”. But an interesting question is: Is it SO helpful that parallelization can cut the serial (subjective) time from 30 years to 15 years? Or what about 5 years? 2 years? I don’t know! Again, I think at least some brain-like learning has to be serial (e.g. you need to learn about multiplication before nonabelian cohomology), but I don’t have a good sense for just how much.
We’ve decoded much of the brain, but it’s still mysterious what the brain’s backprop-equivalent learning algorithm is, and how it seems to learn so quickly at batch size 1, sidestepping all these gradient-noise considerations.
A human may read/hear/think on the order of a billion-ish words per lifetime, or less. GPT-3 trained on a few OOM more, and would still require many OOM more compute/data to hit human performance. DeepMind’s Atari agents need about 10^8 frames to match humans and are thus roughly ~3 OOM less data-efficient, ignoring human pretraining (true also for EfficientZero; it just uses simulated frames).
Although if you factor in 10 years of human pretraining, that’s roughly 3×10^8 seconds, so perhaps a big chunk of the gap is just generic multimodal curriculum pretraining.
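Back-of-envelope for the figures in the last two paragraphs (the word and frame counts are the rough numbers quoted above, not precise measurements; GPT-3’s ~300B training tokens and Atari’s 60 frames/sec are the standard figures):

```python
human_lifetime_words = 1e9              # "a billion-ish words per lifetime or less"
gpt3_training_tokens = 3e11             # GPT-3 saw roughly 300B tokens

atari_frames = 1e8                      # frames for a DeepMind Atari agent to match humans
atari_hours = atari_frames / 60 / 3600  # ~460 hours of game experience at 60 fps

ten_years_seconds = 10 * 365 * 24 * 3600  # human "pretraining" budget, ~3.2e8 s

print(f"GPT-3 tokens / lifetime words: {gpt3_training_tokens / human_lifetime_words:.0f}x")
print(f"Atari agent experience: {atari_hours:.0f} hours of play")
print(f"10 years of pretraining: {ten_years_seconds:.1e} seconds")
```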