“Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn’t scale, I’d expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.”
I thought GPT-3 was the canonical example of a model type that people are worried will scale? (i.e. it’s discussed in https://www.gwern.net/Scaling-hypothesis?)
I think GPT was a model people expected to scale with the number of parameters, not computing power.
You need to scale computing power to actually train those parameters. If you just increase the parameter count without increasing the petaflop/s-days according to the compute-optimal scaling law relating compute and parameters, you will train the larger model for too few steps, and it’ll end up with a worse (higher) loss than if you had trained a smaller model.
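As a concrete illustration (mine, not anything from the thread): assume the common approximation that training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Kaplan-style parametric loss L(N, D) = E + A/N^a + B/D^b with made-up constants. The point is only the shape: at a fixed compute budget, past some point the bigger model sees too few tokens and ends up with a higher loss.

```python
# Minimal sketch: why adding parameters without adding compute hurts.
# Assumes C ~ 6*N*D FLOPs and a Kaplan-style parametric loss with purely
# illustrative constants -- the shape is the point, not the numbers.

def tokens_for_budget(C_flops, N_params):
    """Tokens you can afford at a fixed compute budget: D ~ C / (6N)."""
    return C_flops / (6 * N_params)

def loss(N_params, D_tokens, E=1.7, A=400.0, a=0.34, B=410.0, b=0.28):
    """Illustrative parametric loss: more parameters and more tokens both help."""
    return E + A / N_params**a + B / D_tokens**b

C = 1e21  # fixed compute budget in FLOPs (arbitrary)
for N in [1e8, 1e9, 1e10, 1e11]:
    D = tokens_for_budget(C, N)
    print(f"N={N:.0e}  D={D:.1e} tokens  loss={loss(N, D):.3f}")
# At fixed C, the loss first falls and then *rises* as N grows: the 1e11-param
# model is trained on so few tokens that it ends up worse than a smaller model.
```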
What I think Eliezer is referring to there is runtime scaling: AlphaZero/MuZero can play more games or plan further out over the game tree, so you can dump arbitrarily more compute into them to get an arbitrarily better game player*; even AlphaFold can, to some degree, run through many iterations (because it just repeatedly refines an alignment), although I haven’t heard of this being particularly useful, so Eliezer is right to question whether AF2 is as good an example as AlphaZero/MuZero there.
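For concreteness, here is a toy sketch of that runtime-scaling dial. It is not MuZero (no learned network, no PUCT tree search), just flat Monte Carlo on a miniature game of Nim, but it shows the key property: the ‘agent’ is frozen, and the only thing you change to get a stronger player is how much test-time compute you spend per move.

```python
# Flat Monte Carlo on Nim: 9 counters, players alternately take 1-3, whoever
# takes the last counter wins. `playouts_per_move` is the runtime-compute dial.

import random

def legal_moves(counters):
    return [take for take in (1, 2, 3) if take <= counters]

def playout_winner(counters, player_to_move):
    """Play uniformly random moves to the end; return the winning player (0 or 1)."""
    last_mover = 1 - player_to_move
    while counters > 0:
        take = random.choice(legal_moves(counters))
        counters, last_mover, player_to_move = (counters - take,
                                                player_to_move, 1 - player_to_move)
    return last_mover

def choose_move(counters, player, playouts_per_move):
    """Estimate each move's win rate with random playouts and pick the best."""
    win_rates = {}
    for move in legal_moves(counters):
        wins = sum(playout_winner(counters - move, 1 - player) == player
                   for _ in range(playouts_per_move))
        win_rates[move] = wins / playouts_per_move
    return max(win_rates, key=win_rates.get)

# Same frozen "agent", two very different runtime budgets: with 10 playouts per
# move the choice is noisy; with 10,000 it reliably takes 1, leaving the opponent
# a multiple of 4 (the game-theoretically correct move).
print(choose_move(9, 0, playouts_per_move=10))
print(choose_move(9, 0, playouts_per_move=10_000))
```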
Now, as it happens, I disagree there about GPT-3 being fixed in power once it’s trained. It’s true that there is no obvious way to use GPT-3 for rapid capability amplification by planning like MuZero. But that may reflect just the weirdness of trying to use natural language prompting & induced capabilities for planning, compared to models explicitly trained on planning tasks. This is why I keep saying: “sampling can prove the presence of knowledge, but not the absence”.
What we see with GPT-3 is that people keep finding better ways to tickle better decoding and planning out of it, showing that it had those capabilities all along:

- Best-of sampling gives an immediate large gain in completion quality or correctness (sketched in code below).
- You can chain prompts.
- Codex/Copilot can use compilers/interpreters as oracles to test a large number of completions and keep the most valid ones.
- LaMDA has a nice trick of requiring text-style completions to use a weird token like ‘{’ and filtering the output automatically.
- Step-by-step dialogue/explanation as an ‘inner monologue’ leads to large gains on math word problems in GPT-3, and gains in code generation in LaMDA.
- GPT-3 can summarize entire books recursively with a bit of RL finetuning, so hierarchical summarization/generation of very long text sequences is now possible.
- Foreign-language translation can be boosted dramatically in GPT-3 using self-distillation/critique, with no additional real data, as can symbolic knowledge graphs, one of the last bastions of GOFAI symbolic approaches…

(Ought was experimenting with goal factoring and planning; I dunno how that’s going.) Many of these gains are large both relatively and absolutely, often putting the runtime-boosted model at SOTA or giving it a level of performance that would probably require scaling multiple orders of magnitude more before a single forward pass could compete.
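To illustrate the two simplest of those tricks in code, best-of sampling and the compiler/interpreter-as-oracle filter: a hedged sketch, where `sample_completion` and `log_likelihood` are hypothetical stand-ins for whatever model API you have, not any particular library’s interface.

```python
# Best-of-k sampling and Codex/Copilot-style oracle filtering (running candidate
# programs against tests). The model API here is hypothetical.

import subprocess
import tempfile

def best_of_k(prompt, k, sample_completion, log_likelihood):
    """Draw k samples and keep the one the model itself scores highest."""
    candidates = [sample_completion(prompt) for _ in range(k)]
    return max(candidates, key=lambda c: log_likelihood(prompt, c))

def filter_by_tests(candidate_programs, test_code):
    """Keep only candidates that run and pass the given tests (sandbox this in practice!)."""
    passing = []
    for program in candidate_programs:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=10)
            ok = (result.returncode == 0)
        except subprocess.TimeoutExpired:
            ok = False
        if ok:
            passing.append(program)
    return passing

# Neither trick touches the weights: both just spend more runtime compute (more
# samples, more test executions) to pull a better answer out of a frozen model.
```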
So, GPT-3 may not make it as easy to dump in runtime compute as a game player does, but I think there are enough proofs of concept at this point that your default expectation should be that future, larger models will have even more runtime planning capabilities.
* although note that there is yet another scaling-law tradeoff, between training and runtime tree search/planning: the former is more expensive, but the latter scales worse, so if you are doing too much planning, it makes more sense to train the base model more instead. This is probably true of language models too, given the ‘capability jumps’ in the scaling curves for tasks: at some point, trying to brute-force a task by running GPT-3-1b hundreds or thousands of times is just not gonna work compared to running GPT-3-175b, which ‘gets it’ via meta-learning, once or a few times.
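A sketch of that train-vs-search tradeoff with purely made-up numbers (not taken from any paper): suppose playing strength grows roughly linearly in log training compute and log per-move search compute, with a worse slope for search, and ask which mix reaches a target strength most cheaply given how many moves you will actually play at deployment.

```python
# Illustrative train-vs-search tradeoff. The slopes and budgets are invented;
# only the qualitative conclusion matters.

import math

def strength(train_flops, search_flops_per_move,
             a=120.0, b=60.0):  # made-up slopes: search compute pays off less
    return a * math.log10(train_flops) + b * math.log10(search_flops_per_move)

def cheapest_mix(target, n_moves_deployed):
    """Brute-force the cheapest (train, per-move search) split that hits `target`."""
    best = None
    for train_exp in range(15, 25):        # 1e15 .. 1e24 FLOPs of training
        for search_exp in range(6, 15):    # 1e6 .. 1e14 FLOPs per move
            train, search = 10.0**train_exp, 10.0**search_exp
            if strength(train, search) < target:
                continue
            total = train + n_moves_deployed * search
            if best is None or total < best[0]:
                best = (total, train, search)
    return best

# With these made-up numbers: at low deployment volume, heavy per-move search on
# a lightly-trained model is cheapest; at high volume, the cheapest mix shifts
# compute into training ("train the base model more instead").
print(cheapest_mix(target=3000, n_moves_deployed=1_000))
print(cheapest_mix(target=3000, n_moves_deployed=1_000_000_000))
```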