> Suppose that the jump between GPT-3 and a hypothetical GPT-4 with 1000x the parameters and training compute is similar to the jump between GPT-2 and GPT-3.
This assumption seems absurd to me: per the Chinchilla scaling laws, optimally training a dense 100T-parameter model would cost >2.2M× what Gopher's training run cost and require >2 quadrillion tokens. We won't be seeing optimally trained dense models with hundreds of trillions of parameters anytime soon.
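For concreteness, here's a back-of-the-envelope sketch of where numbers like these come from, assuming the usual C ≈ 6ND training-FLOPs approximation and the ~20-tokens-per-parameter Chinchilla rule of thumb (both are standard approximations I'm supplying, not figures taken from the comment above):

```python
# Back-of-the-envelope Chinchilla arithmetic (a sketch: the 6*N*D
# training-FLOPs rule and the ~20 tokens/parameter ratio are the usual
# rule-of-thumb approximations, not an exact fit).

GOPHER_PARAMS = 280e9   # Gopher: 280B parameters
GOPHER_TOKENS = 300e9   # Gopher was trained on ~300B tokens

def train_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: training compute ~= 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

# Hypothetical dense 100T-parameter model, trained Chinchilla-optimally.
target_params = 100e12
target_tokens = 20 * target_params  # ~20 tokens/parameter -> ~2e15 tokens

ratio = (train_flops(target_params, target_tokens)
         / train_flops(GOPHER_PARAMS, GOPHER_TOKENS))

print(f"tokens required: {target_tokens:.1e}")  # ~2.0e15, i.e. ~2 quadrillion
print(f"compute vs. Gopher: {ratio:.1e}x")      # ~2.4e6x, same ballpark as 2.2Mx
```

This crude version lands around 2.4M× rather than exactly 2.2M×; the gap presumably comes from using the 20:1 rule of thumb instead of a tighter fit to the Chinchilla loss curves, but either way it's millions of times Gopher's compute.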
> If we place “average human intelligence” at the level of GPT-3 (or the similarly sized open-source BLOOM model), then such an AGI can currently be bought for $120k.
I don’t understand this assumption either. GPT-3 is superhuman in some respects and subhuman in others; I don’t think it “averages out” to median human level in general.
> We won’t be seeing optimally trained dense models with hundreds of trillions of parameters anytime soon.
We won’t be seeing them Chinchilla-trained, of course, but that’s a completely different claim. Chinchilla scaling is surely suboptimal relative to whatever supersedes it, just as every scaling law before it has been, and those laws have only moved in one direction: down.