[4] Lukas Finnveden points out that Gwern’s extrapolation is pretty weird. Quoting Lukas: “Gwern takes GPT-3’s current performance on lambada; assumes that the loss will fall as fast as it does on ‘predict-the-next-word’ (despite the fact that the lambada loss is currently falling much faster!) and extrapolates current performance (without adjusting for the expected change in scaling law after the crossover point) until the point where the AI is as good as humans (and btw we don’t have a source for the stated human performance).
I’d endorse a summary more like ‘If progress carries on as it has so far, we might just need ~1e27 FLOP to get to mturk-level of errors on the benchmarks closest to GPT-3’s native predict-the-next-word game. Even if progress on these benchmarks slowed down and improved at the same rate as GPT-3’s generic word-prediction abilities, we’d expect it to happen at ~1e30 FLOP for the lambada benchmark.’”
All that being said, Lukas’ own extrapolation seems to confirm the general impression that GPT’s performance will reach human level around the same time its size reaches brain size: “Given that Cotra’s model’s median number of parameters is close to my best guess of where near-optimal performance is achieved, the extrapolations do not contradict the model’s estimates, and constitute some evidence for the median being roughly right.”
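To make the shape of the disagreement concrete, here is a minimal sketch of the kind of extrapolation being debated: assume benchmark loss falls as a power law in training compute, fit it to one observed point, and solve for the compute at which loss would reach a “human-level” target under two different assumed rates of improvement. All numbers and the exponents below are hypothetical placeholders, not values from Gwern’s or Lukas’ analyses.

```python
# Sketch only: assume loss follows a power law in training compute,
#     L(C) = a * C**(-alpha),
# and extrapolate from one observed (compute, loss) point to a target loss.
# Every numeric value here is illustrative, not taken from the source.

def compute_for_target_loss(current_compute, current_loss, target_loss, alpha):
    """Fit L(C) = a * C**-alpha through one point, then invert for the
    compute C at which the loss would equal target_loss."""
    a = current_loss * current_compute ** alpha   # prefactor from the observed point
    return (a / target_loss) ** (1.0 / alpha)     # solve target_loss = a * C**-alpha

current_compute = 3e23   # hypothetical FLOP used to train the current model
current_loss = 1.0       # hypothetical current benchmark loss
target_loss = 0.5        # hypothetical "human-level" loss on the benchmark

# A faster benchmark-specific improvement rate vs. a slower generic
# word-prediction rate (exponents chosen only to illustrate the gap).
for label, alpha in [("benchmark-specific rate", 0.10),
                     ("generic word-prediction rate", 0.05)]:
    c = compute_for_target_loss(current_compute, current_loss, target_loss, alpha)
    print(f"{label}: ~{c:.1e} FLOP to reach the target loss")
```

With these made-up inputs the faster rate lands around 1e26–1e27 FLOP and the slower rate around 1e29–1e30 FLOP, which is the rough pattern behind the ~1e27 vs. ~1e30 FLOP figures in the quoted summary: the assumed rate of loss improvement, not the current performance, does most of the work.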