My basic take is that there will be lots of empirical examples where increasing model size by a factor of 100 leads to nonlinear increases in capabilities (and perhaps to qualitative changes in behavior). On median, I’d guess we’ll see at least 2 such examples in 2022 and at least 100 by 2030.
At the point where there’s a “FOOM”, such examples will be commonplace. FOOM will look like one particularly large phase transition (maybe 99th percentile among examples so far) that chains into more and more further transitions. It seems possible (though not certain; maybe 33%?) that once you have the right phase transition to kick off the rest, everything else happens pretty quickly (within a few days).
Is this take more consistent with Paul’s or Eliezer’s? I’m not totally sure. I’d guess closer to Paul’s, but maybe the “within a few days” world is consistent with Eliezer’s?
(One candidate for the “big” phase transition would be if the model figures out how to go off and learn on its own, so that the number of SGD updates is no longer the primary bottleneck on model capabilities. But I could also imagine us getting that even when models are still fairly “dumb”.)