This paper seems to be arguing that variance initially increases as network width goes up, then starts decreasing for very large networks, suggesting that overall variance is likely to decrease as we approach more advanced AI systems and networks get very large.
‘Variance’ is used in an amusing number of ways in these discussions. You use ‘variance’ in one sense (the bias-variance tradeoff), but “Explaining Neural Scaling Laws”, Bahri et al 2021 talks about a different kind of variance limit in scaling, while “Learning Curve Theory”, Hutter 2021's toy model provides statements on yet other kinds of variances about scaling curves themselves (and I think you could easily dig up a paper from the neural tangent kernel people about scaling approximating infinite width models which only need to make infinitesimally small linear updates or something like that because variance in a different sense goes down...). Meanwhile, my original observation was about the difficulty of connecting benchmarks to practical real-world capabilities: regardless of whether the ‘variance of increases in practical real-world capabilities’ goes up or down with additional scaling, we still have no good way to say that an X% increase on benchmarks ought to yield qualitatively new capability Y. Almost a year later, still no one has shown how you would have predicted in advance that pushing GPT-3 to a particular likelihood loss would yield all these cool new things. As we cannot predict that at all, it would not be of much use to say whether that variance increases or decreases as we continue scaling (since either way, we may wind up being surprised).