I think the best explanation for it is a combo of 1 and 2. Specifically, I believe that the more intelligent behaviors only emerge in the last few bits of training loss, and thus scaling laws underestimate how valuable those later bits are. In other words, the long tail bites hard: the last few bits contain nearly all the intelligence.
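To make the "last few bits" intuition concrete, here is a minimal toy sketch (the token split and per-token losses are numbers I made up purely for illustration, not measurements): if most tokens are easy and both a weak and a strong model have nearly mastered them, the headline loss gap looks tiny, even though almost all of that gap, and almost all of the capability difference, lives in the rare hard tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 95% "easy" tokens (boilerplate, syntax) and 5% "hard" tokens,
# i.e. the ones that require reasoning/world-modelling to predict.
n_tokens = 100_000
hard = rng.random(n_tokens) < 0.05

# Assumed per-token losses in bits for a weaker and a stronger model.
# Both have nearly mastered the easy tokens; they differ mainly on the hard ones.
loss_weak = np.where(hard, 4.0, 0.60)
loss_strong = np.where(hard, 1.5, 0.55)

print(f"avg loss (weak):          {loss_weak.mean():.3f} bits/token")
print(f"avg loss (strong):        {loss_strong.mean():.3f} bits/token")
print(f"headline gap:             {loss_weak.mean() - loss_strong.mean():.3f} bits/token")
print(f"gap on hard tokens alone: {loss_weak[hard].mean() - loss_strong[hard].mean():.3f} bits/token")
```

With these assumed numbers the overall gap is a fraction of a bit per token, while the gap on the hard tokens is an order of magnitude larger, which is the sense in which the last few bits of loss can hide most of the intelligence.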
Another explanation of my own for how normally distributed intelligence gives rise to big differences:
My point is that while intelligence itself is well approximated by a normal distribution (not perfectly; it may even be mildly log-normal), the outcomes it drives aren't well approximated by a normal distribution at all. The controlling variable, intelligence, has very small variance, but the variables it controls follow power laws or heavy-tailed log-normals, so their distributions have very high variance, often spanning multiple orders of magnitude.
And one more explanation from me:
There are also log-normal/power-law distributions: for the majority of tasks the outcomes have a heavy tail, meaning the extreme outliers perform far better than the average. This takes care of why small differences in general intelligence can lead to large differences in impact.
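A quick way to see this numerically: sample a normally distributed "intelligence" variable, feed it through a multiplicative (exponential) mapping, and the resulting impact distribution is log-normal with a huge tail. The elasticity k below is an arbitrary assumption chosen only for illustration; the qualitative picture holds for any multiplicative mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Intelligence as a standard-normal "controlling variable".
z = rng.standard_normal(1_000_000)

# Assume impact is multiplicative in intelligence: each extra standard
# deviation multiplies output by a constant factor, giving a log-normal
# outcome. k is an assumed elasticity, purely for illustration.
k = 1.5
impact = np.exp(k * z)

print(f"std of intelligence:      {z.std():.2f}")
print(f"median impact:            {np.median(impact):.2f}")
print(f"99.9th percentile impact: {np.percentile(impact, 99.9):.1f}")
print(f"share of total impact from top 1%: "
      f"{impact[impact >= np.percentile(impact, 99)].sum() / impact.sum():.0%}")
```

Under this assumed mapping, the top fraction of a percent of the intelligence distribution produces impact two orders of magnitude above the median, and the top 1% accounts for a large chunk of the total, even though the underlying intelligence variable only varies by a few standard deviations.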
More on the scaling point from Gwern here:
https://www.gwern.net/Scaling-hypothesis#why-does-pretraining-work