Are the straight lines from scaling laws really bending? People say they are, but maybe that is just an artefact of the cross-entropy being bounded below by the data entropy. If you subtract the data entropy, you obtain the Kullback-Leibler divergence, which is bounded below only by zero, so in a log-log plot its logarithm can keep falling toward negative infinity, whereas the raw cross-entropy has to flatten out at the data entropy. I visualized this with the help of ChatGPT:
Here, f represents the Kullback-Leibler divergence, and g the cross-entropy loss with the entropy offset.
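The same comparison can be reproduced numerically. This is only a minimal sketch: the constants `A`, `ALPHA` and `H` are made up for illustration, and `f` and `g` follow the naming above (a pure power law for the KL divergence, and the same curve plus an entropy offset for the cross-entropy).

```python
import math

# Hypothetical constants, chosen only for illustration.
A = 10.0      # power-law prefactor
ALPHA = 0.3   # power-law exponent
H = 1.5       # irreducible data entropy (the offset)

def f(x):
    """KL divergence modelled as a pure power law in x (e.g. compute)."""
    return A * x ** (-ALPHA)

def g(x):
    """Cross-entropy: the same power law plus the entropy offset."""
    return f(x) + H

def loglog_slope(fn, x, eps=1e-4):
    """Numerical slope d log fn / d log x at x, by central difference."""
    lo, hi = x * math.exp(-eps), x * math.exp(eps)
    return (math.log(fn(hi)) - math.log(fn(lo))) / (2 * eps)

xs = [10.0 ** k for k in range(0, 13, 3)]
slopes_f = [loglog_slope(f, x) for x in xs]
slopes_g = [loglog_slope(g, x) for x in xs]
for x, sf, sg in zip(xs, slopes_f, slopes_g):
    print(f"x={x:.0e}  slope f={sf:+.3f}  slope g={sg:+.3f}")
```

Under these assumptions the slope of `f` stays at `-ALPHA` everywhere, while the slope of `g` drifts toward zero as `x` grows, which is the apparent bending.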
I’ve not seen the claim that the scaling laws are bending. Where should I look?
I remember it being said on podcasts, but I don’t remember which ones, and there is a chance that it was never said in the sense I interpreted it.
Also, a quote from this post:
“DeepMind says that at large quantities of compute the scaling laws bend slightly, and the optimal behavior might be to scale data by even more than you scale model size. In which case you might need to increase compute by more than 200x before it would make sense to use a trillion parameters.”
That was quite a while ago, and it is not a very strongly worded claim. I think there was also evidence that Chinchilla got a constant factor wrong, and people kept discovering that you wanted a substantially larger data:parameter multiplier, which might fully account for any ‘slight bending’ back then. Bending often just means you got a hyperparameter wrong and need to tune it better. (It’s a lot easier to break scaling than to improve it, so bending in the bad direction is not too interesting, while bending in the opposite direction is much more interesting.)
Isn’t an intercept offset usually already included in the scaling laws, so that it can’t be misleading anyone? I didn’t think anyone was fitting scaling laws that allow the loss to go to exactly 0, with no intrinsic entropy.
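For concreteness, the parametric form fitted in the Chinchilla paper is one example of a scaling law with such an offset. Here $N$ is the parameter count, $D$ the number of training tokens, and $E$ the irreducible term playing the role of the data entropy; the fitted constants are left out:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Only the two power-law terms can keep shrinking toward zero; $L$ itself levels off at $E$.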
Couldn’t it just be that the intercept has been extrapolated wrongly, perhaps due to misspecification on the lower end of the scaling law?
Or I guess often people combine multiple scaling laws to get optimal performance as a function of compute. That introduces a lot of complexity and I’m not sure where that puts us as to realistic errors.
Well, I suppose it could be misspecification. But if the intercept itself were misestimated (despite the scaling law fits usually being eerily exact), is there some reason the error would usually be in the direction of underestimating it, and badly enough that we could actually be near perfect performance with the divergence becoming noticeable? It seems like a fit could just as easily overestimate the intercept and produce spuriously good-looking performance as later models ‘overperform’.
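A rough way to see that the error can go either way is to subtract a wrong intercept from a known curve and look at the log-log slope of what is left. This is a toy sketch with invented constants (`A`, `ALPHA`, `E_TRUE` and the three candidate intercepts are all assumptions), not a fit to any real data.

```python
import math

# Hypothetical ground truth, chosen only for illustration.
A, ALPHA, E_TRUE = 10.0, 0.3, 1.5

def loss(x):
    """True loss: a power law plus the true irreducible term."""
    return A * x ** (-ALPHA) + E_TRUE

def residual_slope(x, e_hat, eps=1e-4):
    """Log-log slope of (loss - e_hat), the 'reducible' loss under an assumed intercept."""
    lo, hi = x * math.exp(-eps), x * math.exp(eps)
    return (math.log(loss(hi) - e_hat) - math.log(loss(lo) - e_hat)) / (2 * eps)

x = 1e6  # a point far along the curve
candidates = [("underestimated", 1.4), ("exact", 1.5), ("overestimated", 1.6)]
slopes = {label: residual_slope(x, e_hat) for label, e_hat in candidates}
for label, slope in slopes.items():
    print(f"{label:>14}: slope = {slope:+.3f}")
```

In this toy setup an underestimated intercept makes the residual curve look flatter than `-ALPHA` (apparent bending), while an overestimated one makes it look steeper (apparent overperformance), so the two errors push in opposite directions.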
I suppose that is logical enough.