Couldn’t it just be that the intercept has been extrapolated wrongly, perhaps due to misspecification on the lower end of the scaling law?
Or, I guess, people often combine multiple scaling laws to get optimal performance as a function of compute. That introduces a lot of complexity, and I’m not sure how large the realistic errors would then be.
Well, I suppose it could be misspecification. But if the intercept itself were misestimated (despite scaling law fits usually being eerily exact), is there some reason the error would usually run in the direction of underestimating it, and badly enough that we could actually be near perfect performance with the divergence becoming noticeable? It seems like a fit could just as easily overestimate the intercept and produce spuriously good-looking performance as later models ‘overperform’.
I suppose that is logical enough.
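To make the direction-of-error point concrete, here is a rough toy sketch, not anyone's actual fitting procedure. It assumes "intercept" means the irreducible-loss floor E in a law of the form L(C) = E + A * C^(-alpha); the constants, the compute range, and the helper names (`true_loss`, `fit_power_law`, `extrapolation_gap`) are all made up for illustration. It fits the power-law part on small-compute points with a fixed guess for the floor, then compares the extrapolation against the "true" curve at much larger compute.

```python
import math

# Hypothetical "true" law, chosen only for illustration:
#   L(C) = TRUE_FLOOR + A * C**(-ALPHA)
TRUE_FLOOR, A, ALPHA = 1.7, 10.0, 0.3


def true_loss(compute):
    return TRUE_FLOOR + A * compute ** (-ALPHA)


def fit_power_law(computes, losses, floor_guess):
    """Least-squares line through log(loss - floor_guess) vs. log(compute).

    Returns (log_a, slope) so that the fitted curve is
    floor_guess + exp(log_a) * compute**slope.
    """
    xs = [math.log(c) for c in computes]
    ys = [math.log(loss - floor_guess) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def extrapolation_gap(floor_guess, target_compute=1e8):
    """True loss minus predicted loss at target_compute.

    Positive: the real model does worse than the extrapolation promised.
    Negative: the real model appears to 'overperform'.
    """
    # Fit only on the low-compute end (arbitrary units), then extrapolate.
    computes = [10.0 ** k for k in range(5)]
    losses = [true_loss(c) for c in computes]
    log_a, slope = fit_power_law(computes, losses, floor_guess)
    predicted = floor_guess + math.exp(log_a + slope * math.log(target_compute))
    return true_loss(target_compute) - predicted


for guess in (1.2, 1.7, 2.2):
    print(f"floor guess {guess}: true - predicted = {extrapolation_gap(guess):+.4f}")
```

In this toy setup the sign of the gap simply follows the sign of the error in the floor: guess too low and later models fall short of the extrapolation, guess too high and they look like they overperform. Nothing in the arithmetic itself favors one direction, which is the symmetry being pointed at above.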