Rough thoughts on how to derive a neural scaling law. I haven’t looked at any papers on this in years and only have vague memories of “data manifold dimension” playing an important role in the derivation Kaplan told me about in a talk.
How do you predict neural scaling laws? Maybe start by assuming that reality produces distributions which are intricately detailed at every scale, so that ever more sophisticated models keep getting rewarded with better predictions.
It would help to have a concrete example of such a distribution. Chaotic systems seem like natural candidates: the Lorenz attractor, say, has structure at every scale, yet its points lie near a low-dimensional set inside the ambient space, so "data manifold dimension" is a meaningful quantity for it (see the sketch below).
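A minimal sketch of that example, with illustrative names of my own choosing (lorenz_trajectory, correlation_dimension): sample points from the Lorenz attractor and estimate their intrinsic dimension with the Grassberger–Procaccia correlation-dimension method. The point is just that the data can live on a manifold of dimension ~2 even though the ambient space is 3D.

```python
import numpy as np

def lorenz_trajectory(n_steps=50000, dt=0.005, sigma=10.0, rho=28.0, beta=8/3):
    # Crude Euler integration of the Lorenz system; fine for illustration.
    xyz = np.empty((n_steps, 3))
    x, y, z = 1.0, 1.0, 1.0
    for i in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xyz[i] = (x, y, z)
    return xyz[5000:]  # drop the transient before settling onto the attractor

def correlation_dimension(points, radii, n_pairs=200_000):
    # Grassberger-Procaccia: on a fractal set, the fraction of point pairs
    # within distance r scales as C(r) ~ r^D, so the log-log slope
    # estimates the intrinsic dimension D.
    rng = np.random.default_rng(0)
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    dists = np.linalg.norm(points[i] - points[j], axis=1)
    counts = np.array([(dists < r).mean() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

pts = lorenz_trajectory()
radii = np.geomspace(0.5, 5.0, 10)
print("estimated intrinsic dimension:", correlation_dimension(pts, radii))
# typically lands near 2, well below the ambient dimension of 3
```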
Then you say that you know this stuff about the data manifold and try to prove theorems about the kinds of models that describe it. You could start with a really artificial assumption that just stipulates models of manifolds follow some scaling law. But perhaps you can relax that and instead assume something about how NNs work, e.g. that they're "just interpolating," and see what follows. That should give a scaling law tied to the dimensionality of the manifold: for a d-dimensional manifold, C times more compute buys roughly a C^{1/d} increase in precision (a toy check of this is below). Then somehow relate that to, e.g., next-token prediction.
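Here is a toy check of the "just interpolating" picture, under my own assumptions (nearest-neighbor lookup as the cartoon of interpolation, a smooth target function of my choosing): approximate a function on [0,1]^d from N samples and watch the error shrink. Typical nearest-neighbor spacing scales like N^{-1/d}, so for a Lipschitz target the error should too.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x.sum(axis=1))  # smooth target on the "manifold"

for d in (1, 2, 3):
    ns = [256, 1024, 4096, 16384]
    errs = []
    for n in ns:
        train_x = rng.random((n, d))
        train_y = f(train_x)
        test_x = rng.random((2000, d))
        # "Interpolation" cartoon: predict the value at the nearest training point.
        _, idx = cKDTree(train_x).query(test_x)
        errs.append(np.abs(f(test_x) - train_y[idx]).mean())
    # Fit the log-log slope: error ~ N^slope, expected slope ~ -1/d.
    slope, _ = np.polyfit(np.log(ns), np.log(errs), 1)
    print(f"d={d}: error ~ N^{slope:+.2f} (interpolation predicts {-1/d:+.2f})")
```

If this cartoon holds, the precision exponent degrades with the intrinsic dimension d, which is exactly the kind of dependence a manifold-based scaling law should inherit.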
To turn this C^{1/d} precision estimate into something that looks like a standard scaling law, you need to specify the loss metric and give details on what the model is actually doing with its parameters; one possible accounting is sketched below.
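One hedged way that accounting could go, assuming a squared-error loss and a model that is effectively piecewise linear over ~N cells tiling the manifold (this is roughly the argument in Sharma & Kaplan's "A Neural Scaling Law from the Dimension of the Data Manifold," which I believe is the paper behind the talk):

```latex
\begin{align*}
  s &\propto N^{-1/d}
    && \text{cell size when $N$ pieces tile a $d$-dim manifold} \\
  |f - \hat f| &\propto s^2 \propto N^{-2/d}
    && \text{piecewise-linear approximation error for smooth $f$} \\
  L &\propto |f - \hat f|^2 \propto N^{-4/d}
    && \text{squared-error loss}
\end{align*}
```

So the choice of metric matters a lot: the same N^{-1/d} resolution story yields a 4/d loss exponent under squared error and a piecewise-linear model, and other metrics or model classes would shift the exponent.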