I have a hard time saying which of the scaling laws explanations I like better (I haven’t read either paper in detail, but I think I got the gist of both). What’s interesting about Hutter’s is that the model is so simple, and doesn’t require generalization at all. I feel like there’s a pretty strong Occam’s Razor-esque argument for preferring Hutter’s model, even though it seems wildly less intuitive to me. Or maybe what I want to say is more like “Hutter’s model DEMANDS refutation/falsification”.
I think both models also are very interesting for understanding DNN generalization… I really think it goes beyond memorization and local generalization (cf. https://openreview.net/forum?id=rJv6ZgHYg), but it’s interesting that those are basically the mechanisms proposed by Hutter and Sharma & Kaplan (resp.)…
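To check that I’m picturing Hutter’s setup right, here’s the minimal memorization-only toy I have in mind (my own sketch, not code from the paper; the Zipf exponent and vocabulary size below are made up): items come from a Zipf-like distribution, the learner is only correct on items it has literally seen before, and the test error is just the probability mass of unseen items. Even that already gives a power law in the number of training examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf-like item distribution: p_i proportional to i^(-alpha) over a large "vocabulary".
alpha, vocab = 1.5, 100_000
p = np.arange(1, vocab + 1, dtype=float) ** -alpha
p /= p.sum()

def memorizer_error(n_train, n_test=20_000):
    """Test error of a pure memorizer: it is correct on a test item
    iff that exact item appeared at least once in the training sample."""
    seen = np.zeros(vocab, dtype=bool)
    seen[rng.choice(vocab, size=n_train, p=p)] = True
    test = rng.choice(vocab, size=n_test, p=p)
    return 1.0 - seen[test].mean()

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(memorizer_error(n), 4))
# The error drops by a roughly constant factor per 10x data,
# i.e. it follows a power law, with no generalization anywhere.
```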
> I feel like there’s a pretty strong Occam’s Razor-esque argument for preferring Hutter’s model, even though it seems wildly less intuitive to me.
?? Overall this claim feels to me like:
1. Observing that cows don’t float into space
2. Making a model of spherical cows with constant density ρ and showing that, as long as ρ is greater than the density of air, the cows won’t float
3. Concluding that since the model is so simple, Occam’s Razor says that cows must be spherical with constant density.
Some ways that you could refute it:
- It requires your data to be Zipf-distributed—why expect that to be true? (A rough way to even check this is sketched after this list.)
- The simplicity comes from being further away from normal neural nets—surely the one that’s closer to neural nets is more likely to be true?
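For the first point, this is roughly the kind of check I’d want to see before granting the assumption; purely a sketch, and the toy corpus below is a meaningless stand-in for whatever the real tokenized training data would be: rank the token frequencies and fit a slope on a log-log plot, where Zipf’s law corresponds to a slope near 1.

```python
from collections import Counter
import numpy as np

def zipf_exponent(tokens):
    """Fit log(frequency) ~ const - s * log(rank) and return s.
    Zipf's law corresponds to s close to 1."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy stand-in corpus; in practice you'd run this on the actual tokenized training data.
tokens = ("the cat sat on the mat and the dog sat on the cat " * 50).split()
print(zipf_exponent(tokens))
```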
> Or maybe what I want to say is more like “Hutter’s model DEMANDS refutation/falsification”.
Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter’s model doesn’t. Presumably you mean something else but idk what.
Interesting… Maybe this comes down to different taste or something. I understand, but don’t agree with, the cow analogy… I’m not sure why, but one difference is that I think we know more about cows than DNNs or something.
I haven’t thought about the Zipf-distributed thing.
> Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter’s model doesn’t. Presumably you mean something else but idk what.
I’d like to see Hutter’s model “translated” a bit to DNNs, e.g. by assuming they get anything right that’s within epsilon of a training data point or something… maybe it even ends up looking like the other model in that context…
> I’d like to see Hutter’s model “translated” a bit to DNNs, e.g. by assuming they get anything right that’s within epsilon of a training data point or something
With this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. With the d-dimensional manifold assumption from the other model, you can apply its arguments to say that the error scales as D^(-c/d) for some constant c (probably c = 1 or 2, depending on what exactly we’re quantifying the scaling of).
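To make that concrete, here’s a toy version (all details mine rather than from either paper: a smooth made-up target on the d-dimensional unit cube, brute-force 1-nearest-neighbour prediction, mean absolute error):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_error(D, d, n_test=500):
    """Mean absolute error of 1-nearest-neighbour prediction of a smooth
    target f on the d-dimensional unit cube, given D training points."""
    f = lambda x: np.sin(x.sum(axis=-1))        # arbitrary smooth target
    X_train = rng.random((D, d))
    y_train = f(X_train)
    err = 0.0
    for _ in range(n_test):
        x = rng.random(d)
        nn = ((X_train - x) ** 2).sum(axis=1).argmin()   # brute-force nearest neighbour
        err += abs(f(x) - y_train[nn])
    return err / n_test

d = 4
for D in (100, 400, 1600, 6400):
    print(D, round(one_nn_error(D, d), 4))
# Each 4x in D shrinks the error by roughly 4^(1/d), i.e. error ~ D^(-c/d)
# with c close to 1 for absolute error.
```

(Squared error would give c close to 2 instead, which is where the “1 or 2 depending on what we’re quantifying” comes from.)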
I’m not entirely sure how you’d generalize the Zipf assumption to the “within epsilon” case, since in the original model there was no assumption on the smoothness of the function being predicted (e.g. [0, 0, 0] and [0, 0, 0.000001] could have completely different values).