And then we have L(xtrain)=0 and also ∇L(xtrain)=0. (This does assume a “sufficiently nice” loss function, that is satisfied by most loss functions used in practice.)
I agree ∇L(xtest) is not zero. I also agree if you take some point in between xtrain and xtest it can have non-zero loss, e.g.L(xtrain+xtest2) need not be zero. I’m not sure if either of these are what you’re trying to say, but in any case they aren’t relevant to the quoted sentence.
If you are claiming ∇L(xtrain)≠0, then I disagree and am unclear on how your arguments are supposed to establish that.
Sorry if that wasn’t clear. As for how to establish that, I proposed an intuitive justification:
There is no mechanism fitting the model to the linear approximation of the data around the training points.
And an outline for a proof:
Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
Let’s walk through an example
Consider trying to fit a simple function, f(x)=0 . Let’s collect a training dataset
Dtrain=[(x=0,y=0),(x=1,y=0),(x=2,y=0),…]
You optimize a perfect model on Dtrain (e.g. a neural net with mapping NN(x)=0)
Now let’s study the scaling of error as you move away from training points. In the example, we achieved ∇xL(xtrain)=0, since coincidentally ∀xf(x)=NN(x)
2. Consider a second example. Let’s fit f(x)=sin(x/π). Again, we collect training data
Dtrain=[(x=0,y=0),(x=1,y=0),(x=2,y=0),…]
You optimize a perfect model on Dtrain (using the same optimization procedure, we get a neural net with mapping NN(x)=0)
Now, we see ∇xL(xtrain)≠0 (we predict a flat line at y=0, and L measures error from a sinusoid). You can notice this visually or analytically:
The model is trained on xtrain,f(xtrain), and is independent of ∇xf(xtrain). That means even if by happy accident our optimization procedure achieves ∇xL(xtrain)=0, we can prove that it is not generally true by considering an identical training dataset with a different underlying function (and knowing our optimization must result in the same model)
On rereading your original argument:
since the loss is minimized, the gradient is also zero.
I think this is referring to ∇θL(xtrain)=0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that’s where the miscommunication is stemming from, since “gradient of loss” is being overloaded from discussion of optimization ∇θL, and discussion of Taylor-expanding L around xtrain (which uses ∇xL)
I think this is referring to ∇θL(xtrain)=0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that’s where the miscommunication is stemming from
Ah, yup, that’s the issue, and I agree you’re correct that ∇xL is the relevant thing here. I’ll edit the post to say I’m no longer sure about the claim. (I don’t have the time to understand how this lines up with the actual paper—I remember it being kind of sparse and not trivial to follow—perhaps you could look into it and leave a comment here.)
I think this little mistake doesn’t affect the gist of your summary post, I wouldn’t worry about it too much.
The mistake
The mistaken argument was an attempt to explain why α≥2/d. Believe it or not, the paper never actually argued α≥2/d, it argued α≥1/d. That’s because the paper used the L2 loss, and α≥2/d(L2)⟺α≥1/d(L1).
I feel that was the main misunderstanding.
Figure 2b in the paper shows α=4/d(L2) for most models but α=2/d(L2) for some models.
The paper never mentions L2 loss, just that the loss function is “analytic in f−F and minimized at f=F.” Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it’s cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.
Some people like L2 loss or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway, once you subtract a big source of loss. E.g. variance-limited scaling has α=1 in both L1 and L2 because you subtract L(∞,P) or L(D,∞) from L(D,P). Even resolution-limited scaling may require subtracting loss due to noise L(∞,∞). L2 is nicer because if L(xtrain) is zero, ∇xL2(xtrain)=0 but ∇xL1(xtrain) is undefined since the absolute value is undifferentiable at 0.
If you read closely, α≥4/d(L2) only refers to piecewise linear approximations:
L∝D−n/d, at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find n≥4.
The second paper says the same:
If the model is piecewise linear instead of piecewise constant and f(x) is smooth with bounded derivatives, then the deviation |f(x)−c(x)|∝s2, and so the L2 loss will scale[4] as s4. We would predict L(N)∝1N4/d
Nearest neighbors regression/classification has α=1/d(L1). Skip this section if you already agree:
If you use the nearest neighbors regression to fit the function f(x)=sin(x/π), it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(-1) because adding data makes the staircase smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(-2) or better because adding data makes the “staircase” both smaller and smoother. Both the distances and the loss gradient decrease.
My guess is that nearest neighbors classification also has L1 loss proportional to D^(-1/d), because the volume that is misclassified is roughly proportional to distance between the decision boundary and the nearest point. Again, I think it creates a decision boundary that is rough (analogous to the staircase), and the roughness gets smaller with additional data but never gets any smoother.
If you want a quick fix, you might change the following:
Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
becomes
Under the assumption that fmodel−f is sufficiently “nice”, we can do a Taylor expansion of it around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the difference is zero at the training data point. With a constant term of zero, we use the linear term (gradient ⋅ displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d). For loss functions like the L2 loss, which scale as the difference squared, it becomes kD^(-2/d) so α = 2/d.
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the difference between the true value and the model’s value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (the linear term difference also decreases when the distance decreases). For L2 loss, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d.
Also change
Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss)
and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small—language models just have a high intrinsic dimension)
to
Once again, we make the assumption that the learned model is better than a piecewise constant approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for a piecewise linear approximation)
and checking whether α >= 2/d. In most cases, they find that it is quite close to 4/d. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with α >= 2/d (though α is still relatively small—language models just have a high intrinsic dimension)
Conclusion
Once AI scientists make a false conclusion like α≥2/d(L1) or ∇xL1(xtrain)=0, they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.
I’m so sorry I’m so immature but I can’t help it.
Overall, it’s not a big mistake because it doesn’t invalidate the gist of the summary. It’s very subtle, unlike those glaring mistake that I’ve seen in works by other people… and myself. :)
Mathematically, the Taylor expansion is:
L(xtest)=L(xtrain)+(xtest−xtrain)⋅∇L(xtrain)+12(xtest−xtrain)THL(xtrain)(xtest−xtrain)+...
And then we have L(xtrain)=0 and also ∇L(xtrain)=0. (This does assume a “sufficiently nice” loss function, that is satisfied by most loss functions used in practice.)
I agree ∇L(xtest) is not zero. I also agree if you take some point in between xtrain and xtest it can have non-zero loss, e.g.L(xtrain+xtest2) need not be zero. I’m not sure if either of these are what you’re trying to say, but in any case they aren’t relevant to the quoted sentence.
If you are claiming ∇L(xtrain)≠0, then I disagree and am unclear on how your arguments are supposed to establish that.
Yes, that’s precisely what I’m claiming!
Sorry if that wasn’t clear. As for how to establish that, I proposed an intuitive justification:
And an outline for a proof:
Let’s walk through an example
Dtrain=[(x=0,y=0),(x=1,y=0),(x=2,y=0),…]Consider trying to fit a simple function, f(x)=0 . Let’s collect a training dataset
You optimize a perfect model on Dtrain (e.g. a neural net with mapping NN(x)=0)
Now let’s study the scaling of error as you move away from training points. In the example, we achieved ∇xL(xtrain)=0, since coincidentally ∀xf(x)=NN(x)
2. Consider a second example. Let’s fit f(x)=sin(x/π). Again, we collect training data
Dtrain=[(x=0,y=0),(x=1,y=0),(x=2,y=0),…]
derivation: ∇xL(xtrain)=∇x∥f(x)−NN(x)∥x=xtrain=∇x∥f(x)−0∥x=xtrain=∇x∥sin(x/π)∥x∈Z≠0You optimize a perfect model on Dtrain (using the same optimization procedure, we get a neural net with mapping NN(x)=0)
Now, we see ∇xL(xtrain)≠0 (we predict a flat line at y=0, and L measures error from a sinusoid). You can notice this visually or analytically:
The model is trained on xtrain,f(xtrain), and is independent of ∇xf(xtrain). That means even if by happy accident our optimization procedure achieves ∇xL(xtrain)=0, we can prove that it is not generally true by considering an identical training dataset with a different underlying function (and knowing our optimization must result in the same model)
On rereading your original argument:
I think this is referring to ∇θL(xtrain)=0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that’s where the miscommunication is stemming from, since “gradient of loss” is being overloaded from discussion of optimization ∇θL, and discussion of Taylor-expanding L around xtrain (which uses ∇xL)
Ah, yup, that’s the issue, and I agree you’re correct that ∇xL is the relevant thing here. I’ll edit the post to say I’m no longer sure about the claim. (I don’t have the time to understand how this lines up with the actual paper—I remember it being kind of sparse and not trivial to follow—perhaps you could look into it and leave a comment here.)
I think this little mistake doesn’t affect the gist of your summary post, I wouldn’t worry about it too much.
The mistake
The mistaken argument was an attempt to explain why α≥2/d. Believe it or not, the paper never actually argued α≥2/d, it argued α≥1/d. That’s because the paper used the L2 loss, and α≥2/d (L2)⟺α≥1/d (L1).
I feel that was the main misunderstanding.
Figure 2b in the paper shows α=4/d (L2) for most models but α=2/d (L2) for some models.
The paper never mentions L2 loss, just that the loss function is “analytic in f−F and minimized at f=F.” Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it’s cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.
Some people like L2 loss or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway, once you subtract a big source of loss. E.g. variance-limited scaling has α=1 in both L1 and L2 because you subtract L(∞,P) or L(D,∞) from L(D,P). Even resolution-limited scaling may require subtracting loss due to noise L(∞,∞). L2 is nicer because if L(xtrain) is zero, ∇xL2(xtrain)=0 but ∇xL1(xtrain) is undefined since the absolute value is undifferentiable at 0.
If you read closely, α≥4/d (L2) only refers to piecewise linear approximations:
The second paper says the same:
Nearest neighbors regression/classification has α=1/d (L1). Skip this section if you already agree:
If you use the nearest neighbors regression to fit the function f(x)=sin(x/π), it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(-1) because adding data makes the staircase smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(-2) or better because adding data makes the “staircase” both smaller and smoother. Both the distances and the loss gradient decrease.
My guess is that nearest neighbors classification also has L1 loss proportional to D^(-1/d), because the volume that is misclassified is roughly proportional to distance between the decision boundary and the nearest point. Again, I think it creates a decision boundary that is rough (analogous to the staircase), and the roughness gets smaller with additional data but never gets any smoother.
Note the rough decision boundaries in a picture of nearest neighbors classification:
https://upload.wikimedia.org/wikipedia/commons/5/52/Map1NN.png
Suggestions
If you want a quick fix, you might change the following:
becomes
Also change
to
Conclusion
Once AI scientists make a false conclusion like α≥2/d (L1) or ∇xL1(xtrain)=0, they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.
I’m so sorry I’m so immature but I can’t help it.
Overall, it’s not a big mistake because it doesn’t invalidate the gist of the summary. It’s very subtle, unlike those glaring mistake that I’ve seen in works by other people… and myself. :)