Yes, that’s precisely what I’m claiming!
Sorry if that wasn’t clear. As for how to establish that, I proposed an intuitive justification:
> There is no mechanism fitting the model to the linear approximation of the data around the training points.
And an outline for a proof:
> Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
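To put the outline in symbols (a sketch, with m the fitted model, assumed smooth, x_i a training input, and ε a small offset): Taylor-expanding the model around a training point,

m(x_i + ε) = m(x_i) + ε·m′(x_i) + O(ε²).

If two target functions f and g agree with the model at x_i but have different derivatives there, then m′(x_i) can equal at most one of f′(x_i) and g′(x_i), so against at least one of the two targets the error near x_i grows like |ε| (first order) rather than ε² (second order), and nothing about fitting the training values controls which.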
Let’s walk through an example:
1. Consider trying to fit a simple function, f(x) = 0. Let’s collect a training dataset D = {(x_i, 0)}, with inputs x_i = iπ for i = 0, 1, …, N.
You optimize a perfect model on D (e.g. a neural net that learns the mapping x ↦ 0).
Now let’s study the scaling of the error as you move a distance ε away from a training point. In this example we achieve error(x_i + ε) = 0, which is certainly O(ε²), since coincidentally the model’s gradient at each training point happens to match the true function’s (both are 0).
2. Consider a second example. Let’s fit f(x) = sin(x). Again, we collect training data at the same inputs x_i = iπ, which yields exactly the same dataset D = {(x_i, 0)}.
You optimize a perfect model on D (using the same optimization procedure on the same dataset, we get the same neural net with mapping x ↦ 0).
Now we see error(x_i + ε) = |sin(ε)| ≈ |ε|, i.e. O(ε) rather than O(ε²) (we predict a flat line at y = 0, and the error measures the distance from a sinusoid). You can notice this visually or analytically:
The model is trained on D alone, and D contains no information about the gradient of the true function at the training points. That means even if by happy accident our optimization procedure achieved error(x_i + ε) = O(ε²) here, we can prove that it is not generally true by considering an identical training dataset generated by a different underlying function (and knowing our optimization must then produce the same model).
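If it helps, here is a minimal numerical sketch of the two examples (the concrete choices are mine, for illustration: targets f1(x) = 0 and f2(x) = sin(x), training inputs at multiples of π, and the constant-zero function standing in for the perfectly fit model that training would produce):

```python
import math

# A minimal numerical sketch of the two examples above. Concrete choices
# (assumptions for illustration): targets f1(x) = 0 and f2(x) = sin(x),
# training inputs at multiples of pi, and the constant-zero function
# standing in for the perfectly fit model that training produces.
def model(x):
    return 0.0          # the "perfectly fit" model: flat line at 0

def f1(x):
    return 0.0          # example 1: the zero function

def f2(x):
    return math.sin(x)  # example 2: a sinusoid with the same training values

train_x = [k * math.pi for k in range(5)]
# Identical training datasets: both targets are 0 at every training input.
assert all(abs(f1(x) - f2(x)) < 1e-9 for x in train_x)

x0 = train_x[2]  # move a distance eps away from one training point
for eps in (1e-1, 1e-2, 1e-3):
    err1 = abs(model(x0 + eps) - f1(x0 + eps))  # exactly 0
    err2 = abs(model(x0 + eps) - f2(x0 + eps))  # ~|sin(eps)| ~ eps: first order
    print(f"eps={eps:g}  err_f1={err1:.1e}  err_f2={err2:.1e}  err_f2/eps={err2/eps:.3f}")
```

Both targets generate the identical dataset, so a single optimization run can only give one model; next to a training point that model’s error is zero for f1 but first order in ε for f2.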
On rereading your original argument:
> since the loss is minimized, the gradient is also zero.
I think this is referring to ∇_θ L = 0, i.e. the gradient of the loss with respect to the model parameters, which is certainly true for a perfectly optimized model (or even just gradient descent that has settled). Maybe that’s where the miscommunication is stemming from, since “gradient of loss” is being overloaded between the discussion of optimization (which uses ∇_θ), and the discussion of Taylor-expanding around a training input (which uses ∇_x).
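To make the overloading concrete, a toy sketch (the constant model m_θ(x) = θ and the MSE loss are my assumptions, just the simplest setup that exhibits it): at the loss minimum the gradient with respect to the parameter is zero, while the gradient with respect to the input of the pointwise error against sin(x) at a training point is not, and nothing in training constrains it.

```python
import math

# A toy sketch of the two different "gradients" (assumed setup for
# illustration: a constant model m_theta(x) = theta, trained with MSE on
# the shared dataset {(k*pi, 0)} from the examples above).
train_x = [k * math.pi for k in range(5)]
train_y = [0.0 for _ in train_x]

def loss(theta):
    # MSE of the constant model against the training labels
    return sum((theta - y) ** 2 for y in train_y) / len(train_y)

theta_star = 0.0   # the minimizer of the loss above
h = 1e-6

# 1) Gradient of the loss w.r.t. the parameter theta: zero at the optimum.
dL_dtheta = (loss(theta_star + h) - loss(theta_star - h)) / (2 * h)
print(f"dL/dtheta at the optimum:        {dL_dtheta: .2e}")   # ~0

# 2) Gradient w.r.t. the *input* x of the pointwise error against the true
#    function sin(x), evaluated at a training point: not zero, and nothing
#    in training constrains it.
def pointwise_error(x):
    return theta_star - math.sin(x)   # model prediction minus ground truth

x0 = train_x[2]
derr_dx = (pointwise_error(x0 + h) - pointwise_error(x0 - h)) / (2 * h)
print(f"d(error)/dx at a training point: {derr_dx: .2f}")     # ~ -1
```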
Oh hi! I linked your video in another comment without noticing this one. Great visual explanation!