Yes, that’s precisely what I’m claiming!
Sorry if that wasn’t clear. As for how to establish that, I proposed an intuitive justification:
> There is no mechanism fitting the model to the linear approximation of the data around the training points.
And an outline for a proof:
> Take two problems which have the same value at the training points but with wildly different linear terms around them. A model perfectly fit to the training points would not be able to distinguish the two.
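To put the outline in symbols (a sketch, with m the fitted model, assumed smooth, x_i a training input, and ε a small offset): Taylor-expanding the model around a training point,

m(x_i + ε) = m(x_i) + ε·m′(x_i) + O(ε²).

If two target functions f and g agree with the model at x_i but have different derivatives there, then m′(x_i) can equal at most one of f′(x_i) and g′(x_i), so against at least one of the two targets the error near x_i grows like |ε| (first order) rather than ε² (second order), and nothing about fitting the training values controls which.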
Let’s walk through an example:
1. Consider trying to fit a simple function, f(x) = 0. Let’s collect a training dataset D = {(x_i, 0)}, with inputs x_i = iπ for i = 0, 1, …, N.
You optimize a perfect model on D (e.g. a neural net that learns the mapping x ↦ 0).
Now let’s study the scaling of the error as you move a distance ε away from a training point. In this example we achieve error(x_i + ε) = 0, which is certainly O(ε²), since coincidentally the model’s gradient at each training point happens to match the true function’s (both are 0).
2. Consider a second example. Let’s fit f(x) = sin(x). Again, we collect training data at the same inputs x_i = iπ, which yields exactly the same dataset D = {(x_i, 0)}.
You optimize a perfect model on D (using the same optimization procedure on the same dataset, we get the same neural net with mapping x ↦ 0).
Now we see error(x_i + ε) = |sin(ε)| ≈ |ε|, i.e. O(ε) rather than O(ε²) (we predict a flat line at y = 0, and the error measures the distance from a sinusoid). You can notice this visually or analytically:
The model is trained on D alone, and D contains no information about the gradient of the true function at the training points. That means even if by happy accident our optimization procedure achieved error(x_i + ε) = O(ε²) here, we can prove that it is not generally true by considering an identical training dataset generated by a different underlying function (and knowing our optimization must then produce the same model).
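If it helps, here is a minimal numerical sketch of the two examples (the concrete choices are mine, for illustration: targets f1(x) = 0 and f2(x) = sin(x), training inputs at multiples of π, and the constant-zero function standing in for the perfectly fit model that training would produce):

```python
import math

# A minimal numerical sketch of the two examples above. Concrete choices
# (assumptions for illustration): targets f1(x) = 0 and f2(x) = sin(x),
# training inputs at multiples of pi, and the constant-zero function
# standing in for the perfectly fit model that training produces.
def model(x):
    return 0.0          # the "perfectly fit" model: flat line at 0

def f1(x):
    return 0.0          # example 1: the zero function

def f2(x):
    return math.sin(x)  # example 2: a sinusoid with the same training values

train_x = [k * math.pi for k in range(5)]
# Identical training datasets: both targets are 0 at every training input.
assert all(abs(f1(x) - f2(x)) < 1e-9 for x in train_x)

x0 = train_x[2]  # move a distance eps away from one training point
for eps in (1e-1, 1e-2, 1e-3):
    err1 = abs(model(x0 + eps) - f1(x0 + eps))  # exactly 0
    err2 = abs(model(x0 + eps) - f2(x0 + eps))  # ~|sin(eps)| ~ eps: first order
    print(f"eps={eps:g}  err_f1={err1:.1e}  err_f2={err2:.1e}  err_f2/eps={err2/eps:.3f}")
```

Both targets generate the identical dataset, so a single optimization run can only give one model; next to a training point that model’s error is zero for f1 but first order in ε for f2.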
On rereading your original argument:
> since the loss is minimized, the gradient is also zero.
I think this is referring to ∇_θ L = 0, i.e. the gradient of the loss with respect to the model parameters, which is certainly true for a perfectly optimized model (or even just gradient descent that has settled). Maybe that’s where the miscommunication is stemming from, since “gradient of loss” is being overloaded between the discussion of optimization (which uses ∇_θ), and the discussion of Taylor-expanding around a training input (which uses ∇_x).
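To make the overloading concrete, a toy sketch (the constant model m_θ(x) = θ and the MSE loss are my assumptions, just the simplest setup that exhibits it): at the loss minimum the gradient with respect to the parameter is zero, while the gradient with respect to the input of the pointwise error against sin(x) at a training point is not, and nothing in training constrains it.

```python
import math

# A toy sketch of the two different "gradients" (assumed setup for
# illustration: a constant model m_theta(x) = theta, trained with MSE on
# the shared dataset {(k*pi, 0)} from the examples above).
train_x = [k * math.pi for k in range(5)]
train_y = [0.0 for _ in train_x]

def loss(theta):
    # MSE of the constant model against the training labels
    return sum((theta - y) ** 2 for y in train_y) / len(train_y)

theta_star = 0.0   # the minimizer of the loss above
h = 1e-6

# 1) Gradient of the loss w.r.t. the parameter theta: zero at the optimum.
dL_dtheta = (loss(theta_star + h) - loss(theta_star - h)) / (2 * h)
print(f"dL/dtheta at the optimum:        {dL_dtheta: .2e}")   # ~0

# 2) Gradient w.r.t. the *input* x of the pointwise error against the true
#    function sin(x), evaluated at a training point: not zero, and nothing
#    in training constrains it.
def pointwise_error(x):
    return theta_star - math.sin(x)   # model prediction minus ground truth

x0 = train_x[2]
derr_dx = (pointwise_error(x0 + h) - pointwise_error(x0 - h)) / (2 * h)
print(f"d(error)/dx at a training point: {derr_dx: .2f}")     # ~ -1
```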
Oh hi! I linked your video in another comment without noticing this one. Great visual explanation!