adamShimi comments on Updating the Lottery Ticket Hypothesis

adamShimi 20 Apr 2021 14:49 UTC
LW: 6 AF: 5
The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall $\Delta\theta$ found during training is actually nearly a solution to the linearly-approximated equations.
Trying to check if I’m understanding correctly: does that mean that despite SGD doing a lot of successive changes that use the gradient at the successive parameter values, these “even out” such that they end up equivalent to a single update from the initial parameters?
- johnswentworth 20 Apr 2021 15:19 UTC
  LW: 6 AF: 5
  Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that’s what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.
  (Caveat to this: if you directly attempt a Newton step on this sort of system, you’ll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)
  - adamShimi 20 Apr 2021 17:24 UTC
    LW: 5 AF: 4
    By Newton step, do you mean one step of Newton’s method?
    - johnswentworth 20 Apr 2021 17:26 UTC
      LW: 2 AF: 2
      Yes.
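
To make the Newton-step vs. gradient-step distinction in this thread concrete, here is a minimal NumPy sketch on a toy linearized problem. Everything in it is an illustrative assumption rather than anything from the post: `J` stands in for the Jacobian of the network outputs with respect to the parameters at initialization, `r` for the residual to be fit, and the sizes are chosen so that there are more parameters than data points. The loss is the least-squares objective $\frac{1}{2}\lVert J\,\Delta\theta - r\rVert^2$, whose Hessian $J^\top J$ is singular in this regime, so the sketch uses a pseudo-inverse in place of a plain inverse for the Newton step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: fewer data points than parameters (underdetermined).
n_data, n_params = 5, 20
J = rng.normal(size=(n_data, n_params))  # stand-in for the Jacobian at init
r = rng.normal(size=n_data)              # stand-in for the target residual


def residual_norm(delta_theta):
    """How far J @ delta_theta is from solving the linear equations."""
    return np.linalg.norm(J @ delta_theta - r)


# Least-squares loss L(d) = 0.5 * ||J d - r||^2 has gradient J^T (J d - r)
# and constant Hessian J^T J, which is rank-deficient here.
H = J.T @ J
hessian_rank = np.linalg.matrix_rank(H)

# One Newton step from delta_theta = 0. A plain inverse of H does not exist,
# so a pseudo-inverse is used instead (explicit cutoff to ignore the
# numerically-zero eigenvalues).
grad_at_zero = -J.T @ r
newton_step = -np.linalg.pinv(H, rcond=1e-10) @ grad_at_zero


def gradient_descent(n_steps):
    """Plain gradient descent from zero with step size 1 / lambda_max(H)."""
    lr = 1.0 / np.linalg.eigvalsh(H).max()
    d = np.zeros(n_params)
    for _ in range(n_steps):
        d -= lr * (J.T @ (J @ d - r))
    return d


one_gradient_step = gradient_descent(1)
many_gradient_steps = gradient_descent(5000)

print("rank of Hessian:", hessian_rank, "out of", n_params)
print("residual after one Newton step:    ", residual_norm(newton_step))
print("residual after one gradient step:  ", residual_norm(one_gradient_step))
print("residual after many gradient steps:", residual_norm(many_gradient_steps))
```

In this toy setup the single pseudo-inverse Newton step solves the linear equations, one gradient step does not, and many gradient steps started from zero approach the same solution. The rank deficiency of `H` is the underdetermination mentioned in the caveat above; whether a naive solver raises an error or silently returns a poor answer on such a matrix depends on the library and on floating-point details.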