Ok, I should walk through this, since there are multiple people confused about it.
The training process effectively solves the set of (linear) equations
$$y^{(n)} = f(x^{(n)}, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x^{(n)}, \theta_0)$$
There’s one equation for each data point, and one unknown for each dimension of θ.
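To make the shape of that system concrete, here's a minimal numpy sketch. The toy model `f`, the data, and the finite-difference Jacobian are all invented for illustration, not taken from the original discussion:

```python
import numpy as np

# Toy setup: a tiny nonlinear model f(x, theta) with 2 parameters,
# evaluated at N = 5 data points. Everything here is made up for illustration.
def f(x, theta):
    return theta[0] * np.tanh(theta[1] * x)

rng = np.random.default_rng(0)
x = rng.normal(size=5)            # the x^(n)
theta0 = np.array([1.0, 0.5])     # current parameters theta_0
y = f(x, np.array([1.3, 0.7]))    # targets from some "true" parameters

# Jacobian J[n, i] = df/dtheta_i at (x^(n), theta_0), via finite differences.
eps = 1e-6
J = np.stack([
    (f(x, theta0 + eps * np.eye(2)[i]) - f(x, theta0)) / eps
    for i in range(2)
], axis=1)

# The linearized system: one equation per data point, one unknown per
# dimension of theta:   f(x^(n), theta_0) + (J @ dtheta)_n  =  y^(n)
residual = y - f(x, theta0)
print("system is J @ dtheta =", residual)
```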
Key point: solving sets of equations is not what gradient descent does. Gradient descent minimizes a scalar function. In order for gradient descent to solve the problem, we need to transform “solve the set of equations” into “minimize a scalar function”. Typically, we do that by choosing an objective like “minimize the sum of squared errors” or something along those lines—e.g. in this case we’d probably use:
$$\mathrm{obj}(\Delta\theta) = \sum_n \left( y^{(n)} - \left( f(x^{(n)}, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x^{(n)}, \theta_0) \right) \right)^2$$
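Continuing the same toy sketch (reusing the hypothetical `J` and `residual` from the snippet above), the squared-error objective in Δθ is just:

```python
# Sum-of-squared-errors objective in dtheta, built from J and residual above.
# obj(dtheta) = sum_n ( y^(n) - ( f(x^(n), theta_0) + (J @ dtheta)_n ) )^2
def obj(dtheta):
    return np.sum((residual - J @ dtheta) ** 2)

print(obj(np.zeros(2)))  # squared error before taking any step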
This is a quadratic function of Δθ (i.e. it’s second-order in Δθ). If we compute the gradient of this function, then we’ll find that it does not point toward the minimum of the function—the gradient only uses first-order (i.e. linear) information. The gradient does not point toward the solution of the original system of equations. Thus: gradient descent will not solve the set of equations in one step.
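One way to see this numerically, continuing the sketch above: the gradient of the quadratic at Δθ = 0 is −2 Jᵀ·residual, while the actual minimizer is the least-squares solution of the linear system, and the two generally point in different directions:

```python
# Gradient of obj at dtheta = 0: grad = -2 * J^T @ residual.
grad = -2.0 * J.T @ residual

# Exact minimizer of the quadratic: the least-squares solution of J @ dtheta = residual.
dtheta_star, *_ = np.linalg.lstsq(J, residual, rcond=None)

# -grad and dtheta_star are generally not parallel, so no single gradient
# step (whatever the step size) lands exactly on the minimizer.
cos = (-grad) @ dtheta_star / (np.linalg.norm(grad) * np.linalg.norm(dtheta_star))
print("cosine between -grad and dtheta*:", cos)
```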
In order to solve the set of equations in one step, we could use a second-order minimization method like Newton's method (gradient descent is only first-order). Key concept: minimizing an objective function means solving the set of equations $\nabla\,\mathrm{obj} = 0$, so minimization is equivalent to equation-solving with one extra derivative. A first-order method for solving a system therefore corresponds to a second-order method for minimizing an objective; both are Newton's method.
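To round out the toy sketch: the Hessian of the quadratic objective is 2 JᵀJ, and a single Newton step from Δθ = 0 lands exactly on the minimizer, i.e. it solves the linearized system (in the least-squares sense) in one shot:

```python
# Newton step on the quadratic objective from dtheta = 0:
#   dtheta_newton = -H^{-1} @ grad,   with Hessian H = 2 * J^T @ J.
H = 2.0 * J.T @ J
dtheta_newton = np.linalg.solve(H, -grad)

# For a quadratic, one Newton step reaches the minimizer exactly, which is
# the same dtheta the least-squares solve gave us above.
print(np.allclose(dtheta_newton, dtheta_star))  # True
```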
Does this make more sense?
Yes, definitely, thank you!
Though I was originally confused on a much more basic level, due to superficial reading, jumping to conclusions, and not having touched much calculus notation in the last 15 years.