Ah, I guess I understand now. I was always thinking about updating the parameters, but you are talking about adding to the function output.
Ok, I should walk through this; there are multiple people confused about it.
The training process effectively solves the set of (linear) equations
$$y^{(n)} = f(x^{(n)}, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x^{(n)}, \theta_0)$$
There’s one equation for each data point, and one unknown for each dimension of $\theta$.
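To make that concrete, here is a minimal sketch in JAX. The model, toy data, and starting point $\theta_0$ are all made up for illustration (none of them come from the discussion above); the point is just that the Jacobian has one row per data point and one column per parameter, and $\Delta\theta$ is the unknown.

```python
import jax
import jax.numpy as jnp

# Hypothetical two-parameter model: f(x, theta) = theta[0] * tanh(theta[1] * x).
# Any differentiable model would do; this one is only for illustration.
def f(x, theta):
    return theta[0] * jnp.tanh(theta[1] * x)

# Made-up toy data and starting point theta_0.
x = jnp.linspace(-1.0, 1.0, 5)                # 5 data points
y = jnp.array([-0.9, -0.4, 0.0, 0.4, 0.9])    # targets y^(n)
theta0 = jnp.array([1.0, 2.0])                # 2 parameters

f0 = jax.vmap(lambda xn: f(xn, theta0))(x)                       # f(x^(n), theta_0), shape (5,)
J = jax.vmap(lambda xn: jax.grad(f, argnums=1)(xn, theta0))(x)   # rows of df/dtheta, shape (5, 2)

# Linearized system: y^(n) = f(x^(n), theta_0) + J[n] @ delta_theta,
# i.e. J @ delta_theta = y - f0 -- five equations, two unknowns.
residual = y - f0
```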
Key point: solving sets of equations is not what gradient descent does. Gradient descent minimizes a scalar function. In order for gradient descent to solve the problem, we need to transform “solve the set of equations” into “minimize a scalar function”. Typically, we do that by choosing an objective like “minimize the sum of squared errors” or something along those lines—e.g. in this case we’d probably use:
$$\text{obj}(\Delta\theta) = \sum_n \left( y^{(n)} - f(x^{(n)}, \theta_0) - \Delta\theta \cdot \frac{df}{d\theta}(x^{(n)}, \theta_0) \right)^2$$
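In code, continuing the made-up sketch above, that objective is just the squared residual of the linearized prediction:

```python
# Least-squares objective in delta_theta, using y, f0, and J from the sketch above.
def obj(delta_theta):
    pred = f0 + J @ delta_theta        # linearized prediction
    return jnp.sum((y - pred) ** 2)    # sum of squared errors
```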
This is a quadratic function of $\Delta\theta$ (i.e. it’s second-order in $\Delta\theta$). If we compute the gradient of this function, we’ll find that it does not point toward the minimum of the function: the gradient only uses first-order (i.e. linear) information, so it does not point toward the solution of the original system of equations either. Thus: gradient descent will not solve the set of equations in one step.
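You can check this numerically, still on the made-up sketch (the learning rate here is arbitrary): one gradient step from $\Delta\theta = 0$ generally lands somewhere different from the least-squares solution of the linearized system.

```python
# One gradient step on obj from delta_theta = 0 (learning rate chosen arbitrarily).
g = jax.grad(obj)(jnp.zeros(2))
one_gd_step = -0.1 * g

# The actual solution of the linearized equations (in the least-squares sense).
lstsq_solution, *_ = jnp.linalg.lstsq(J, residual)

print(one_gd_step)     # generally not equal to...
print(lstsq_solution)  # ...the solution of the system
```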
In order to solve the set of equations in one step, we could use a second-order minimization method like Newton’s method (gradient descent is only first-order). Key concept: minimizing an objective function means solving the set of equations $\nabla \text{obj} = 0$, so minimization is equivalent to equation-solving with one extra derivative. A first-order method for solving a system is therefore equivalent to a second-order method for minimizing an objective (both are Newton’s method).
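Here is the same point in code, again continuing the made-up sketch: because the objective is quadratic, the Newton step $-H^{-1}\nabla\text{obj}$ equals the least-squares solution of the linearized system, so it reaches the minimum in a single step.

```python
# Newton's method on the quadratic objective: the Hessian is 2 J^T J and the
# gradient at 0 is -2 J^T residual, so the Newton step is (J^T J)^{-1} J^T residual.
H = jax.hessian(obj)(jnp.zeros(2))
newton_step = -jnp.linalg.solve(H, jax.grad(obj)(jnp.zeros(2)))

print(jnp.allclose(newton_step, lstsq_solution, atol=1e-4))  # True, up to float error
```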
Does this make more sense?
Yes, definitely, thank you!
Though I was originally confused on a much more basic level, due to superficial reading, jumping to conclusions and not having touched much calculus notation in the last 15 years.