Maybe I am misunderstanding something, but I don’t think the parameter tangent hypothesis can be generally correct.
Let’s say we have 1 datapoint A to be mapped to −1 and 100 copies of datapoint B to be mapped to +1. The model is randomly initialised. Now the parameter tangent space is just the current model plus the gradient over the dataset times some delta. The gradient over the entire dataset is the sum of the gradients for each datapoint. Therefore the gradient points a hundred times more towards the solution for input B than towards the solution for input A. If we search solely in the tangent space, we will either solve B and get a minuscule improvement for A, or solve A and massively overshoot the best parameters for B.
I.e. to reach the final parameters with a single update, the computed gradient has to be balanced between all possible skills the model is supposed to exhibit after training. Otherwise the gradient will not point towards a global-ish minimum but towards a frequency-of-problem-weighted minimum.
SGD solves this problem because the gradient for B shrinks the more updates have been made in the direction of B’s solution. So in our example the gradient for B would shrink to zero during training, and A would get its time in the sun.
Intuitively, the parameter tangent space would be correct for MNIST and other well-balanced small datasets. But for large language models (to pick a random example), it is not clear what "well balanced" in the above sense even means.
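A minimal numeric sketch of the A-vs-B example above (my own toy construction, assuming a two-parameter linear model with squared error; the inputs, learning rates, and names like x_A and x_B are illustrative choices):

```python
# Toy check of the A-vs-B argument (illustrative sketch, not from the original comment):
# 1 datapoint A -> -1 and 100 copies of datapoint B -> +1, linear model, squared error.
import numpy as np

x_A, y_A = np.array([1.0, 0.0]), -1.0    # the rare datapoint
x_B, y_B = np.array([0.0, 1.0]), +1.0    # the common datapoint (100 copies)

def grad(w):
    # gradient of the summed squared error over the whole dataset
    g_A = 2 * (w @ x_A - y_A) * x_A
    g_B = 100 * 2 * (w @ x_B - y_B) * x_B
    return g_A + g_B

w0 = np.zeros(2)                          # stand-in for the random initialisation
g0 = grad(w0)
print("initial gradient:", g0)            # roughly 100x larger in B's direction

# A step sized so that B ends up solved barely moves A ...
w_B = w0 - 0.005 * g0
print("step sized for B: f(A) =", w_B @ x_A, " f(B) =", w_B @ x_B)

# ... while a step sized so that A ends up solved massively overshoots B.
w_A = w0 - 0.5 * g0
print("step sized for A: f(A) =", w_A @ x_A, " f(B) =", w_A @ x_B)

# Iterated full-batch gradient descent still fits both, because B's gradient
# shrinks towards zero once B is nearly solved and A gets its time in the sun.
w = np.zeros(2)
for _ in range(5000):
    w -= 0.004 * grad(w)
print("after many steps:  f(A) =", w @ x_A, " f(B) =", w @ x_B)
```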
I think you’re confusing one-update-of-gradient-descent with one-update-of-Newton’s-method. I.e.:
to reach the final parameters with a single update, the computed gradient has to be balanced between...
In general, a single gradient-descent update is not sufficient to solve a system of linear equations. (Your reasoning for why that is is basically correct, though the usual explanation is more general.) Indeed, SGD does not converge in one step in practice.
One step of Newton’s method is sufficient to solve a system of linear equations, but that’s equivalent to many steps of gradient descent.
No, when I say single update, I just mean that the final model can in principle be reached by a single update with the initial gradient. I’m aware that in practice you need more steps to compute the correct delta.
My argument is solely about the initial gradient. It does not point to the minimum SGD would reach, because the initial gradient tries harder to solve common problems, but the SGD-minimum (ideally) solves even rare problems. SGD manages to do this because common problems stop influencing later gradients once they are already solved.
No, when I say single update, I just mean that the final model can in principle be reached by a single update with the initial gradient.
The final model cannot be reached by a single update with the initial gradient, because in a system of linear equations (i.e. objective is squared error in the system, or anything along those lines), the gradient does not point straight toward the solution. It’s not just a question of computing the correct delta; the initial gradient doesn’t even point in the right direction.
Ok, I thought your F(x) was one update step of the gradient of f times ΔΘ away from f. I guess then I just don’t understand the equation.
Ah, I guess I understand now. I was always thinking about updating the parameters, but you are talking about adding to the function output.
Ok, I should walk through this, since there are multiple people confused about it.
The training process effectively solves the set of (linear) equations
$$y^{(n)} = f(x^{(n)}, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x^{(n)}, \theta_0)$$
There’s one equation for each data point, and one unknown for each dimension of θ.
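To make the shapes concrete, here is a minimal sketch (my own toy two-parameter model, not taken from the post) of stacking these equations into a matrix system, with one row per data point and one column per parameter:

```python
# Illustrative sketch: the linearised training equations written as a matrix system
#   J @ delta_theta = y - f(x, theta_0),
# with one row of J per data point and one column per parameter.
import numpy as np

def f(x, theta):                 # toy 2-parameter model: f(x) = a * tanh(b * x)
    a, b = theta
    return a * np.tanh(b * x)

def dfdtheta(x, theta):          # analytic df/dtheta, evaluated at theta_0
    a, b = theta
    return np.stack([np.tanh(b * x), a * x / np.cosh(b * x) ** 2], axis=-1)

x = np.array([-1.0, 0.5])        # two data points ...
y = np.array([-0.8, 0.4])        # ... with their targets
theta0 = np.array([0.5, 1.0])    # initial parameters

J = dfdtheta(x, theta0)          # 2 x 2: one equation per data point, one unknown per parameter
delta_theta = np.linalg.solve(J, y - f(x, theta0))
print(delta_theta)               # the delta_theta that solves the linearised equations exactly
```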
Key point: solving sets of equations is not what gradient descent does. Gradient descent minimizes a scalar function. In order for gradient descent to solve the problem, we need to transform “solve the set of equations” into “minimize a scalar function”. Typically, we do that by choosing an objective like “minimize the sum of squared errors” or something along those lines—e.g. in this case we’d probably use:
$$\mathrm{obj}(\Delta\theta) = \sum_n \left( y^{(n)} - \left( f(x^{(n)}, \theta_0) + \Delta\theta \cdot \frac{df}{d\theta}(x^{(n)}, \theta_0) \right) \right)^2$$
This is a quadratic function of Δθ (i.e. it’s second-order in Δθ). If we compute the gradient of this function, then we’ll find that it does not point toward the minimum of the function—the gradient only uses first-order (i.e. linear) information. The gradient does not point toward the solution of the original system of equations. Thus: gradient descent will not solve the set of equations in one step.
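A quick numerical illustration of that last point, with a made-up J standing in for the stacked df/dθ rows and r standing in for y − f(x, θ₀); the numbers are arbitrary:

```python
# Sketch: the gradient of the squared-error objective at delta_theta = 0
# does not point toward the objective's minimiser.
import numpy as np

J = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])   # stand-in for the stacked df/dtheta rows
r = np.array([1.0, 2.0, 2.0])                         # stand-in for y - f(x, theta_0)

def grad_obj(d):
    return -2 * J.T @ (r - J @ d)                     # gradient of sum((r - J d)^2)

d_star = np.linalg.lstsq(J, r, rcond=None)[0]         # the actual minimiser (here [1, 1])
neg_g = -grad_obj(np.zeros(2))                        # direction a gradient step would take

print("direction of the gradient step:", neg_g / np.linalg.norm(neg_g))
print("direction toward the minimiser:", d_star / np.linalg.norm(d_star))
```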
In order to solve the set of equations in one step, we could use a second-order minimization method like Newton’s (gradient descent is first-order). Key concept: minimizing an objective function means solving the set of equations ∇obj=0; minimization is equivalent to system-solving with one more derivative. So, a first-order method for solving a system is equivalent to a second-order method for minimizing an objective (both are Newton’s method).
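And the same toy numbers showing the contrast (again my own sketch): one Newton step, using the Hessian 2·JᵀJ of the quadratic objective, lands exactly on the solution, while plain gradient descent only gets there after many steps.

```python
# Sketch: one Newton step on the quadratic objective solves the system;
# gradient descent needs many steps to reach the same point.
import numpy as np

J = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
r = np.array([1.0, 2.0, 2.0])

grad_obj = lambda d: -2 * J.T @ (r - J @ d)
H = 2 * J.T @ J                                        # Hessian of the quadratic objective

d_newton = np.zeros(2) - np.linalg.solve(H, grad_obj(np.zeros(2)))
print("one Newton step:    ", d_newton)                # [1, 1], the exact solution

d = np.zeros(2)
for _ in range(1000):                                  # first-order method: many small steps
    d -= 0.1 * grad_obj(d)
print("1000 gradient steps:", d)                       # converges to the same point
```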
Does this make more sense?
Yes, definitely, thank you!
Though I was originally confused on a much more basic level, due to superficial reading, jumping to conclusions and not having touched much calculus notation in the last 15 years.