The LCA paper (to be summarized in AN #98) presents a method for understanding how specific updates to specific parameters contribute to the overall loss. The basic idea is to first decompose the overall change in training loss across training iterations:
$$L(\theta_T) - L(\theta_0) = \sum_t \left[ L(\theta_t) - L(\theta_{t-1}) \right]$$
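This first step is just a telescoping sum over the recorded losses. As a trivial sanity check in Python (the loss values here are made up):

```python
# Telescoping decomposition: the total change in training loss equals the
# sum of the per-step changes. losses[t] stands in for hypothetical
# recorded values of L(theta_t) at five checkpoints.
losses = [2.30, 1.95, 1.71, 1.80, 1.52]
per_step = [losses[t] - losses[t - 1] for t in range(1, len(losses))]
assert abs(sum(per_step) - (losses[-1] - losses[0])) < 1e-12
```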
And then to decompose each per-step change in loss across specific parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$
I’ve added vector arrows to emphasize that $\theta$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We’ll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:
$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}(\nabla L(\theta)).$$
(This is pretty standard, but I’ve included a derivation at the end.)
Since this is a dot product, it decomposes into a sum over the individual parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \sum_i \left( \theta_t^{(i)} - \theta_{t-1}^{(i)} \right) \text{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$$
So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as
$$A_t^{(i)} = \left( \theta_t^{(i)} - \theta_{t-1}^{(i)} \right) \text{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$$
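In code, this per-parameter bookkeeping is just an elementwise product. Here is a minimal numpy sketch, where `avg_grad` stands in for $\text{Average}_{t-1}^{t}(\nabla L(\theta))$ and all the numbers are made up for illustration:

```python
import numpy as np

def per_parameter_contributions(theta_prev, theta_next, avg_grad):
    """A_t^(i) = (theta_t^(i) - theta_{t-1}^(i)) * Average(grad L)^(i)."""
    return (theta_next - theta_prev) * avg_grad

# Hypothetical parameter values before/after one step, and a hypothetical
# average gradient along the path between them.
theta_prev = np.array([0.5, -1.2, 2.0])
theta_next = np.array([0.4, -1.0, 1.9])
avg_grad   = np.array([1.0, -2.0, 0.5])

A = per_parameter_contributions(theta_prev, theta_next, avg_grad)
print(A)        # contribution of each individual parameter
print(A.sum())  # the total change in loss, L(theta_t) - L(theta_{t-1})
```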
So based on this, I’m going to define my own version of LCA, called $\text{LCA}_{\text{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $\text{LCA}_{\text{Naive}}$ uses the approximation $\text{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving
$$A_{t,\text{Naive}}^{(i)} = \left( \theta_t^{(i)} - \theta_{t-1}^{(i)} \right) G_{t-1}^{(i)}.$$
But the SGD update is given by $\theta_t^{(i)} = \theta_{t-1}^{(i)} - \alpha G_{t-1}^{(i)}$ (where $\alpha$ is the learning rate), which implies that
$$A_{t,\text{Naive}}^{(i)} = \left( -\alpha G_{t-1}^{(i)} \right) G_{t-1}^{(i)} = -\alpha \left( G_{t-1}^{(i)} \right)^2,$$
which is never positive, i.e. it predicts that every parameter learns in every iteration. This isn’t surprising: we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
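To see this concretely, here is a small sketch of $\text{LCA}_{\text{Naive}}$ under plain SGD, with a randomly drawn vector standing in for $G_{t-1}$:

```python
import numpy as np

# LCA_Naive under plain SGD: since theta_t = theta_{t-1} - alpha * G_{t-1},
# every contribution is -alpha * G_{t-1}^2 <= 0, so every parameter is
# predicted to "learn" at every step.
rng = np.random.default_rng(0)
alpha = 0.1
theta_prev = rng.normal(size=5)
G_prev = rng.normal(size=5)               # gradient at theta_{t-1}

theta_next = theta_prev - alpha * G_prev  # SGD update
A_naive = (theta_next - theta_prev) * G_prev

print(A_naive)                            # equals -alpha * G_prev**2
assert np.all(A_naive <= 0)
```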
Yet, the experiments in the paper sometimes show positive LCAs. What’s up with that? There are a few differences between $\text{LCA}_{\text{Naive}}$ and the actual method used in the paper:
1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.
2. $\text{LCA}_{\text{Naive}}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.
3. $\text{LCA}_{\text{Naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (and which gets worse the larger your learning rate / step size is). LCA proper uses multiple gradient estimates between $\theta_{t-1}$ and $\theta_t$ to reduce the approximation error (see the sketch after this list).
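Here is a sketch of what difference 3 amounts to: estimating the average gradient from several points along the straight line between $\theta_{t-1}$ and $\theta_t$, rather than from the single endpoint gradient. The toy loss and all the numbers are my own, and this is not necessarily the paper’s exact scheme:

```python
import numpy as np

def grad_L(theta):
    # Gradient of the toy loss L(theta) = 0.5*||theta||^2 + sum(sin(theta)).
    return theta + np.cos(theta)

def average_gradient(theta_prev, theta_next, n_points=10):
    # Riemann-sum estimate of Average(grad L) along the linear path,
    # evaluated at n_points midpoints in [0, 1].
    ts = (np.arange(n_points) + 0.5) / n_points
    pts = theta_prev + ts[:, None] * (theta_next - theta_prev)
    return np.mean([grad_L(p) for p in pts], axis=0)

theta_prev = np.array([1.0, -0.5])
theta_next = np.array([0.7, -0.3])

A_multi = (theta_next - theta_prev) * average_gradient(theta_prev, theta_next)
A_naive = (theta_next - theta_prev) * grad_L(theta_prev)
print(A_multi)  # multi-point estimate of each parameter's contribution
print(A_naive)  # single-endpoint (naive) estimate, for comparison
```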
I think those are the only differences (though it’s always hard to tell if there’s some unmentioned detail that creates another difference), which means that whenever the paper says “these parameters had positive LCA”, that effect can be attributed to some combination of the above 3 factors.
----
Derivation of turning the path integral into a dot product with an average:
$$L(\theta_t) - L(\theta_{t-1}) = \lim_{n \to \infty} \sum_{k=0}^{n-1} \left( \nabla L(\theta_{t-1} + k \Delta\theta) \cdot \Delta\theta \right), \quad \text{where } \Delta\theta = \frac{1}{n} (\theta_t - \theta_{t-1})$$
$$= \lim_{n \to \infty} n \Delta\theta \cdot \left( \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta) \right)$$
$$= \lim_{n \to \infty} (\theta_t - \theta_{t-1}) \cdot \left( \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta) \right)$$
$$= (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}(\nabla L(\theta)),$$
where the average is defined as $\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta)$.
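As a numerical check, here is a sketch on a toy loss of my own choosing, using the same left-endpoint sum as the derivation; the dot product with the $n$-point average gradient converges to the exact change in loss as $n$ grows:

```python
import numpy as np

def L(theta):
    return 0.5 * np.sum(theta ** 2) + np.sum(np.sin(theta))

def grad_L(theta):
    return theta + np.cos(theta)

theta_prev = np.array([1.0, -2.0, 0.3])
theta_next = np.array([0.5, -1.5, 0.1])
exact = L(theta_next) - L(theta_prev)

for n in [1, 10, 100, 1000]:
    dtheta = (theta_next - theta_prev) / n
    pts = [theta_prev + k * dtheta for k in range(n)]  # left endpoints
    avg_grad = np.mean([grad_L(p) for p in pts], axis=0)
    approx = (theta_next - theta_prev) @ avg_grad
    print(n, approx, exact)  # approx converges to exact as n grows
```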