The LCA paper (to be summarized in AN #98) presents a method for understanding how specific updates to specific parameters contribute to the overall loss. The basic idea is to first decompose the overall change in training loss across training iterations:
$$L(\theta_T) - L(\theta_0) = \sum_t \left[ L(\theta_t) - L(\theta_{t-1}) \right]$$
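This first step is just a telescoping sum over the recorded losses. As a trivial sanity check in Python (the loss values here are made up):

```python
# Telescoping decomposition: the total change in training loss equals the
# sum of the per-step changes. losses[t] stands in for hypothetical
# recorded values of L(theta_t) at five checkpoints.
losses = [2.30, 1.95, 1.71, 1.80, 1.52]
per_step = [losses[t] - losses[t - 1] for t in range(1, len(losses))]
assert abs(sum(per_step) - (losses[-1] - losses[0])) < 1e-12
```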
And then to decompose each per-step change in loss across specific parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$
I’ve added vector arrows to emphasize that $\theta$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We’ll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:
$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}(\nabla L(\theta)).$$
(This is pretty standard, but I’ve included a derivation at the end.)
Since this is a dot product, it decomposes into a sum over the individual parameters:
$$L(\theta_t) - L(\theta_{t-1}) = \sum_i \left( \theta_t^{(i)} - \theta_{t-1}^{(i)} \right) \text{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$$
So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as
$$A_t^{(i)} = \left( \theta_t^{(i)} - \theta_{t-1}^{(i)} \right) \text{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$$
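In code, this per-parameter bookkeeping is just an elementwise product. Here is a minimal numpy sketch, where `avg_grad` stands in for $\text{Average}_{t-1}^{t}(\nabla L(\theta))$ and all the numbers are made up for illustration:

```python
import numpy as np

def per_parameter_contributions(theta_prev, theta_next, avg_grad):
    """A_t^(i) = (theta_t^(i) - theta_{t-1}^(i)) * Average(grad L)^(i)."""
    return (theta_next - theta_prev) * avg_grad

# Hypothetical parameter values before/after one step, and a hypothetical
# average gradient along the path between them.
theta_prev = np.array([0.5, -1.2, 2.0])
theta_next = np.array([0.4, -1.0, 1.9])
avg_grad   = np.array([1.0, -2.0, 0.5])

A = per_parameter_contributions(theta_prev, theta_next, avg_grad)
print(A)        # contribution of each individual parameter
print(A.sum())  # the total change in loss, L(theta_t) - L(theta_{t-1})
```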
So based on this, I’m going to define my own version of LCA, called $\text{LCA}_{\text{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $\text{LCA}_{\text{Naive}}$ uses the approximation $\text{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving
$$A_{t,\text{Naive}}^{(i)} = \left( \theta_t^{(i)} - \theta_{t-1}^{(i)} \right) G_{t-1}^{(i)}.$$
But the SGD update is given by $\theta_t^{(i)} = \theta_{t-1}^{(i)} - \alpha G_{t-1}^{(i)}$ (where $\alpha$ is the learning rate), which implies that
$$A_{t,\text{Naive}}^{(i)} = \left( -\alpha G_{t-1}^{(i)} \right) G_{t-1}^{(i)} = -\alpha \left( G_{t-1}^{(i)} \right)^2,$$
which is never positive, i.e. it predicts that every parameter learns in every iteration. This isn’t surprising: we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
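To see this concretely, here is a small sketch of $\text{LCA}_{\text{Naive}}$ under plain SGD, with a randomly drawn vector standing in for $G_{t-1}$:

```python
import numpy as np

# LCA_Naive under plain SGD: since theta_t = theta_{t-1} - alpha * G_{t-1},
# every contribution is -alpha * G_{t-1}^2 <= 0, so every parameter is
# predicted to "learn" at every step.
rng = np.random.default_rng(0)
alpha = 0.1
theta_prev = rng.normal(size=5)
G_prev = rng.normal(size=5)               # gradient at theta_{t-1}

theta_next = theta_prev - alpha * G_prev  # SGD update
A_naive = (theta_next - theta_prev) * G_prev

print(A_naive)                            # equals -alpha * G_prev**2
assert np.all(A_naive <= 0)
```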
Yet, the experiments in the paper sometimes show positive LCAs. What’s up with that? There are a few differences between $\text{LCA}_{\text{Naive}}$ and the actual method used in the paper:
1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.
2. $\text{LCA}_{\text{Naive}}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.
3. $\text{LCA}_{\text{Naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (and which gets worse the larger your learning rate / step size is). LCA proper uses multiple gradient estimates between $\theta_{t-1}$ and $\theta_t$ to reduce the approximation error (see the sketch after this list).
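Here is a sketch of what difference 3 amounts to: estimating the average gradient from several points along the straight line between $\theta_{t-1}$ and $\theta_t$, rather than from the single endpoint gradient. The toy loss and all the numbers are my own, and this is not necessarily the paper’s exact scheme:

```python
import numpy as np

def grad_L(theta):
    # Gradient of the toy loss L(theta) = 0.5*||theta||^2 + sum(sin(theta)).
    return theta + np.cos(theta)

def average_gradient(theta_prev, theta_next, n_points=10):
    # Riemann-sum estimate of Average(grad L) along the linear path,
    # evaluated at n_points midpoints in [0, 1].
    ts = (np.arange(n_points) + 0.5) / n_points
    pts = theta_prev + ts[:, None] * (theta_next - theta_prev)
    return np.mean([grad_L(p) for p in pts], axis=0)

theta_prev = np.array([1.0, -0.5])
theta_next = np.array([0.7, -0.3])

A_multi = (theta_next - theta_prev) * average_gradient(theta_prev, theta_next)
A_naive = (theta_next - theta_prev) * grad_L(theta_prev)
print(A_multi)  # multi-point estimate of each parameter's contribution
print(A_naive)  # single-endpoint (naive) estimate, for comparison
```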
I think those are the only differences (though it’s always hard to tell if there’s some unmentioned detail that creates another difference), which means that whenever the paper says “these parameters had positive LCA”, that effect can be attributed to some combination of the above 3 factors.
----
Derivation of turning the path integral into a dot product with an average:
$$L(\theta_t) - L(\theta_{t-1}) = \lim_{n \to \infty} \sum_{k=0}^{n-1} \left( \nabla L(\theta_{t-1} + k \Delta\theta) \cdot \Delta\theta \right), \quad \text{where } \Delta\theta = \frac{1}{n} (\theta_t - \theta_{t-1})$$
$$= \lim_{n \to \infty} n \Delta\theta \cdot \left( \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta) \right)$$
$$= \lim_{n \to \infty} (\theta_t - \theta_{t-1}) \cdot \left( \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta) \right)$$
$$= (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}(\nabla L(\theta)),$$
where the average is defined as $\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta)$.
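As a numerical check, here is a sketch on a toy loss of my own choosing, using the same left-endpoint sum as the derivation; the dot product with the $n$-point average gradient converges to the exact change in loss as $n$ grows:

```python
import numpy as np

def L(theta):
    return 0.5 * np.sum(theta ** 2) + np.sum(np.sin(theta))

def grad_L(theta):
    return theta + np.cos(theta)

theta_prev = np.array([1.0, -2.0, 0.3])
theta_next = np.array([0.5, -1.5, 0.1])
exact = L(theta_next) - L(theta_prev)

for n in [1, 10, 100, 1000]:
    dtheta = (theta_next - theta_prev) / n
    pts = [theta_prev + k * dtheta for k in range(n)]  # left endpoints
    avg_grad = np.mean([grad_L(p) for p in pts], axis=0)
    approx = (theta_next - theta_prev) @ avg_grad
    print(n, approx, exact)  # approx converges to exact as n grows
```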