I will split this into a math reply, and a reply about the big picture / info loss interpretation.
Math reply:
Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.
> I agree that assuming MSE, and looking at a local minimum, you have rank(Hess(L)) = rank(Jf)
This is still false! Edit: I am now confused; I don’t know whether it is false or not.
You are conflating $\nabla_f l(f(\theta))$ and $\nabla_\theta l(f(\theta))$. Adding disambiguation (and writing $L(\theta) = l(f(\theta))$ for the loss as a function of the parameters), we have:

$$\nabla_\theta L(\theta) = \big(\nabla_f l(f(\theta))\big)\, J_\theta f(\theta)$$

$$\operatorname{Hess}_\theta(L)(\theta) = J_\theta f(\theta)^\top \big[\operatorname{Hess}_f(l)(f(\theta))\big]\, J_\theta f(\theta) + \nabla_f l(f(\theta)) \cdot D^2_\theta f(\theta)$$

So we see that the second term disappears if $\nabla_f l(f(\theta)) = 0$. But the critical point condition is $\nabla_\theta l(f(\theta)) = 0$. From the chain rule, we have:

$$\nabla_\theta l(f(\theta)) = \big(\nabla_f l(f(\theta))\big)\, J_\theta f(\theta)$$

So it is possible to have a local minimum where $\nabla_f l(f(\theta)) \neq 0$, if $\nabla_f l(f(\theta))$ is in the left null-space of $J_\theta f(\theta)$. There is a nice qualitative interpretation as well, but I don’t have energy/time to explain it.
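To make this concrete, here is a minimal toy example of my own (not from the thread); the "loss" below is not a realistic one, it just isolates the mechanism:

$$f(\theta) = \theta^2, \qquad l(y) = y, \qquad L(\theta) = l(f(\theta)) = \theta^2.$$

At $\theta = 0$ we have a minimum of $L$, and indeed $\nabla_\theta L(0) = 0$, even though $\nabla_f l(f(0)) = 1 \neq 0$: the nonzero gradient is annihilated because $J_\theta f(0) = 0$. The second Hessian term then survives, giving $\operatorname{Hess}(L)(0) = 2$ of rank 1 while $J_\theta f(0)$ has rank 0, so the rank equality fails at this minimum. (This is not a perfect-behavior global minimum of a regression loss, so it does not contradict the next point.)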
However, if we are at a perfect-behavior global minimum of a regression task, then $\nabla_f l(f(\theta))$ is definitely zero.
A few points about rank equality at a perfect-behavior global min:
- $\operatorname{rank}(\operatorname{Hess}(L)) = \operatorname{rank}(J_\theta f)$ holds as long as $\operatorname{Hess}_f(l)(f(\theta))$ is a diagonal matrix with strictly positive diagonal entries; it need not be a multiple of the identity. (At such a point $\operatorname{Hess}(L) = J^\top D J = (D^{1/2} J)^\top (D^{1/2} J)$, which has the same rank as $J$ because $D^{1/2}$ is invertible.)
- Hence, rank equality holds any time the loss is a sum of functions such that each function only looks at a single component of the behavior.
- If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.
- We can extend to larger outputs by taking the behavior $f$ to be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be the raw logits (before the softmax) and softmax + cross entropy as the loss function, but we can easily fix that by considering the post-softmax probabilities as the behavior instead of the raw logits. (A quick numerical sanity check of the rank equality is sketched below.)
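For what it's worth, here is the quick numerical sanity check I mentioned (my own sketch, not code from the post): generate labels from a tiny tanh network at some $\theta^*$, so that $\theta^*$ is by construction a perfect-behavior global minimum of the MSE loss, then compare the numerical rank of $\operatorname{Hess}(L)(\theta^*)$ with that of $J_\theta f(\theta^*)$. The network size, inputs, random seed, and tolerances are arbitrary choices.

```python
# Sanity check (sketch): at a perfect-behavior global minimum of an MSE regression
# task, compare rank(Hess(L)) with rank(J_theta f).
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)   # double precision for stable rank estimates

D_HID = 3                                   # tiny 1 -> 3 -> 1 tanh network
N_PARAMS = 3 * D_HID + 1                    # W1 (3), b1 (3), w2 (3), b2 (1)
X = jnp.linspace(-1.0, 1.0, 8)              # 8 fixed training inputs

def behavior(theta):
    """f(theta): the network's outputs on the fixed inputs (the 'behavior' vector)."""
    W1, b1, w2, b2 = theta[:3], theta[3:6], theta[6:9], theta[9]
    h = jnp.tanh(jnp.outer(X, W1) + b1)     # shape (8, 3)
    return h @ w2 + b2                      # shape (8,)

theta_star = jax.random.normal(jax.random.PRNGKey(0), (N_PARAMS,))
Y = behavior(theta_star)                    # labels generated by the network itself,
                                            # so theta_star is a perfect-behavior global min

def L(theta):
    return 0.5 * jnp.sum((behavior(theta) - Y) ** 2)   # MSE summed over inputs

H = jax.hessian(L)(theta_star)              # shape (10, 10)
J = jax.jacobian(behavior)(theta_star)      # shape (8, 10)

def num_rank(A, rel_tol):
    s = jnp.linalg.svd(A, compute_uv=False)
    return int(jnp.sum(s > rel_tol * s[0]))

# At the minimum, Hess(L) = J^T J exactly, so its singular values are the squares of
# J's; using tol for J and tol**2 for H keeps the two rank estimates consistent.
tol = 1e-6
print("rank(J_theta f) =", num_rank(J, tol))   # the two printed ranks should agree
print("rank(Hess L)    =", num_rank(H, tol ** 2))
```

Of course this is only a spot check for one architecture and one seed, not a proof.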
Thanks again for the reply.
In my notation, things like $\nabla l$ or $Jf$ are functions in and of themselves. The function $\nabla l$ evaluates to zero at local minima of $l$.
In my notation, there isn’t any such thing as $\nabla_f l$.
But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.