carboniferous_umbraculum comments on Information Loss --> Basin flatness

carboniferous_umbraculum 26 May 2022 12:58 UTC
LW: 3 AF: 2
Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters $θ \in Θ$ can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function $l : O^{k} \to R$ we have:
$L : Θ \xrightarrow{f} O^{k} \xrightarrow{l} R$
i.e. it’s $l$ that might be something like MSE, but the function $L$ is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like
$L (θ) = l (f (θ))$
We want to differentiate this twice with respect to $θ$ essentially. Firstly, we have
$\nabla L (θ) = \nabla l (f (θ)) J f (θ)$
where—just to keep track of this—we’ve got:
$(1 \times N) vector = [(1 \times k) vector] [(k \times N) matrix]$
Or, using ‘coordinates’ to make it explicit:
$\frac{\partial}{\partial θ_{i}} L (θ) = \nabla l (f (θ)) \cdot \frac{\partial f}{\partial θ_{i}} = \sum_{p = 1}^{k} \nabla^{p} l (f (θ)) \cdot \frac{\partial f^{p}}{\partial θ_{i}}$
for $i = 1, \dots, N$ . Then for $j = 1, \dots, N$ we differentiate again:
$\frac{\partial^{2}}{\partial θ_{j} \partial θ_{i}} L (θ) = \sum_{p = 1}^{k} \sum_{q = 1}^{k} \nabla^{q} \nabla^{p} l (f (θ)) \frac{\partial f^{q}}{\partial θ_{j}} \frac{\partial f^{p}}{\partial θ_{i}} + \sum_{p = 1}^{k} \nabla^{p} l (f (θ)) \frac{\partial^{2} f^{p}}{\partial θ_{j} \partial θ_{i}}$
Or,
$\mathrm{Hess} (L) (θ) = J f (θ)^{T} [\mathrm{Hess} (l) (f (θ))] J f (θ) + \nabla l (f (θ)) D^{2} f (θ)$
This is now at the level of $(N \times N)$ matrices. Avoiding getting into any depth about tensors and indices, the $D^{2} f$ term is basically a $(N \times N \times k)$ tensor-type object and it’s paired with $\nabla l$ which is a $(1 \times k)$ vector to give something that is $(N \times N)$ .
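As a sanity check on the two-term formula, here is a minimal sketch that compares it against finite differences of $L$. The maps `f` and `l`, the point `theta`, and all helper names are made up for illustration (with $N = k = 2$); nothing here comes from the original post.

```python
# Toy check of Hess(L) = Jf^T [Hess(l)] Jf + grad(l) D^2 f with N = k = 2.
# The maps f and l and the point theta are illustrative assumptions.

def f(t):
    return [t[0] * t[1], t[0] + t[1] ** 2]

def l(o):
    return o[0] ** 2 + 3 * o[1] ** 2 + o[0] * o[1]

def L(t):
    return l(f(t))

def jac_f(t):
    # (k x N) Jacobian: row p holds the partial derivatives of f^p.
    return [[t[1], t[0]], [1.0, 2 * t[1]]]

def grad_l(o):
    return [2 * o[0] + o[1], o[0] + 6 * o[1]]

HESS_l = [[2.0, 1.0], [1.0, 6.0]]          # constant because l is quadratic
D2F = [[[0.0, 1.0], [1.0, 0.0]],           # second derivatives of f^1
       [[0.0, 0.0], [0.0, 2.0]]]           # second derivatives of f^2

def hess_L_formula(t):
    J, g = jac_f(t), grad_l(f(t))
    H = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        for i in range(2):
            first = sum(HESS_l[q][p] * J[q][j] * J[p][i]
                        for p in range(2) for q in range(2))
            second = sum(g[p] * D2F[p][j][i] for p in range(2))
            H[j][i] = first + second
    return H

def hess_L_fd(t, h=1e-4):
    # Central second differences of L, used only as an independent check.
    H = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        for i in range(2):
            def at(sj, si):
                s = list(t)
                s[j] += sj * h
                s[i] += si * h
                return L(s)
            H[j][i] = (at(1, 1) - at(1, -1) - at(-1, 1) + at(-1, -1)) / (4 * h * h)
    return H

theta = [0.7, -0.4]
exact = hess_L_formula(theta)
approx = hess_L_fd(theta)
max_err = max(abs(exact[j][i] - approx[j][i]) for j in range(2) for i in range(2))
print(max_err)
```

The point `theta` is generic, so $\nabla l (f (θ)) \neq 0$ there and both terms of the formula contribute.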
So what I think you are saying now is that if we are at a local minimum for $l$ , then the second term on the right-hand side vanishes (because the term includes the first derivatives of $l$ , which are zero at a minimum). You can see however that if the Hessian of $l$ is not a multiple of the identity (as it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of $l$ , the Hessian of the loss is equal to a constant times $(J f)^{T} J f$ . So maybe you really do want to explicitly assume something like MSE.
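To make the non-proportionality concrete, here is a small sketch under assumed toy choices: a linear $f$ (so $D^{2} f = 0$) and $l (o) = o_{1}^{2} + 3 o_{2}^{2}$, whose Hessian is $\mathrm{diag} (2, 6)$ rather than a multiple of the identity. The matrices and helper names are hypothetical.

```python
# Linear toy behaviour f(theta) = J theta, so Jf is constant and D^2 f = 0.
J = [[1.0, 1.0],
     [0.0, 1.0]]

# l(o) = o_1^2 + 3 o_2^2 is minimised at o = 0, where Hess(l) = diag(2, 6).
HESS_l = [[2.0, 0.0],
          [0.0, 6.0]]

def matmul(A, B):
    return [[sum(A[r][m] * B[m][c] for m in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

JT = transpose(J)
hess_L = matmul(matmul(JT, HESS_l), J)   # Hess(L) at the minimum theta = 0
gram = matmul(JT, J)                     # (Jf)^T Jf

# If hess_L were c * gram, every entrywise ratio would be the same c.
ratios = [hess_L[r][c] / gram[r][c] for r in range(2) for c in range(2)]
print(hess_L, gram, ratios)
```

Here `hess_L` is `[[2, 2], [2, 8]]` and `gram` is `[[1, 1], [1, 2]]`, so the entrywise ratios are 2, 2, 2 and 4: no single constant works. Both matrices still have rank 2 in this toy case; only the proportionality fails.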

I agree that assuming MSE, and looking at a local minimum, you have $\mathrm{rank} (\mathrm{Hess} (L)) = \mathrm{rank} (J f)$ .
(In case it’s of interest to anyone, googling turned up this recent paper https://openreview.net/forum?id=otDgw7LM7Nn which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: “Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.”)

Some broader remarks: I think these are nice observations but unfortunately I think generally I’m a bit confused/unclear about what else you might get out of going along these lines. I don’t want to sound harsh but just trying to be plain: This is mostly because, as we can see, the mathematical part of what you have said is all very simple, well-established facts about smooth functions and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out from it. In a similar vein, regarding the “cause” of low-rank G, I do think that one could try to bring in a notion of “information loss” in neural networks, but for it to be substantive one needs to be careful that it’s not simply a rephrasing of what it means for the Jacobian to have less-than-full rank. Being a bit loose/informal now: To illustrate, just imagine for a moment a real-valued function on an interval. I could say it ‘loses information’ where its values cannot distinguish between a subset of points. But this is almost the same as just saying: It is constant on some subset, which is of course very close to just saying the derivative vanishes on some subset. Here, if you describe the phenomenon of information loss as concretely as being the situation where some inputs can’t be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you’ve more or less just built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don’t think it’s somehow incorrect to point to this but it becomes more like just saying ‘perhaps one useful definition of information loss is low rank G’ as opposed to linking one phenomenon to the other.

Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!
- Vivek Hebbar 26 May 2022 22:48 UTC
  LW: 1 AF: 1
  I will split this into a math reply, and a reply about the big picture / info loss interpretation.
  Math reply:
  Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.
  I agree that assuming MSE, and looking at a local minimum, you have $\mathrm{rank} (\mathrm{Hess} (L)) = \mathrm{rank} (J f)$
  This is still false! Edit: I am now confused, I don’t know if it is false or not.
  You are conflating $\nabla_{f} l (f (θ))$ and $\nabla_{θ} l (f (θ))$ . Adding disambiguation, we have:
  $\nabla_{θ} L (θ) = (\nabla_{f} l (f (θ))) J_{θ} f (θ)$
  $\mathrm{Hess}_{θ} (L) (θ) = J_{θ} f (θ)^{T} [\mathrm{Hess}_{f} (l) (f (θ))] J_{θ} f (θ) + \nabla_{f} l (f (θ)) D_{θ}^{2} f (θ)$
  So we see that the second term disappears if $\nabla_{f} l (f (θ)) = 0$ . But the critical point condition is $\nabla_{θ} l (f (θ)) = 0$ . From the chain rule, we have:
  $\nabla_{θ} l (f (θ)) = (\nabla_{f} l (f (θ))) J_{θ} f (θ)$
  So it is possible to have a local minimum where $\nabla_{f} l (f (θ)) \neq 0$ , if $\nabla_{f} l (f (θ))$ is in the left null-space of $J_{θ} f (θ)$ . There is a nice qualitative interpretation as well, but I don’t have energy/time to explain it.
  However, if we are at a perfect-behavior global minimum of a regression task, then $\nabla_{f} l (f (θ))$ is definitely zero.
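The distinction above can be illustrated with a one-parameter toy, sketched here under assumed choices that are not from the post: behavior $f (θ) = (θ, θ^{2})$ , target $y = (0, 1)$ and squared-error loss $l (o) = \| o - y \|^{2}$ . The target is not on the curve traced by $f$ , so the minimum of $L (θ) = θ^{4} - θ^{2} + 1$ at $θ = 1 / \sqrt{2}$ is not a perfect-behavior minimum.

```python
import math

# Hypothetical one-parameter model (N = 1, k = 2) with an unreachable target.
Y = (0.0, 1.0)

def f(t):
    return (t, t * t)

def grad_l(o):
    # Gradient of l(o) = |o - Y|^2 with respect to the behaviour o.
    return (2 * (o[0] - Y[0]), 2 * (o[1] - Y[1]))

def jac_f(t):
    # (k x N) Jacobian of f, stored as a length-2 column.
    return (1.0, 2 * t)

def d2f(t):
    # Second derivative of each component of f with respect to theta.
    return (0.0, 2.0)

theta = math.sqrt(0.5)        # minimiser of L(theta) = theta^4 - theta^2 + 1
g = grad_l(f(theta))          # gradient with respect to behaviour
J = jac_f(theta)

grad_L = g[0] * J[0] + g[1] * J[1]                 # gradient with respect to theta
first_term = 2 * (J[0] ** 2 + J[1] ** 2)           # Jf^T (2 I) Jf
second_term = g[0] * d2f(theta)[0] + g[1] * d2f(theta)[1]
hess_L = first_term + second_term

print(g, grad_L, first_term, second_term, hess_L)
```

In this sketch the behavior-gradient is roughly $(1.414, -1)$ , which is nonzero but orthogonal to the Jacobian column $(1, 1.414)$ , so the parameter-gradient vanishes. The second Hessian term is $-2$ rather than zero, and the two terms sum to $L'' (θ) = 12 θ^{2} - 2 = 4$ .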
  A few points about rank equality at a perfect-behavior global min:
  1. $\mathrm{rank} (\mathrm{Hess} (L)) = \mathrm{rank} (J f)$ holds as long as $\mathrm{Hess} (l) (f (θ))$ is a diagonal matrix. It need not be a multiple of the identity.
  2. Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior.
  3. If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.
  4. We can extend to larger outputs by having the behavior $f$ be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits.
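The contrast in the last point can be checked numerically. The sketch below uses the standard closed forms for the Hessian of cross-entropy: with respect to probabilities it is $\mathrm{diag} (y_{i} / p_{i}^{2})$ (treating the components of $p$ as independent variables, which is an assumption of this sketch), and with respect to logits it is $\mathrm{diag} (p) - p p^{T}$ when the labels sum to 1. The logits `z`, the soft labels `y` and the helper names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)                                # subtract max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def hess_ce_wrt_probs(p, y):
    # l(p) = -sum_i y_i log p_i, components of p treated as independent:
    # d^2 l / dp_i dp_j = y_i / p_i^2 if i == j, else 0.
    n = len(p)
    return [[y[i] / p[i] ** 2 if i == j else 0.0 for j in range(n)]
            for i in range(n)]

def hess_ce_wrt_logits(p):
    # l(z) = -sum_i y_i log softmax(z)_i with sum(y) = 1:
    # d^2 l / dz_i dz_j = p_i [i == j] - p_i p_j (independent of y).
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def is_diagonal(H, tol=1e-12):
    return all(abs(H[i][j]) <= tol
               for i in range(len(H)) for j in range(len(H)) if i != j)

z = [1.0, 0.0, -1.0]          # made-up logits
y = [0.2, 0.5, 0.3]           # made-up soft labels summing to 1
p = softmax(z)

H_probs = hess_ce_wrt_probs(p, y)
H_logits = hess_ce_wrt_logits(p)
row_sums = [sum(row) for row in H_logits]

print(is_diagonal(H_probs), is_diagonal(H_logits), row_sums)
```

With probabilities as the behavior the Hessian of $l$ is diagonal, while with logits it has off-diagonal entries $-p_{i} p_{j}$ and its rows sum to zero, so the all-ones vector lies in its kernel.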
  - carboniferous_umbraculum 27 May 2022 8:29 UTC
    LW: 3 AF: 1
    Thanks again for the reply.
    
    In my notation, something like $\nabla l$ or $J f$ are functions in and of themselves. The function $\nabla l$ evaluates to zero at local minima of $l$ .
    
    In my notation, there isn’t any such thing as $\nabla_{f} l$ .
    
    But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.