However, is it correct that we need the “underlying truth” to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
The definition of the Fisher information matrix does not refer to the truth q(y,x) at all. (Note that in the definition I provide I am assuming the supervised-learning case where we know the input distribution q(x), so the model is p(y,x|w)=p(y|x,w)q(x), which is why q(x) shows up in the formula I linked to. The derivative terms do not explicitly include q(x) because it vanishes under the wj derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the true conditional distribution q(y|x) in q(y,x)=q(y|x)q(x).)
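To make this concrete, here is a sketch of the Fisher information matrix under these assumptions (the supervised case with known q(x); the index notation I_{jk} is mine):

```latex
I_{jk}(w) \;=\; \int q(x) \int p(y \mid x, w)\,
    \frac{\partial \log p(y \mid x, w)}{\partial w_j}\,
    \frac{\partial \log p(y \mid x, w)}{\partial w_k}\;
    dy\, dx
```

Note that q(x) enters only as an outer expectation over inputs, and the true conditional q(y|x) appears nowhere: every term involving y is evaluated under the model p(y|x,w) itself.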
What do you mean with non-weight-annihilation here? Don’t the weights annihilate in both pictures?
You’re right, that’s sloppy terminology on my part. What I mean is: in the right-hand picture (which I originally labelled WA), there is a region in which all nodes are active but cancel out to give zero effective gradient, which is markedly different from the left-hand picture. I have relabelled these NonWC and WC to clarify, thanks!
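A minimal numerical sketch of what I mean by weight cancellation, using a hypothetical two-node tanh network f(x) = a1·tanh(b1·x) + a2·tanh(b2·x) (the specific parameter values are just for illustration):

```python
import numpy as np

# Hypothetical two-node tanh network: f(x) = a1*tanh(b1*x) + a2*tanh(b2*x).
# Weight cancellation (WC): choose a2 = -a1 and b2 = b1 so that both nodes
# are individually active, yet their contributions cancel exactly,
# leaving the network function identically zero.
a1, b1 = 1.5, 0.7
a2, b2 = -a1, b1

x = np.linspace(-3.0, 3.0, 101)
node1 = a1 * np.tanh(b1 * x)
node2 = a2 * np.tanh(b2 * x)
f = node1 + node2

print(np.abs(node1).max() > 0)  # each node is active on its own
print(np.allclose(f, 0.0))      # but the sum cancels everywhere
```

This is the situation in the right-hand (WC) picture: neither node is dead, so this degeneracy is distinct from simply having inactive nodes as in the left-hand picture.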