Thanks, Liam, for this nice post too! The explanations were quite clear.
The property of being singular is specific to a model class f(x,w), regardless of the underlying truth.
This holds for singularities that come from symmetries where the model doesn’t change. However, is it correct that we need the “underlying truth” to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
Both configurations, non-weight-annihilation (left) and weight-annihilation (right)
What do you mean by non-weight-annihilation here? Don’t the weights annihilate in both pictures?
However, is it correct that we need the “underlying truth” to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
The definition of the Fisher information matrix does not refer to the truth q(y,x) whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution q(x), meaning the model is p(y,x|w)=p(y|x,w)q(x), which is why q(x) shows up in the formula I just linked to. The derivative terms do not explicitly include q(x) because it vanishes under the w_j derivative anyway, so it’s irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution q(y|x) in q(y,x)=q(y|x)q(x).)
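To spell out why q(x) drops out of the derivative terms, here is a short sketch. It assumes the standard definition of the Fisher information matrix for the supervised model p(y,x|w)=p(y|x,w)q(x); the notation in the linked formula may differ slightly.

```latex
% Assumed: standard Fisher information for p(y,x|w) = p(y|x,w) q(x).
% The expectation is taken under the model, not under the truth q(y|x).
\begin{align}
\log p(y,x|w) &= \log p(y|x,w) + \log q(x), \\
\frac{\partial}{\partial w_j} \log p(y,x|w) &= \frac{\partial}{\partial w_j} \log p(y|x,w)
  \quad \text{since } \frac{\partial}{\partial w_j} \log q(x) = 0, \\
I_{jk}(w) &= \iint \frac{\partial \log p(y|x,w)}{\partial w_j}\,
  \frac{\partial \log p(y|x,w)}{\partial w_k}\; p(y|x,w)\, q(x)\, dx\, dy .
\end{align}
```

The only place q appears is as the input distribution q(x) in the integration measure; the conditional truth q(y|x) does not enter anywhere.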
What do you mean by non-weight-annihilation here? Don’t the weights annihilate in both pictures?
You’re right, that’s sloppy terminology from me. What I mean is that in the right-hand picture (which I originally labelled WA) there is a region in which all nodes are active but cancel out to give zero effective gradient, which is markedly different from the left-hand picture. I have edited the labels to NonWC and WC to clarify, thanks!
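As a rough illustration of that distinction, here is a minimal sketch with a hypothetical two-node ReLU network. The weights, the helper names, and the reading of "effective gradient" as the slope of the output with respect to the input are all assumptions made for this example; it does not reproduce the figures in the post.

```python
# Hypothetical toy network: f(x) = sum_i q_i * relu(w_i * x + b_i).
# Each node is a tuple (w, b, q); the values below are illustrative assumptions.

def relu(z):
    return max(z, 0.0)

def net(x, nodes):
    """Network output at input x."""
    return sum(q * relu(w * x + b) for (w, b, q) in nodes)

def input_slope(x, nodes):
    """Slope of f with respect to x, plus which nodes are active at x."""
    active = [w * x + b > 0 for (w, b, q) in nodes]
    # Only active nodes contribute q_i * w_i to the slope.
    slope = sum(q * w for (w, b, q), a in zip(nodes, active) if a)
    return slope, active

# Two nodes with equal input weights and opposite output weights.
NODES = [(1.0, 0.0, 1.0), (1.0, -1.0, -1.0)]

for x in (-1.0, 0.5, 2.0):
    slope, active = input_slope(x, NODES)
    print(f"x={x:+.1f}  active={active}  slope={slope}")
```

In this toy setup the slope is zero at x=-1.0 because no node is active, while at x=2.0 it is zero even though both nodes are active, because their contributions q_i * w_i cancel. The second case is the kind of region the WC label is meant to pick out.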