I think these are helpful clarifying questions and comments from Leon. I saw Liam’s response. I can add to some of Liam’s answers about some of the definitions of singular models and singularities.
1. Conditions of regularity: Identifiability vs. regular Fisher information matrix
Liam: A regular statistical model class is one which is identifiable (so $p(y|x, w_1) = p(y|x, w_2)$ implies that $w_1 = w_2$), and has positive definite Fisher information matrix $I(w)$ for all $w \in W$.
Leon: The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn’t show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren’t that interesting, and so you maybe don’t even want to call them singular?
As Liam said, I think the answer is yes: the emphasis of singular learning theory is on the degenerate Fisher information matrix (FIM) case. Strictly speaking, all three classes of models (regular, non-identifiable, degenerate FIM) are “singular”, as “singular” is defined by Watanabe. But the emphasis is definitely on the ‘more’ singular models (with degenerate FIM), which are the most complex case and include neural networks.
As for non-identifiability being uninteresting: as I understand it, non-regularity arising from certain kinds of non-local non-identifiability can be easily dealt with by re-parametrising the model, by restricting consideration to some neighbourhood of (one copy of) the true parameter, or by similar tricks. So the statistics of learning in these models is not, strictly speaking, regular to begin with, but we can still get away with regular statistics by applying such tricks.
Liam mentions the permutation symmetries in neural networks as an example. To clarify, this symmetry usually creates a discrete set of equivalent parameters that are separated from each other in parameter space. But the posterior will also be reflected along these symmetries so you could just get away with considering a single ‘slice’ of the parameter space where every function is represented by at most one parameter (if this were the only source of non-identifiability—it turns out that’s not true for neural networks).
It’s worth noting that these tricks don’t generally apply to models with local non-identifiability. Roughly, local non-identifiability means there are extra true parameters in every neighbourhood of some true parameter. However, local non-identifiability implies that the FIM is degenerate at that true parameter, so again we are back in the degenerate FIM case.
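To make the last point concrete, here is a minimal sketch using a toy regression model that is my own choice for illustration, not something from Liam's post: $f(x, w) = w_1 \tanh(w_2 x)$ with unit-variance Gaussian noise. For that kind of model the FIM reduces to the average over inputs of the outer product of $\nabla_w f$ with itself. The input grid `xs` and the helper names are assumptions too. On the line $w_1 = 0$ every $w_2$ gives the same (zero) function, so there are extra true parameters in every neighbourhood, and the FIM determinant vanishes there.

```python
import math

def grad_f(x, w):
    """Gradient of the toy model f(x, w) = w1 * tanh(w2 * x) with respect to (w1, w2)."""
    w1, w2 = w
    t = math.tanh(w2 * x)
    return (t, w1 * x * (1.0 - t * t))

def fisher_information(w, xs):
    """2x2 FIM for unit-variance Gaussian regression, averaged over the inputs xs."""
    a = b = d = 0.0
    for x in xs:
        g1, g2 = grad_f(x, w)
        a += g1 * g1
        b += g1 * g2
        d += g2 * g2
    n = len(xs)
    return ((a / n, b / n), (b / n, d / n))

def det2(m):
    """Determinant of a 2x2 matrix given as nested tuples."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Assumed input distribution: a uniform grid on [-2, 2].
xs = [i / 10.0 for i in range(-20, 21)]

# On the line w1 = 0 the model is locally non-identifiable: FIM is degenerate.
det_on_line = det2(fisher_information((0.0, 1.0), xs))
# At a generic parameter the two gradient components are independent functions of x.
det_generic = det2(fisher_information((1.0, 1.0), xs))
print(det_on_line, det_generic)
```

The first determinant is exactly zero because the $w_2$-component of the gradient vanishes for every input when $w_1 = 0$; the second is strictly positive.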
2. Linear independence condition on Fisher information matrix degeneracy
Leon: What is $x$ in this formula [“$\{\frac{\partial}{\partial w_j} f(x, w)\}_{j=1}^{d}$ is linearly independent”]? Is it fixed? Or do we average the derivatives over the input distribution?
Yeah, I remember also struggling to parse this statement when I first saw it. Liam answered, but in case it’s still not clear and/or someone doesn’t want to follow up in Liam’s thesis: $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$.
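Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let $f(x, w) = w_1 x + 2 w_2 x$ so that $\frac{\partial}{\partial w_1} f(x, w) = x$ and $\frac{\partial}{\partial w_2} f(x, w) = 2x$. Then let $g$ and $h$ be functions such that $g(x) = x$ and $h(x) = 2x$. Then the set of functions $\{g, h\}$ is a linearly dependent set of functions because $2g - h = 0$.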
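A quick numerical way to see what "linear dependence of functions of $x$" means is to compare a dependent pair with an independent one. The sketch below takes $g(x) = x$ and $h(x) = 2x$ as an assumed concrete dependent pair, and $x$ with $x^2$ as an assumed independent pair; the sample grid and the function name `gram_determinant` are hypothetical. It computes the determinant of the 2x2 Gram matrix of the two functions over the sample points. A zero determinant only shows dependence on the sampled points, so this is a sanity check rather than a proof.

```python
def gram_determinant(g, h, xs):
    """Determinant of the Gram matrix of g and h over the sample points xs."""
    gg = sum(g(x) * g(x) for x in xs)
    gh = sum(g(x) * h(x) for x in xs)
    hh = sum(h(x) * h(x) for x in xs)
    return gg * hh - gh * gh

# Integer sample points keep the arithmetic exact.
xs = list(range(-5, 6))

# h = 2g for every x, so the pair is linearly dependent.
dependent = gram_determinant(lambda x: x, lambda x: 2 * x, xs)
# No constants (a, b) other than (0, 0) make a*x + b*x^2 vanish for all x.
independent = gram_determinant(lambda x: x, lambda x: x * x, xs)
print(dependent, independent)  # prints: 0 215380
```

Note that nothing is averaged away here: the inputs only serve as points at which the functions are compared, which matches the reading of $x$ as a free variable.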
3. Singularities vs. visually obvious singularities (self-intersecting curves)
Leon: One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse’s post, they often “look singular”: i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn’t seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change [sic: “derivative is zero”, or “loss does not change”, right?]. Is this correct?
Right, as Liam said, often[1] in SLT we are talking about singularities of the Kullback-Leibler loss function $K(w)$. Singularities of a function are defined as points where the function is zero and has zero gradient. Since $K$ is non-negative, all of its zeros are also local (actually global) minima, so they also have zero gradient. Among these singularities, some are ‘more singular’ than others. Liam pointed to the distinction between degenerate singularities and non-degenerate singularities. More generally, we can use the RLCT as a measure of ‘how singular’ a singularity is (lower RLCT = more singular).
As for the intuition about visually reasoning about singularities based on the picture of a zero set: I agree this is useful, but one should also keep in mind that it is not sufficient. These curves just show the zero set, but the singularities (and their RLCTs) are defined not just by the shape of the zero set but also by the local shape of the function around the zero set.
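As a small worked illustration of that last sentence, here are two functions with the same zero set but different RLCTs. Both functions, the uniform prior on $[-1, 1]^2$, and the labels $K_1$, $K_2$, $\zeta_i$, $\lambda_i$ are my own choices for illustration; the RLCT is read off as minus the largest pole of the zeta function.

```latex
K_1(w) = w_1^2 w_2^2, \qquad K_2(w) = w_1^4 w_2^4, \qquad
\{K_1 = 0\} = \{K_2 = 0\} = \{ w : w_1 = 0 \text{ or } w_2 = 0 \}.

\zeta_i(z) = \int_{[-1,1]^2} K_i(w)^z \, dw, \qquad
\zeta_1(z) = \left( \frac{2}{2z + 1} \right)^2, \qquad
\zeta_2(z) = \left( \frac{2}{4z + 1} \right)^2.

\text{Largest poles: } z = -\tfrac{1}{2} \text{ and } z = -\tfrac{1}{4},
\quad \text{so} \quad \lambda_1 = \tfrac{1}{2}, \qquad \lambda_2 = \tfrac{1}{4}.
```

So a picture of the two crossing lines cannot tell these two cases apart, even though the second is ‘more singular’ by the RLCT measure.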
Here’s an example that might clarify. Consider two functions $f, g : \mathbb{R}^2 \to \mathbb{R}$ such that $f(x, y) = xy$ and $g(x, y) = x^2 y^2$. Then these functions both have the same zero set $\{(x, y) : x = 0 \text{ or } y = 0\}$. That set has an intersection at the origin. Observe the following:
Both $\frac{\partial f}{\partial x}(0, 0) = 0$ and $\frac{\partial f}{\partial y}(0, 0) = 0$, so the intersection is a singularity in the case of $f$.
The other points on the zero set of $f$ are not singular. E.g. if $x = 0$ but $y \neq 0$, then $\frac{\partial f}{\partial x}(x, y) = y \neq 0$.
Even though $g$ has the exact same zero set, all of its zeros are singular points! Observe $\nabla g(x, y) = (2xy^2, 2x^2y)$, which is zero everywhere on the zero set.
In general, it’s a true intuition that intersections of lines in zero sets correspond to singular points. But this example shows that whether non-intersecting points of the zero set are singular points depends on more than just the shape of the zero set itself.
In singular learning theory, the functions we consider are non-negative (Kullback-Leibler divergence), so you don’t get functions like $f$ with non-critical zeros. However, the same argument about the existence of singularities extends to a warning against judging how singular a singular point is from the shape of the zero set alone: the RLCT will depend on how the function behaves in the neighbourhood, not just on the zero set.
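For anyone who wants to poke at this numerically, here is a minimal sketch that takes $f(x, y) = xy$ and $g(x, y) = x^2 y^2$ as an assumed concrete pair of functions sharing the zero set made of the two axes. The helper names, the finite-difference step, and the tolerance are hypothetical choices for illustration. It checks whether a point is a zero with (numerically) zero gradient.

```python
def f(x, y):
    """Assumed example: changes sign across its zero set."""
    return x * y

def g(x, y):
    """Assumed example: non-negative, same zero set as f."""
    return (x * y) ** 2

def grad(fn, x, y, h=1e-6):
    """Central finite-difference approximation to the gradient of fn at (x, y)."""
    dx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    dy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return dx, dy

def is_singular_zero(fn, x, y, tol=1e-6):
    """True if fn vanishes at (x, y) and its gradient is numerically zero there."""
    dx, dy = grad(fn, x, y)
    return abs(fn(x, y)) < tol and abs(dx) < tol and abs(dy) < tol

for point in [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]:
    print(point, is_singular_zero(f, *point), is_singular_zero(g, *point))
```

At the origin both functions report a singular zero; at the other two points on the axes only `g` does, even though the zero sets are identical.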
[1] One exception, you could say, is in the definition of strictly singular models. There, as we discussed, we had a condition involving the degeneracy of the Fisher information matrix (FIM) at a parameter. Degenerate matrix = non-invertible matrix = also called singular matrix. I think you could call these parameters ‘singularities’ (of the model).
One subtle point in this notion of singular parameter is that the definition of the FIM at a parameter $w$ involves setting the true parameter to $w$ itself. For a fixed true parameter, the set of singularities in the first sense (zeros of the KL loss with respect to that true parameter) will not generally coincide with the set of singularities in this second sense (parameters where the FIM is degenerate).
Alternatively, you could consider the FIM condition in the definition of a non-regular model to be saying “if a model would have degenerate singularities at some parameter if that were the true parameter, then the model is non-regular”.
There is a typo in the transcript. The name of the creator of singular learning theory is “Sumio Watanabe” rather than “Sumio Aranabe”.