Thank you for this wonderful article! I read it fairly carefully and have a number of questions and comments.
where the partition function (or in Bayesian terms the evidence) is given by
$$Z_n = \int_W \varphi(w)\, e^{-n L_n(w)}\, dw.$$
Should I think of this as being equal to $p((Y_i) \mid (X_i))$, and would you call this quantity $p(D_n)$? I was a bit confused since it seems like we’re not interested in the data likelihood, but only the conditional data likelihood under model $p$.
And to be clear: This does not factorize over i because every data point informs w and thereby the next data point, correct?
The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.
But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different “phases” with their own free energy?
there is almost sure convergence $S_n \to S$ as $n \to \infty$ to a constant $S$ that doesn’t depend on $n$, [5]
$$S = \mathbb{E}_X[-\log q(y \mid x)] = -\iint_{\mathbb{R}^{N+M}} q(y,x) \log q(y \mid x)\, dx\, dy,$$
I think the first expression should either be an expectation over $(X, Y)$, or have the conditional entropy $H(Y \mid x)$ within the parentheses.
In the realisable case where $q(y \mid x) = p(y \mid x, w^{(0)})$, the KL divergence is just the Euclidean distance between the model and the truth adjusted for the prior measure on inputs,
$$K(w) = \frac{1}{2} \int_{\mathbb{R}^N} \| f(x,w) - f(x,w^{(0)}) \|^2 \, q(x)\, dx.$$
I briefly tried showing this and somehow failed. I didn’t quite manage to get rid of the integral over y. Is this simple? (You don’t need to show me how it’s done, but maybe mentioning the key idea could be useful)
A regular statistical model class is one which is identifiable (so $p(y \mid x, w_1) = p(y \mid x, w_2)$ implies that $w_1 = w_2$), and has positive definite Fisher information matrix $I(w)$ for all $w \in W$.
The rest of the article seems to mainly focus on the case of a degenerate Fisher information matrix. In particular, you didn’t show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren’t that interesting, and so you maybe don’t even want to call them singular? I found this slightly ambiguous, also because under your definitions further down, it seems like “singular” (degenerate Fisher information matrix) is a stronger condition than “strictly singular” (degenerate Fisher information matrix OR non-injective map from parameters to distributions).
It can be easily shown that, under the regression model, $I(w^{(0)})$ is degenerate if and only if the set of derivatives
$$\left\{ \frac{\partial}{\partial w_j} f(x,w) \right\}_{j=1}^d$$
is linearly dependent.
What is x in this formula? Is it fixed? Or do we average the derivatives over the input distribution?
Since every true parameter is a degenerate singularity[9] of K(w), it cannot be approximated by a quadratic form.
Hhm, I thought having a singular model just means that some singularities are degenerate.
One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse’s post, they often “look singular”: i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn’t seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?
I think you forgot a $|_{w=w_0}$ in the term of degree 1.
In that case, the second term involving $\frac{\partial \varphi(w)}{\partial w}$ vanishes since it is the first central moment of a normal distribution
Could you explain why that is? I may have missed some assumption on φ(w) or not paid attention to something.
In this case, since $K(0, w_2) = 0$ for all $w_2$, we could simply throw out the free parameter $w_2$ and define a regular model with $d_1 = 1$ parameters that has identical geometry $K(w_1) = w_1^2$, and therefore defines the same input-output function, $f(x, (w_1, w_2)) = f(x, w_1)$.
Hhm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?
Then the dimension $d_1$ arises as the scaling exponent of $\varepsilon^{\frac{1}{2}}$, which can be extracted via the following ratio of volumes formula for some $a \in (0,1)$:
$$d_1 = 2 \lim_{\varepsilon \to 0} \frac{\log(V(a\varepsilon)) / \log(V(\varepsilon))}{\log a}.$$
This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities.
Are you sure this is the correct formula? When I tried computing this by hand it resulted in 2/log(a), but maybe I made a mistake.
General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters $\lambda$ around $w_0$, the more $K(w)$ blows up around $w_0$ in all directions because we get variation in all directions, and so the smaller the region where $K(w)$ is below $\epsilon$. So $\lambda$ contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small $\epsilon$.
So, in this case the global RLCT is $\lambda = \lambda_0$, which we will see in DSLT2 means that the posterior is most concentrated around the singularity $w_0^{(0)}$.
Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What’s the state of the theory regarding this? (If this is answered in later posts, feel free to just refer to them)
Also, I wonder whether this could be studied experimentally even if the theory is not yet ready: one could probably measure the RLCT around minimal loss points by measuring volumes, and then just check whether gradient descent actually ends up in low-RLCT regions. Maybe this is what you do in later posts. If this is the case, I wonder whether I should be surprised or not: it seems like the lower the RLCT, the larger the number of (fractional) directions where the loss is minimal, and so the larger the basin. So for purely statistical reasons, one may end up in such a region instead of isolated loss-minimizing points of high RLCT.
Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.
Should I think of this as being equal to $p((Y_i) \mid (X_i))$, and would you call this quantity $p(D_n)$? I was a bit confused since it seems like we’re not interested in the data likelihood, but only the conditional data likelihood under model $p$.
The partition function is equal to the model evidence $Z_n = p(D_n)$, yep. It isn’t equal to $p((Y_i) \mid (X_i))$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),
$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i, x_i \mid w)\, dw$$
and then under this supervised learning setup where we know $q(x_i)$, we have $p(y_i, x_i \mid w) = p(y_i \mid x_i, w)\, q(x_i)$. Also note that this does “factor over $i$” (if I’m interpreting you correctly) since the data is independent and identically distributed.
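If a concrete picture helps any readers: below is a minimal numerical sketch of $Z_n$ as exactly this normalising constant, for a toy one-parameter model. The model $f(x,w) = wx$, the unit-variance Gaussian noise, the standard normal prior and input distribution, and the true parameter $w_0 = 0.5$ are all my own choices for illustration, not anything from the post.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup (my own choices): f(x, w) = w * x with unit Gaussian noise,
# q(x) = N(0, 1), prior phi(w) = N(0, 1), true parameter w0 = 0.5.
n, w0 = 20, 0.5
x = rng.normal(size=n)           # x_i ~ q(x)
y = w0 * x + rng.normal(size=n)  # y_i ~ q(y | x) = p(y | x, w0)

# Grid over the parameter space W for brute-force quadrature.
w = np.linspace(-5.0, 5.0, 4001)
dw = w[1] - w[0]

# log [ phi(w) * prod_i p(y_i, x_i | w) ], where p(y_i, x_i | w) = p(y_i | x_i, w) q(x_i).
log_lik = np.array([norm.logpdf(y, loc=wi * x).sum() for wi in w])
log_qx = norm.logpdf(x).sum()    # the q(x_i) factors do not depend on w
log_integrand = norm.logpdf(w) + log_lik + log_qx

# Z_n = p(D_n): the evidence, i.e. the normalising constant of the posterior over w.
Z_n = np.sum(np.exp(log_integrand)) * dw
posterior = np.exp(log_integrand) / Z_n

print("Z_n =", Z_n)
print("posterior mass =", np.sum(posterior) * dw)  # ~ 1.0
```

Note that the $q(x_i)$ factors enter only as a constant multiplying the integrand, which is why they drop out of the posterior itself.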
But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different “phases” with their own free energy?
Yep, you caught me—I was one step ahead. The free energy over the whole space W is still a very useful quantity as it tells you “how good” the best model in the model class is. But Fn by itself doesn’t tell you much about what else is going on in the loss landscape. For that, you need to localise to smaller regions and analyse their phase structure, as presented in DSLT2.
I think the first expression should either be an expectation over $(X, Y)$, or have the conditional entropy $H(Y \mid x)$ within the parentheses.
Ah, yes, you are right—this is a notational hangover from my thesis where I defined $\mathbb{E}_X$ to be equal to expectation with respect to the true distribution $q(y,x)$. (Things get a little bit sloppy when you have this known $q(x)$ floating around everywhere—you eventually just make a few calls on how to write the cleanest notation, but I agree that in the context of this post it’s a little confusing so I apologise).
I briefly tried showing this and somehow failed. I didn’t quite manage to get rid of the integral over y. Is this simple? (You don’t need to show me how it’s done, but maybe mentioning the key idea could be useful)
See Lemma A.2 in my thesis. One uses a fairly standard argument involving the first central moment of a Gaussian.
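For any readers who get stuck at the same point, here is a sketch of that step under the unit-variance Gaussian noise assumption of this setup, writing $f_0(x)$ as shorthand (mine) for $f(x, w^{(0)})$. For a fixed $x$,
$$\int q(y \mid x) \log \frac{q(y \mid x)}{p(y \mid x, w)}\, dy = \mathbb{E}_{y \sim q(\cdot \mid x)}\left[ \tfrac{1}{2}\|y - f(x,w)\|^2 - \tfrac{1}{2}\|y - f_0(x)\|^2 \right],$$
and substituting $y = f_0(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ this becomes
$$\mathbb{E}_{\varepsilon}\left[ \tfrac{1}{2}\|f_0(x) - f(x,w) + \varepsilon\|^2 - \tfrac{1}{2}\|\varepsilon\|^2 \right] = \tfrac{1}{2}\|f(x,w) - f_0(x)\|^2 + \left\langle f_0(x) - f(x,w),\, \mathbb{E}_\varepsilon[\varepsilon] \right\rangle = \tfrac{1}{2}\|f(x,w) - f_0(x)\|^2,$$
since $\mathbb{E}_\varepsilon[\varepsilon] = 0$ is exactly the first central moment mentioned above. Integrating against $q(x)\,dx$ then gives the stated formula for $K(w)$.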
The rest of the article seems to mainly focus on the case of a degenerate Fisher information matrix. In particular, you didn’t show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren’t that interesting, and so you maybe don’t even want to call them singular?
Yep, the rest of the article does focus on the case where the Fisher information matrix is degenerate because it is far more interesting and gives rise to an interesting singularity structure (i.e. most of the time it will yield an RLCT $\lambda < \frac{d}{2}$). Unless my topology is horrendously mistaken, if one has a singular model class for which every parameter has a positive definite Fisher information, then the non-identifiability condition simply means you have a set of isolated points $w_1, \dots, w_n$ that all have the same RLCT $\frac{d}{2}$. Thus the free energy will only depend on their inaccuracy $L_n(w)$, meaning every optimal parameter has the same free energy—not particularly interesting! An example of this would be something like the permutation symmetry of ReLU neural networks that I discuss in DSLT3.
I found this slightly ambiguous, also because under your definitions further down, it seems like “singular” (degenerate Fisher information matrix) is a stronger condition than “strictly singular” (degenerate Fisher information matrix OR non-injective map from parameters to distributions).
I have clarified the terminology in the section where they are defined—thanks for picking me up on that. In particular, a singular model class can be either strictly singular or regular—Watanabe’s results hold regardless of identifiability or the degeneracy of the Fisher information. (Sometimes I might accidentally use the word “singular” to emphasise a model which “has non-regular points”—the context should make it relatively clear).
What is x in this formula? Is it fixed? Or do we average the derivatives over the input distribution?
Refer to Theorem 3.1 and Lemma 3.2 in my thesis. The Fisher information involves an integral with respect to $q(x)\,dx$, so the Fisher information is degenerate iff that set is linearly dependent as functions of $x$, in other words, for all $x$ values in the domain specified by $q(x)$ (well, more precisely, for all non-measure-zero regions as specified by $q(x)$).
Hhm, I thought having a singular model just means that some singularities are degenerate.
Typo—thanks for that.
One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse’s post, they often “look singular”: i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn’t seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?
Correct! When we use the word “singularity”, we are specifically referring to singularities of $K(w)$ in the sense of algebraic geometry, so they are zeroes (or zeroes of a level set), and critical points with $\nabla K(w^{(0)}) = 0$. So, even in regular models, the single optimal parameter is a singularity of $K(w)$ - it’s just a really, really uninteresting one. In SLT, every singularity needs to be put into normal crossing form via the resolution of singularities, regardless of whether it is a singularity in the sense that you describe (drawing self-intersecting curves, looking at cusps, etc.). But for cartoon purposes, those sorts of curves are good visualisation tools.
I think you forgot a $|_{w=w_0}$ in the term of degree 1.
Typo—thanks.
Could you explain why that is? I may have missed some assumption on φ(w) or not paid attention to something.
If you expand that term out you find that
$$\int_W (w - w_0)^T \left.\frac{\partial \varphi}{\partial w}\right|_{w = w_0} \exp\left( -(w - w_0)^T I(w_0) (w - w_0) \right) dw = \left.\frac{\partial \varphi}{\partial w}\right|_{w = w_0} \int_W (w - w_0)^T \exp\left( -(w - w_0)^T I(w_0) (w - w_0) \right) dw = 0$$
because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.
Hhm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?
This is a fair question. Concerning the zeroes: by the formula for $K(w)$ when the truth is realisable, one shows that
$$W_0 = \{ w \mid f(x,w) = f(x,w_0) \},$$
so any path in the set of true parameters (i.e. in this case the set $W_0 = \{ (w_1, w_2) \mid w_1 = 0 \text{ and } w_2 \in \mathbb{R} \}$) will indeed produce the same input-output function. In general (away from the zeroes of $K(w)$), I don’t think this is necessarily true but I’d have to think a bit harder about it. In this pathological case it is, but I wouldn’t get bogged down in it—I’m just saying “$K(w)$ tells us one parameter can literally be thrown out without changing anything about the model”. (Note here that $w_2$ is literally a free parameter across all of $W$).
Are you sure this is the correct formula? When I tried computing this by hand it resulted in 2/log(a), but maybe I made a mistake.
Ah! Another typo—thank you very much. It should be
$$\lambda = 2 \lim_{\varepsilon \to 0} \frac{\log\left( V(a\varepsilon) / V(\varepsilon) \right)}{\log a}.$$
General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters $\lambda$ around $w_0$, the more $K(w)$ blows up around $w_0$ in all directions because we get variation in all directions, and so the smaller the region where $K(w)$ is below $\epsilon$. So $\lambda$ contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small $\epsilon$.
I think that’s a very reasonable intuition to have, yep! Moreover, if one wants to compare the “flatness” between $\frac{1}{10} w^2$ versus $w^4$, the point is that within a small neighbourhood of the singularity, a higher exponent (RLCTs of $\frac{1}{2}$ and $\frac{1}{4}$ respectively here) is “much flatter” than a low coefficient (the $\frac{1}{10}$). This is what the RLCT is picking up.
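To make the volume picture concrete, here is a small Monte Carlo sketch (an entirely toy setup of my own, not from the post) that estimates the scaling exponent $\lim_{\varepsilon \to 0} \log\left(V(a\varepsilon)/V(\varepsilon)\right)/\log a$ for a few of the potentials discussed in this thread, using a uniform measure on a box as the volume.

```python
import numpy as np

rng = np.random.default_rng(0)

def volume(K, eps, dim, n_samples=2_000_000):
    """Monte Carlo estimate of Vol({w in [-1, 1]^dim : K(w) < eps})."""
    w = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    return (2.0 ** dim) * np.mean(K(w) < eps)

def scaling_exponent(K, dim, eps=1e-3, a=0.5):
    """Estimate of lim_{eps -> 0} log(V(a*eps) / V(eps)) / log(a)."""
    return np.log(volume(K, a * eps, dim) / volume(K, eps, dim)) / np.log(a)

# 1D potentials from the flatness comparison: w^2 vs w^4.
print(scaling_exponent(lambda w: w[:, 0] ** 2, dim=1))  # ~ 0.5
print(scaling_exponent(lambda w: w[:, 0] ** 4, dim=1))  # ~ 0.25

# 2D: K(w1, w2) = w1^2 with a free direction w2, vs a regular quadratic in both directions.
print(scaling_exponent(lambda w: w[:, 0] ** 2, dim=2))                 # ~ 0.5
print(scaling_exponent(lambda w: w[:, 0] ** 2 + w[:, 1] ** 2, dim=2))  # ~ 1.0
```

The first two exponents come out near $\frac{1}{2}$ and $\frac{1}{4}$, matching the RLCTs quoted above, and doubling the exponents recovers effective parameter counts in the $d_1$ sense discussed earlier in the thread.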
Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What’s the state of the theory regarding this?
We do expect that SGD is roughly equivalent to sampling from the Bayesian posterior and therefore that it moves towards regions of low RLCT, yes! But this is nonetheless just a postulate for the moment. If one treats K(w) as a Hamiltonian energy function, then you can apply a full-throated physics lens to this entire setup (see DSLT4) and see that the critical points of K(w) strongly affect the trajectories of the particles. Then the connection between SGD and SLT is really just the extent to which SGD is “acting like a particle subject to a Hamiltonian potential”. (A variant called SGLD seems to be just that, so maybe the question is under what conditions / to what extent does SGD = SGLD?). Running experiments that test whether variants of SGD end up in low RLCT regions of K(w) is definitely a fruitful path forward.
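For readers who haven’t met SGLD: here is a minimal sketch of a Langevin-style update on the toy potential $K(w_1, w_2) = w_1^2$ from earlier in the thread. The step size, the inverse temperature, the flat prior, and the use of a full gradient rather than a minibatch gradient are all my own simplifications; it is only meant to convey the “particle in a potential” picture, not any of the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy potential from earlier in the thread: K(w1, w2) = w1^2, whose zero set
# {w1 = 0} is a whole line of minima (a flat / degenerate direction).
def grad_K(w):
    return np.array([2.0 * w[0], 0.0])

def langevin_chain(w_init, n=1000, beta=1.0, lr=1e-4, steps=20_000):
    """Langevin-style updates: a gradient step on n * beta * K(w) plus Gaussian noise
    of variance 2 * lr, so the chain roughly samples from exp(-n * beta * K(w))
    (taking the prior to be flat here, purely for simplicity)."""
    w = np.array(w_init, dtype=float)
    samples = np.empty((steps, 2))
    for t in range(steps):
        w = w - lr * n * beta * grad_K(w) + rng.normal(scale=np.sqrt(2.0 * lr), size=2)
        samples[t] = w
    return samples

samples = langevin_chain([1.0, 0.0])
# The chain is pinned tightly in the curved w1 direction but wanders freely along the
# flat w2 direction, which is the kind of geometry the RLCT is sensitive to.
print("std along w1:", samples[:, 0].std())
print("std along w2:", samples[:, 1].std())
```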
Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:
The partition function is equal to the model evidence $Z_n = p(D_n)$, yep. It isn’t equal to $p((Y_i) \mid (X_i))$ (I assume $i$ is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),
$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i, x_i \mid w)\, dw$$
and then under this supervised learning setup where we know $q(x_i)$, we have $p(y_i, x_i \mid w) = p(y_i \mid x_i, w)\, q(x_i)$. Also note that this does “factor over $i$” (if I’m interpreting you correctly) since the data is independent and identically distributed.
I think I still disagree. I think everything in these formulas needs to be conditioned on the $X$-part of the dataset. In particular, I think the notation $p(D_n)$ is slightly misleading, but maybe I’m missing something here.
I’ll walk you through my reasoning: When I write $(X_i)$ or $(Y_i)$, I mean the whole vectors, e.g., $(X_i)_{i=1,\dots,n}$. Then I think the posterior computation works as follows:
$$p(w \mid D_n) = p(w \mid (Y_i), (X_i)) = \frac{p((Y_i) \mid (X_i), w) \cdot p(w \mid (X_i))}{p((Y_i) \mid (X_i))}.$$
That is just Bayes rule, conditioned on $(X_i)$ in every term. Then, $p(w \mid (X_i)) = \varphi(w)$ because from $X$ alone you don’t get any new information about the conditional $q(Y \mid X)$ (A more formal way to see this is to write down the Bayesian network of the model and to see that $w$ and $X_i$ are d-separated). Also, conditioned on $w$, $p$ is independent over data points, and so we obtain
$$p(w \mid D_n) = \frac{1}{p((Y_i) \mid (X_i))} \cdot e^{-n L_n(w)} \cdot \varphi(w).$$
So, comparing with your equations, we must have $Z_n = p((Y_i) \mid (X_i))$. Do you think this is correct?
Btw., I still don’t think this “factors over i”. I think that
$$Z_n \neq \prod_{i=1}^n p(Y_i \mid X_i).$$
The reason is that old data points should inform the parameter $w$, which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on $w$.
Also, regarding the term involving the derivative of the prior: right, that makes sense, thank you! (I think you missed a factor of $n/2$, but that doesn’t change the conclusion.)
Thanks also for the corrected volume formula, it makes sense now :)
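Returning to the factorisation point: the displayed inequality is easy to check numerically in its conditional form $p((Y_i) \mid (X_i)) \neq \prod_i p(Y_i \mid X_i)$. The toy conjugate setup below is my own, and the $q(x_i)$ factors are left out since they don’t affect the comparison.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy conjugate setup (my own choice): prior w ~ N(0, 1), model y | x, w ~ N(w * x, 1).
n = 10
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

w = np.linspace(-10, 10, 20001)
dw = w[1] - w[0]
prior = norm.pdf(w)

# Conditional evidence p((Y_i) | (X_i)): integrate the *product* likelihood over w once.
joint_lik = np.exp(np.array([norm.logpdf(y, loc=wi * x).sum() for wi in w]))
evidence = np.sum(prior * joint_lik) * dw

# Product of single-point marginals prod_i p(Y_i | X_i): integrate over w separately per point.
marginals = [np.sum(prior * norm.pdf(yi, loc=w * xi)) * dw for xi, yi in zip(x, y)]
product_of_marginals = np.prod(marginals)

print(evidence, product_of_marginals)  # these differ: the evidence does not factor over i
```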
I think these are helpful clarifying questions and comments from Leon. I saw Liam’s response. I can add to some of Liam’s answers about some of the definitions of singular models and singularities.
1. Conditions of regularity: Identifiability vs. regular Fisher information matrix
Liam: A regular statistical model class is one which is identifiable (so $p(y \mid x, w_1) = p(y \mid x, w_2)$ implies that $w_1 = w_2$), and has positive definite Fisher information matrix $I(w)$ for all $w \in W$.
Leon: The rest of the article seems to mainly focus on the case of a degenerate Fisher information matrix. In particular, you didn’t show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren’t that interesting, and so you maybe don’t even want to call them singular?
As Liam said, I think the answer is yes—the emphasis of singular learning theory is on the degenerate Fisher information matrix (FIM) case. Strictly speaking, all three classes of models (regular, non-identifiable, degenerate FIM) are “singular”, as “singular” is defined by Watanabe. But the emphasis is definitely on the ‘more’ singular models (with degenerate FIM) which is the most complex case and also includes neural networks.
As for non-identifiability being uninteresting, as I understand, non-regularity arising from certain kinds of non-local non-identifiability can be easily dealt with by re-parametrising the model or just restricting consideration to some neighbourhood of (one copy of) the true parameter, or by similar tricks. So, the statistics of learning in these models is not strictly-speaking regular to begin with, but we can still get away with regular statistics by applying such tricks.
Liam mentions the permutation symmetries in neural networks as an example. To clarify, this symmetry usually creates a discrete set of equivalent parameters that are separated from each other in parameter space. But the posterior will also be reflected along these symmetries so you could just get away with considering a single ‘slice’ of the parameter space where every function is represented by at most one parameter (if this were the only source of non-identifiability—it turns out that’s not true for neural networks).
It’s worth noting that these tricks don’t generally apply to models with local non-identifiability. Local non-identifiability = roughly, there are extra true parameters in every neighbourhood of some true parameter. However, local non-identifiability implies that the FIM is degenerate at that true parameter, so again we are back in the degenerate FIM case.
2. Linear independence condition on Fisher information matrix degeneracy
Leon: What is $x$ in this formula [“$\{\frac{\partial}{\partial w_j} f(x,w)\}_{j=1}^d$ is linearly independent”]? Is it fixed? Or do we average the derivatives over the input distribution?
Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it’s still not clear and/or someone doesn’t want to follow up in Liam’s thesis, $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$.
Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let $f(x,w) = (w_1 + 2w_2)x$ so that $\frac{\partial}{\partial w_1} f(x,w) = x$ and $\frac{\partial}{\partial w_2} f(x,w) = 2x$. Then let $g$ and $h$ be functions such that $g(x) = x$ and $h(x) = 2x$. Then the set of functions $\{g, h\}$ is a linearly dependent set of functions because $h - 2g = 0$.
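If it helps to see the linear-dependence condition show up as an actual degenerate matrix: under the regression setup with unit-variance Gaussian noise, the Fisher information reduces to $I(w) = \mathbb{E}_x\left[\nabla_w f(x,w)\, \nabla_w f(x,w)^\top\right]$, and for this toy model it has rank one. A small numerical sketch (the choice $q(x) = \mathcal{N}(0,1)$ is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model from above: f(x, w) = (w1 + 2*w2) * x, so df/dw1 = x and df/dw2 = 2x.
x = rng.normal(size=100_000)          # x_i ~ q(x)
grads = np.stack([x, 2 * x], axis=1)  # the functions g(x) = x and h(x) = 2x, sampled

# Monte Carlo estimate of I(w) = E_x[ grad_w f  grad_w f^T ]  (independent of w here).
I_w = grads.T @ grads / len(x)
print(I_w)                                       # approx E[x^2] * [[1, 2], [2, 4]]
print("determinant:", np.linalg.det(I_w))        # 0 up to floating point: degenerate
print("eigenvalues:", np.linalg.eigvalsh(I_w))   # one eigenvalue is ~ 0, because h - 2g = 0
```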
3. Singularities vs. visually obvious singularities (self-intersecting curves)
Leon: One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse’s post, they often “look singular”: i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn’t seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change [sic: ‘derivative is zero’, or ‘loss does not change’, right?]. Is this correct?
Right, as Liam said, often[1] in SLT we are talking about singularities of the Kullback-Leibler loss function. Singularities of a function are defined as points where the function is zero and has zero gradient. Since $K$ is non-negative, all of its zeros are also local (actually global) minima, so they also have zero gradient. Among these singularities, some are ‘more singular’ than others. Liam pointed to the distinction between degenerate singularities and non-degenerate singularities. More generally, we can use the RLCT as a measure of ‘how singular’ a singularity is (lower RLCT = more singular).
As for the intuition about visually reasoning about singularities based on the picture of a zero set: I agree this is useful, but one should also keep in mind that it is not sufficient. These curves just show the zero set, but the singularities (and their RLCTs) are defined not just based on the shape of the zero set but also based on the local shape of the function around the zero set.
Here’s an example that might clarify. Consider two functions $J, K : \mathbb{R}^2 \to \mathbb{R}$ such that $J(x,y) = xy$ and $K(x,y) = x^2 y^2$. Then these functions both have the same zero set $\{(x,y) : x = 0 \vee y = 0\}$. That set has an intersection at the origin. Observe the following:
Both $J(0,0) = 0$ and $\nabla J(0,0) = \vec{0}$, so the intersection is a singularity in the case of $J$.
The other points on the zero set of $J$ are not singular. E.g. if $y = 0$ but $x \neq 0$, then $\nabla J(x,0) = (0,x) \neq \vec{0}$.
Even though $K$ has the exact same zero set, all of its zeros are singular points! Observe $\nabla K(x,y) = (2xy^2, 2x^2y)$, which is zero everywhere on the zero set.
In general, it’s a true intuition that intersections of lines in zero sets correspond to singular points. But this example shows that whether non-intersecting points of the zero set are singular points depends on more than just the shape of the zero set itself.
In singular learning theory, the functions we consider are non-negative (Kullback-Leibler divergence), so you don’t get functions like $J$ with non-critical zeros. However, the same argument here about existence of singularities could be extended to the danger of reasoning about the extent of singularity of singular points based on just looking at the shape of the zero set: the RLCT will depend on how the function behaves in the neighbourhood, not just on the zero set.
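For what it’s worth, the gradient computations in the $J$ versus $K$ example are easy to check symbolically; a tiny sketch (my own addition):

```python
import sympy as sp

x, y = sp.symbols("x y")
J = x * y
K = x**2 * y**2

grad_J = sp.derive_by_array(J, (x, y))  # (y, x)
grad_K = sp.derive_by_array(K, (x, y))  # (2*x*y**2, 2*x**2*y)

# At a zero of both functions away from the intersection, e.g. (x, y) = (3, 0):
print([g.subs({x: 3, y: 0}) for g in grad_J])  # [0, 3]  -> not a singular point of J
print([g.subs({x: 3, y: 0}) for g in grad_K])  # [0, 0]  -> a singular point of K
```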
One exception, you could say, is in the definition of strictly singular models. There, as we discussed, we had a condition involving the degeneracy of the Fisher information matrix (FIM) at a parameter. Degenerate matrix = non-invertible matrix = also called singular matrix. I think you could call these parameters ‘singularities’ (of the model).
One subtle point in this notion of singular parameter is that the definition of the FIM at a parameter w involves setting the true parameter to w. For a fixed true parameter, the set of singularities (zeros of KL loss wrt. that true parameter) will not generally coincide with the set of singularities (parameters where the FIM is degenerate).
Alternatively, you could consider the FIM condition in the definition of a non-regular model to be saying “if a model would have degenerate singularities at some parameter if that were the true parameter, then the model is non-regular”.
Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it’s still not clear and/or someone doesn’t want to follow up in Liam’s thesis, $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$.
Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let $f(x,w) = (w_1 + 2w_2)x$ so that $\frac{\partial}{\partial w_1} f(x,w) = x$ and $\frac{\partial}{\partial w_2} f(x,w) = 2x$. Then let $g$ and $h$ be functions such that $g(x) = x$ and $h(x) = 2x$. Then the set of functions $\{g, h\}$ is a linearly dependent set of functions because $h - 2g = 0$.
Thanks for the answer mfar! Apparently the proof of the thing I was wondering about can be found in Lemma 3.4 in Liam’s thesis. Also thanks for your other comments!