Singular learning theory: exercises
Thanks to Jesse Hoogland and George Wang for feedback on these exercises.
In learning singular learning theory (SLT), I found the subject was often much easier to understand by working through examples, rather than by trying to work through the (fairly technical) theorems in their full generality. These exercises are an attempt to collect the sorts of examples that I worked through to understand SLT.
Before doing these exercises, you should have read the Distilling Singular Learning Theory (DSLT) sequence, watched the SLT summit YouTube videos, or studied something equivalent. DSLT is a good reference to keep open while solving these problems, perhaps alongside Watanabe’s textbook, the Gray Book. Note that some of these exercises cover the basics, which are well-covered in the above distillations, but others cover material that will likely be new to you (because it’s buried deep in a textbook, found only in adjacent literature, etc.).
Exercises are presented mostly in conceptual order: later exercises freely use concepts developed in earlier exercises. Starred (*) exercises are what I consider the most essential exercises, and the ones I recommend you complete first.
*The normal distribution, like most classical statistical models, is a regular (i.e. non-singular[1]) statistical model. A univariate normal model with unit variance and mean $\mu\in\mathbb{R}$ is given by the probability density $p(x\mid\mu)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x-\mu)^2\right)$. Assume the true distribution $q(x)$ of the data is realizable by the model: that is, $q(x)=p(x\mid\mu_0)$ for some true parameter $\mu_0$.
Calculate the Fisher information matrix of this model (note that since we have only a single parameter, the FIM will be a 1x1 matrix). Use this to show the model is regular.
Write an explicit expression for the KL divergence K(μ) between q(x) and p(x|μ), as a function of the parameter μ. This quantity is sometimes also called the population loss. [See Example 1.1, Gray Book, for the case of a 2D normal distribution]
Using K(μ) from b), give an explicit formula for the volume of “almost optimal” parameters, $V(\epsilon)=\int_{\{\mu \,:\, K(\mu)<\epsilon\}}\varphi(\mu)\,d\mu$, where φ(μ) is the prior distribution. For convenience, let φ(μ) be the improper prior φ(μ)=1.
The volume scaling formula for the learning coefficient λ (also known as RLCT[2]) is $\lambda=\lim_{\epsilon\to 0}\frac{\log\left(V(a\epsilon)/V(\epsilon)\right)}{\log a}$ for any a≠1 [Theorem 7.1, Gray Book]. Using this formula, combined with the expression for V(ϵ) derived in b), calculate the learning coefficient[3]. Given that the model is regular, we expect the learning coefficient to be $\frac{d}{2}=\frac{1}{2}$; compare your answer.
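If you want to sanity-check the calculation in 1c)–1d) numerically, here is a minimal sketch of the volume-scaling estimate for this model (assuming numpy; the grid, the choice of a, and ϵ are my own choices, not part of the exercise):

```python
import numpy as np

mu0 = 1.0
K = lambda mu: 0.5 * (mu - mu0) ** 2                # K(mu) from 1b)

def V(eps, grid=np.linspace(-10, 10, 2_000_001)):
    """Volume of {mu : K(mu) < eps} under the improper prior phi(mu) = 1."""
    dmu = grid[1] - grid[0]
    return np.sum(K(grid) < eps) * dmu

a, eps = 2.0, 1e-4
print(np.log(V(a * eps) / V(eps)) / np.log(a))      # should be close to 1/2
```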
*We can make the normal distribution a singular model by changing the parameterization. Let a cubicly-parameterized normal model be the model $p(x\mid\mu)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x-\mu^3)^2\right)$. Assume the true parameter is $\mu_0$.
Show that the cubicly-parameterized normal model is just as expressive as an ordinary normal model: that is, they both can express all univariate normal distributions.
Repeat 1a) with this model; calculate the Fisher information matrix to demonstrate that the model is singular, and find which parameters μ are singular.
Repeat 1b)–1d) to calculate the learning coefficient of this model, for μ0=0, and for μ0≠0.
Recall that the learning coefficient is a volume scaling exponent, such that $V(\epsilon)\propto\epsilon^\lambda$[4] as ϵ→0. Based on this, interpret your results. How does this make the cubicly-parameterized normal model different from the ordinary normal model?
Instead of taking ϵ→0 to get the learning coefficient, fix a small but nonzero value for ϵ, such as ϵ=0.01. Plot V(ϵ) as a function of μ0. As we saw in c), the learning coefficient changes discontinuously at μ0=0: what happens to V(ϵ) as μ0 gets close to zero? What changes if you make ϵ smaller or larger?
Even though the asymptotic learning coefficient (ϵ→0) only changes when μ0=0 exactly, note how the non-asymptotic volume (ϵ finite) is affected in a larger neighborhood.
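A minimal sketch of the computation in 2d) (assuming numpy and matplotlib; the grid and the range of μ0 are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

eps = 0.01
mu = np.linspace(-3, 3, 1_000_001)
dmu = mu[1] - mu[0]

def V(mu0):
    K = 0.5 * (mu**3 - mu0**3) ** 2                 # K(mu) from 2c)
    return np.sum(K < eps) * dmu

mu0s = np.linspace(-1.5, 1.5, 301)
plt.plot(mu0s, [V(m) for m in mu0s])
plt.xlabel("mu0"); plt.ylabel("V(eps) at eps = 0.01")
plt.show()                                          # note the spike near mu0 = 0
```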
*For each of the two different K(μ) formulas obtained in 1b) and 2c), create plots of K(μ) and $e^{-nK(\mu)}$, with μ0 and n as variable sliders (e.g. using Desmos). Note how these functions change as μ0 and n change. Compare results between the K(μ) obtained in 1b) and the K(μ) obtained in 2c).
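If you prefer static plots to Desmos sliders, a sketch like the following works for problem 3 (my own choice of μ0 and n values; K(μ) itself can be plotted the same way):

```python
import numpy as np
import matplotlib.pyplot as plt

mu = np.linspace(-2, 2, 2001)
mu0 = 0.5
K_regular = 0.5 * (mu - mu0) ** 2                   # from 1b)
K_cubic = 0.5 * (mu**3 - mu0**3) ** 2               # from 2c)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for n in [1, 10, 100, 1000]:
    axes[0].plot(mu, np.exp(-n * K_regular), label=f"n={n}")
    axes[1].plot(mu, np.exp(-n * K_cubic), label=f"n={n}")
axes[0].set_title("exp(-n K), regular parameterization")
axes[1].set_title("exp(-n K), cubic parameterization")
for ax in axes:
    ax.legend()
plt.show()
```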
Consider yet another alternative parameterization of the normal distribution, this time with two parameters: $p(x\mid a,b)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x-a-b)^2\right)$. Assume the true distribution is realizable with true parameters $a_0$ and $b_0$.
This type of model is called minimally singular, for reasons that will become clear shortly. Minimally singular models can be dealt with more easily than more general singular models.
Show that this model is just as expressive as an ordinary normal model; they both can express all univariate normal distributions.
Calculate the Fisher information matrix of the model and show that the model is singular. Further, calculate the eigenvectors of the matrix, and interpret them.
Calculate the learning coefficient using the same procedure as 2c). Note that you will need a proper prior φ(a,b) this time, in order for the integral to converge: a uniform prior over the unit square should work, i.e. $\varphi(a,b)=0.25$ if $|a|\le 1$ and $|b|\le 1$, and $\varphi(a,b)=0$ otherwise.
As we saw from part b), this model has one direction in parameter space which changes the distribution p(x|a,b), and one direction in parameter space which does not—and this is true globally. It seems sensible to just ignore the direction in parameter space which is globally redundant. Apply the formula for the learning coefficient of regular models, but use d=1 instead of d=2. Compare your result to part c).
It turns out that the strategy from part d), merely ignoring or quotienting out redundant submanifolds, works for any minimally singular model. However, many singular models, including neural networks, are not minimally singular. Why does this strategy not work for singular models in general? What would stop you from applying it to the model in problem 2), for instance?
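A numerical volume-scaling check for 4c), along the lines of the earlier sketches (the grid resolution and ϵ are my own choices):

```python
import numpy as np

a0, b0 = 0.0, 0.0
g = np.linspace(-1, 1, 2001)
A, B = np.meshgrid(g, g)
cell = (g[1] - g[0]) ** 2
K = 0.5 * ((A + B) - (a0 + b0)) ** 2                # K(a, b) for this model
prior = 0.25                                        # uniform prior on the unit square

def V(eps):
    return np.sum(K < eps) * cell * prior

scale, eps = 2.0, 1e-4
print(np.log(V(scale * eps) / V(eps)) / np.log(scale))   # roughly 1/2: one redundant direction
```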
Consider a two-parameter alternatively-parameterized normal distribution: $p(x\mid a,b)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(x-a^{k_1}b^{k_2}\right)^2\right)$, where $k_1,k_2\in\mathbb{N}$ are fixed constants. Assume the true distribution is realizable with true parameters $a_0$ and $b_0$.
Calculate the Fisher information matrix of the model. For what parameters is the model singular? For singular parameters, when is the rank of the matrix zero, and when is it one?
Compute the learning coefficient using the same procedure as 1b)–1d), for $a_0b_0=0$.
In the Gray Book, Watanabe proves that any model with KL divergence of the form $K(w_1,w_2,\dots,w_r)=w_1^{2k_1}w_2^{2k_2}\cdots w_r^{2k_r}$, known as normal crossing form, has learning coefficient $\lambda=\frac{1}{2\max\{k_1,k_2,\dots,k_r\}}$ (in the case where the prior is non-vanishing). This result is more than a special case: Watanabe showed that any statistical model may (locally) be converted into normal crossing form, via a change of variables from algebraic geometry known as resolution of singularities.
Show that your answer from part b) matches Watanabe’s formula.
The general form of the asymptotic volume scaling formula is $V(\epsilon)\propto\epsilon^\lambda(-\log\epsilon)^{m-1}$, where λ is the learning coefficient and m is the multiplicity. So far we have been able to ignore the multiplicity, because we have only considered examples where m=1. However, in this example, if $k_1=k_2$, then the multiplicity m=2. Show that the general form of the volume scaling formula holds when $k_1=k_2$.
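A numerical check of 5d) can look like the following sketch (assuming numpy; the choice k=2, the grid, and the ϵ range are mine). The point is that the extra logarithmic factor is needed to make the ratio level off:

```python
import numpy as np

k = 2                                               # k1 = k2 = k, true parameter with a0*b0 = 0
g = np.linspace(-1, 1, 2001)
A, B = np.meshgrid(g, g)
cell = (g[1] - g[0]) ** 2
K = 0.5 * (A**k * B**k) ** 2

eps = np.logspace(-8, -3, 20)
V = np.array([np.sum(K < e) * cell for e in eps])

lam, m = 1 / (2 * k), 2
print(V / (eps**lam * (-np.log(eps)) ** (m - 1)))   # drifts slowly, levels off as eps -> 0
print(V / eps**lam)                                 # keeps growing like -log(eps)
```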
*Repeat 3), but with the KL divergence K(a,b) from problem 5) instead. Vary a0, b0, k1, k2, and n, and see how the functions change. In particular, note what happens near a0b0=0. (Math3D is helpful here.)
Many practical machine learning models are conditional—they have inputs and outputs. Sometimes, as in the case of regression models, they may not naively appear to be probabilistic models at all. These properties are both demonstrated in univariate linear regression, a model given by y=mx+b, where x∈R is the input, m∈R is the slope parameter and b∈R is the intercept parameter. [Example 1.4, Gray Book].
Assume we are doing least-squares regression (the loss function is mean squared error loss). Show how to write this model as a probabilistic statistical model: that is, find a formula for the model’s conditional probability p(y|x,m,b) such that the negative log-likelihood yields mean squared error loss. (Hint.)
Assuming the true distribution q(y|x) is realizable with true slope m0 and true intercept b0, calculate the Fisher information matrix of this model. Use this to show that the model is regular.
*Consider a constant-width, two-layer linear network y=BAx, where $x\in\mathbb{R}^n$ are the inputs, $y\in\mathbb{R}^n$ are the outputs, and $A,B\in\mathbb{R}^{n\times n}$ are linear transformations[5]. The parameter space is $W=\mathbb{R}^{2n^2}$ and the parameters are the pair $(A,B)\in W$.
Show that this two-layer linear model is just as expressive as an ordinary linear model; they both can express all linear functions from $\mathbb{R}^n$ to $\mathbb{R}^n$.
Let us examine the singular directions of this model; that is, directions in parameter space which do not change the function implemented by the model, to first order. More precisely, the parameter-function map can be seen as a (nonlinear) map $f:\mathbb{R}^{2n^2}\to\mathbb{R}^{n\times n}$, from parameters in $\mathbb{R}^{2n^2}$ to linear transformations in $\mathbb{R}^{n\times n}$. A tangent direction $\delta w$ in parameter space is a singular direction at a parameter $w$ if the directional derivative $D_{\delta w}f(w)=0$.[6]
Assume A and B are full rank. What are the singular directions?
Assume A and/or B are not full rank. What are the singular directions now? How does the number of singular directions depend on the rank of A and B?
Interpret these results. How does this make the two-layer linear model differ from the ordinary linear model?
Note that the set of parameters where A and/or B are not full rank has measure zero in parameter space, which could lead to the interpretation that such a scenario will never occur in practice, and that we should ignore the new singular directions that occur there. Why is this interpretation misleading? What happens if A and/or B are almost not full rank (have very small eigenvalues)?
In the realizable case, Aoyagi & Watanabe (2005) derived a formula for the learning coefficient λ, in terms of the rank and dimensions of the true parameters A0 and B0 [Main Theorem].
Aoyagi & Watanabe do not assume the network is constant-width, like we do. Simplify their formula for λ using this assumption.
How does the learning coefficient λ depend on r, the rank of B0A0? Compare with the results of part b).
Repeat b.iii), but with λ instead.
Repeat b.iv), but with λ instead. (Hint: see problem 2d).)
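For 8b), one way to check the count of singular directions numerically is to build the Jacobian of the parameter-function map $(A,B)\mapsto BA$ and compute its rank. A sketch, assuming numpy (the width n and the example ranks are my own choices):

```python
import numpy as np

def num_singular_directions(n, rank_A, rank_B, seed=0):
    rng = np.random.default_rng(seed)
    # Random matrices of the requested ranks.
    A = rng.normal(size=(n, rank_A)) @ rng.normal(size=(rank_A, n))
    B = rng.normal(size=(n, rank_B)) @ rng.normal(size=(rank_B, n))
    # d(BA) = B dA + dB A, so in vectorized form the Jacobian blocks are
    # Kronecker products (the vec convention does not matter for the rank).
    J_A = np.kron(np.eye(n), B)                     # d vec(BA) / d vec(A)
    J_B = np.kron(A.T, np.eye(n))                   # d vec(BA) / d vec(B)
    J = np.hstack([J_A, J_B])                       # shape (n^2, 2 n^2)
    return 2 * n * n - np.linalg.matrix_rank(J)     # dimension of the null space

for rA, rB in [(3, 3), (2, 2), (1, 1)]:
    print(rA, rB, num_singular_directions(3, rA, rB))
```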
The learning coefficient is a fairly global property of the KL divergence; in many cases, it is helpful to have a more local quantity. The local learning coefficient (LLC) of a parameter w∗ is defined to be the learning coefficient of the model with the support of the prior φ(w) restricted to an arbitrarily small neighborhood of w∗.
For illustration, we will consider the parameterized normal model $p(x\mid\mu)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(x-\mu(\mu-2)^2\right)^2\right)$ for true parameter μ0=0.
Find and plot K(μ). Qualitatively, how does K(μ) look near μ=0 versus μ=2?
Compute and compare the LLC of μ∗=0 and μ∗=2. How do these values formalize the qualitative intuition found via a)?
Compute the (global) learning coefficient of the model. In general, the global learning coefficient is the minimum of the local learning coefficients for the minima of K(μ). Confirm that this holds here.
The global learning coefficient is the more important invariant for Bayesian learning, due to e.g. the free energy formula; why might the local learning coefficient be better suited for describing the complexity of a model where only a point value of w is estimated (e.g. using SGD)?
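A sketch of how to estimate the local learning coefficients in 9b) numerically, by restricting the volume-scaling computation to a small neighborhood of each point (the neighborhood radius, grid, and ϵ are my own choices):

```python
import numpy as np

K = lambda mu: 0.5 * (mu * (mu - 2) ** 2) ** 2      # K(mu) for mu0 = 0

def local_lambda(mu_star, radius=0.5, a=2.0, eps=1e-7):
    mu = np.linspace(mu_star - radius, mu_star + radius, 2_000_001)
    dmu = mu[1] - mu[0]
    V = lambda e: np.sum(K(mu) < e) * dmu           # prior restricted to the neighborhood
    return np.log(V(a * eps) / V(eps)) / np.log(a)

print(local_lambda(0.0), local_lambda(2.0))         # compare with your answers to 9b)
```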
*Up to this point, we have implicitly assumed that we have direct access to the true distribution q(x), for instance when calculating the learning coefficient using the KL divergence K(w). However, in practice, we can only access the true distribution indirectly, via n samples Dn={X1,X2,…,Xn} drawn IID from q(x). This is a significant gap.
Still, much like adding air resistance to an idealized physics problem, moving from idealized population quantities to realistic empirical quantities adds new complications, but many of the fundamental intuitions continue to hold.
We may use the (average) negative log-likelihood $L_n(w)=-\frac{1}{n}\sum_i\log p(X_i\mid w)$[7] as a proxy for the KL divergence $K(w)=\int q(x)\log\frac{q(x)}{p(x\mid w)}\,dx$. Prove that $\mathbb{E}_{D_n}[L_n(w)]=K(w)+S$, where $S=-\int q(x)\log q(x)\,dx$ is the (constant) entropy of the true distribution.[8]
Take the model from 2) with μ0=0 and numerically sample from q(x) for n=10, n=100, and n=1000.
Plot Ln(μ) compared to K(μ)+S. How do the two differ, and what changes as n increases?
Plot the unnormalized Bayesian posterior[9] $e^{-nL_n(\mu)}$ compared to $e^{-nK(\mu)}$. How do the two differ, and what changes as n increases? It is much easier to reason about $e^{-nK(\mu)}$ theoretically than about $e^{-nL_n(\mu)}$; what does this suggest we can use $e^{-nK(\mu)}$ for?
Take the model from 2) with μ0=0 and numerically compute V(ϵ) as a function of ϵ using K(μ). Verify that the proportionality $V(\epsilon)\propto\epsilon^\lambda$ holds as expected, by plotting log(V(ϵ)) versus log(ϵ).
Define the empirical version of V(ϵ) as $V_n(\epsilon)=\int_{\{\mu \,:\, L_n(\mu)-\min_\mu L_n(\mu)<\epsilon\}}\varphi(\mu)\,d\mu$. Repeat c), but with $V_n(\epsilon)$ instead of V(ϵ). How do the results differ? How well does the proportionality $V_n(\epsilon)\propto\epsilon^\lambda$ hold for larger ϵ, and how well does it hold for smaller ϵ? Based on what you saw in part b), give an intuitive explanation for why $V_n(\epsilon)$ behaves differently for small ϵ.
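A starting point for the sampling and plotting in 10b) (assuming numpy and matplotlib; for the cubic model with μ0=0 the true distribution is the standard normal, and the grid and sample sizes are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mu = np.linspace(-1.5, 1.5, 1001)
S = 0.5 * np.log(2 * np.pi * np.e)                  # entropy of N(0, 1)
K = 0.5 * mu**6                                     # K(mu) for the cubic model, mu0 = 0

for n in [10, 100, 1000]:
    X = rng.normal(size=n)                          # n IID samples from q = N(0, 1)
    Ln = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((X[:, None] - mu**3) ** 2, axis=0)
    plt.plot(mu, Ln, label=f"L_n, n={n}")
plt.plot(mu, K + S, "k--", label="K + S")
plt.legend(); plt.xlabel("mu"); plt.show()
```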
The tempered Bayesian posterior is the distribution $p(w\mid D_n)=\frac{1}{Z_n}\left(\prod_i p(X_i\mid w)^\beta\right)\varphi(w)=\frac{1}{Z_n}e^{-n\beta L_n(w)}\varphi(w)$, where β is a parameter known as the inverse temperature, and $Z_n=\int e^{-n\beta L_n(w)}\varphi(w)\,dw$ is the normalizing constant known as the partition function or marginal likelihood. Note that the tempered Bayesian posterior reduces to the ordinary Bayesian posterior when β=1. Qualitatively, describe what increasing β does to the tempered Bayesian posterior. How does it differ from the effect of increasing n?
The free energy or negative log marginal likelihood is the quantity $F_n(\beta)=-\log\int\left(\prod_i p(X_i\mid w)^\beta\right)\varphi(w)\,dw=-\log\int e^{-n\beta L_n(w)}\varphi(w)\,dw$,
i.e. the negative log of the normalizing constant of the tempered Bayesian posterior at inverse temperature β. Perhaps the most central result of SLT is the asymptotic expansion of the free energy:
$F_n=n\beta S_n+\lambda\log n+O(\log\log n)$,
where the empirical entropy $S_n=-\frac{1}{n}\sum_i\log q(X_i)$ [Gray Book, Main Formula II].
Interpret the meaning of the free energy from the definition (it is the negative log of the marginal likelihood p(Dn)). Why is it advantageous to choose a model with lower free energy on a given dataset?
Take the model from 2) with μ0=0, β=1, and numerically compute the free energy for many values of n between n=100 and n=10000. Compare the calculated value to the estimate $F_n\approx n\beta S_n+\lambda\log n$.
Repeat b), but with μ0=5.
Repeat b), but with μ0=0.1. Technically, the learning coefficient at μ0=0.1≠0 is still $\lambda=\frac{1}{2}$, the same as in part c), but note how the behavior of the free energy seems closer to b) than c) for low n. The effects of the singularity at μ=0 extend to the neighborhood around it.
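A sketch for 12b) (assuming numpy and scipy; the prior is taken to be uniform on a wide grid purely for convenience, and the grid and n values are my own choices). The free energy of this one-dimensional model can be computed by direct quadrature:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
mu = np.linspace(-3, 3, 200_001)
dmu = mu[1] - mu[0]
lam = 1 / 6                                         # learning coefficient of the cubic model at mu0 = 0

for n in [100, 1000, 10000]:
    X = rng.normal(size=n)                          # q = N(0, 1) since mu0 = 0
    model_mean = mu**3
    Ln = 0.5 * np.log(2 * np.pi) + 0.5 * (np.mean(X**2) - 2 * model_mean * np.mean(X) + model_mean**2)
    Fn = -(logsumexp(-n * Ln) + np.log(dmu))        # F_n = -log integral of e^{-n L_n(mu)} d mu
    Sn = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean(X**2)
    print(n, Fn, n * Sn + lam * np.log(n))          # the two should track each other as n grows
```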
The local free energy of a subset $W'\subset W$ of parameter space is the free energy of the model with the parameter space restricted to $W'$. We show how phase transitions can occur when different local free energies change at different rates.
Suppose we partition the overall parameter space into two subsets, $W_1$ and $W_2$. Denote the overall free energy by F(n) and the local free energies by $F_1(n)$ and $F_2(n)$, respectively. Show that the relationship $F(n)=-\log\left(e^{-F_1(n)}+e^{-F_2(n)}\right)$ holds.
Suppose $F_1(n)\approx 0.3n+20\log n$ and $F_2(n)\approx 0.5n+2\log n$. Plot the overall free energy F(n). Compare against $\min\{F_1(n),F_2(n)\}$. What happens around n=570?
In statistical physics, a phase transition is traditionally defined as a discontinuity (or rapid change, for finite-size systems) in the derivatives of the free energy. Plot $\frac{d}{dn}F(n)$ and explain why this justifies calling the phenomenon from b) a phase transition.
Change the coefficients of the terms in b). How does this change when a phase transition occurs? Does a phase transition always occur?
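A sketch covering 13b) and 13c) (assuming numpy and matplotlib; the coefficients are the ones given in the problem):

```python
import numpy as np
import matplotlib.pyplot as plt

n = np.linspace(1, 2000, 4000)
F1 = 0.3 * n + 20 * np.log(n)
F2 = 0.5 * n + 2 * np.log(n)
F = -np.logaddexp(-F1, -F2)                         # F = -log(e^{-F1} + e^{-F2}), computed stably

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(n, F, label="F(n)")
ax1.plot(n, np.minimum(F1, F2), "--", label="min(F1, F2)")
ax1.legend()
ax2.plot(n, np.gradient(F, n))                      # the derivative changes rapidly near n = 570
ax2.set_xlabel("n")
plt.show()
```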
We sometimes informally refer to singular models as “models where you can (to first order) move in some direction in parameter space without changing the function implemented by the model.” However, singular models are defined in terms of the Fisher information matrix; it may not be immediately clear how the informal statement follows.
Suppose we have a (potentially nonlinear) regression model given by a map $f:W\to\mathcal{F}$ from a parameter space $W=\mathbb{R}^d$ to a function space $\mathcal{F}$ of functions with outputs in $\mathbb{R}^n$, for which we use mean squared error loss.[10] We may write this as a statistical model: $p(y\mid x,w)=\mathcal{N}(f(w)(x),I)=(2\pi)^{-n/2}\exp\left(-\frac{1}{2}\lVert y-f(w)(x)\rVert^2\right)$,
where $\mathcal{N}$ denotes the multivariate normal distribution and I is the identity matrix [See here].
Show that $\nabla_w\log p(y\mid x,w)=(y-f(w)(x))^T J_{f(w)}(x)$, where $J_{f(w)}(x)$ is the Jacobian of f(w)(x) with respect to w.
Suppose that the model is realizable such that the true parameter is w; that is, q(x,y)=p(y|x,w)q(x) for true input distribution q(x). Prove that the Fisher information matrix is given by $I(w)=\mathbb{E}_{x\sim q(x)}\left[J_{f(w)}(x)^T J_{f(w)}(x)\right]$.
Prove that a vector v is in the null space of the Fisher information matrix I(w) if and only if it is in the null space of $J_{f(w)}(x)$ for all x in the support of q(x).
Conclude that a regression model is singular at a parameter w∈W if and only if there exists a nonzero vector v such that the directional derivative $\nabla_v f(w)(x)=0$ for all inputs x in the support of q(x).
In the case where the support of q(x) is the entire input space, show that the null space of the Fisher information matrix I(w) is equal to the null space of the total derivative $Df(w)$ of the parameter-function map at w.
The naive method of determining “how singular a model is” at a particular parameter w is by counting the number of singular directions at w; more precisely, using the rank of the Fisher information matrix[11] at w.
Construct two statistical models where the rank of the Fisher information matrix at the true parameter w0 is the same, but the learning coefficient differs between the two. (Hint: modify the model from problem 2).)
In this example, what information is the learning coefficient giving you that is missing from the rank of the Fisher information matrix?
The rank of the Fisher information matrix does give some information, however. Show that if the rank of the Fisher information matrix at w0 is r, the learning coefficient satisfies the lower bound $\lambda\ge\frac{r}{2}$. [Hard.]
In practice, the Hessian of the loss function is a more directly visible quantity than the Fisher information matrix; however, the two are closely connected.
Recall the population loss is given by $L(w)=-\int q(x)\log p(x\mid w)\,dx$. Suppose that the true distribution is realizable with true parameter w0. Prove that the Hessian of the population loss at w0 is equal to the Fisher information matrix at w0.
Recall the empirical loss is given by $L_n(w)=-\frac{1}{n}\sum_i\log p(x_i\mid w)$, where $x_i\sim q(x)$ IID. The empirical Fisher information matrix is given by $I_n(w)=\frac{1}{n}\sum_i\nabla\log p(x_i\mid w)\,\nabla\log p(x_i\mid w)^T$. Suppose that the true distribution is realizable with true parameter w0. Prove that the Hessian of the empirical loss at w0 is equal to the empirical Fisher information matrix at w0.
Suppose that a direction $\delta w\in T_wW$ is singular at a parameter w∈W; that is, δw lies in the null space of the Fisher information matrix at w. Prove that it also lies in the null space of the population Hessian and empirical Hessian at w. (Note that unlike a) or b), you do not need to assume realizability or even that w is a local minimum of the loss.)
If one finds empirically that the loss Hessian has many zero eigenvalues at a parameter, what does this suggest about the Fisher information matrix at that parameter, and consequently about how singular the model is?
Here we prove that the learning coefficient is invariant to diffeomorphism; that is, roughly, a smooth and invertible change of variables. A diffeomorphism is an invertible map $\phi:W_1\to W_2$ between two spaces $W_1,W_2$ such that both $\phi$ and its inverse $\phi^{-1}$ are infinitely differentiable.
Show that $\phi_1(x)=3x+2$ is a diffeomorphism from $\mathbb{R}$ to $\mathbb{R}$, and $\phi_2(x)=\sin(x)$ is a diffeomorphism from $(-\frac{\pi}{2},\frac{\pi}{2})$ to $(-1,1)$. Explain why $\phi_2(x)$ is not a diffeomorphism from $[-\frac{\pi}{2},\frac{\pi}{2})$ to $[-1,1)$.
Let $p(x\mid w_2)$ be a model defined on the parameter space $W_2$. We apply a diffeomorphism $\phi:W_1\to W_2$ to create the model $p(x\mid\phi(w_1))$ defined on the parameter space $W_1$. Abusing notation, let $K(w_1)=\mathrm{KL}(q(x)\,\|\,p(x\mid\phi(w_1)))$ denote the KL divergence of the model on $W_1$ and $K(w_2)=\mathrm{KL}(q(x)\,\|\,p(x\mid w_2))$ the KL divergence of the model on $W_2$. Define $V_1(\epsilon)=\int_{\{w_1\,:\,K(w_1)<\epsilon\}}dw_1$ and $V_2(\epsilon)=\int_{\{w_2\,:\,K(w_2)<\epsilon\}}dw_2$. Rewrite $V_1(\epsilon)$ as an integral over $W_2$ using the change of variables formula.
Using the result from part b) together with the fact that ϕ is a diffeomorphism, show that there exist positive constants $c_1,c_2$ such that $c_1V_2(\epsilon)\le V_1(\epsilon)\le c_2V_2(\epsilon)$.
Using the result from c) together with the volume scaling formula for the learning coefficient λ and its multiplicity m, prove that $V_1(\epsilon)\propto V_2(\epsilon)\propto\epsilon^\lambda(-\log\epsilon)^{m-1}$ as ϵ→0. Conclude that the learning coefficient and multiplicity of the two models are the same, and thus invariant to diffeomorphism.
The role of the partition function Zn(β) and free energy Fn(β) can be partially explained by their role as moment and cumulant generating functions of the posterior loss. This connection also leads to the scalable estimators of the free energy and learning coefficient used in practice.
Let the random variable $\tilde{w}_\beta$ be a sample from the tempered posterior $\tilde{w}_\beta\sim\frac{1}{Z_n}e^{-n\beta L_n(w)}\varphi(w)$. Then $L_n(\tilde{w}_\beta)$ is a real-valued random variable giving the negative log-likelihood of a randomly sampled parameter from the tempered posterior.
Recall that the moment generating function of a random variable X with PDF p(x) is given by $M_X(s)=\mathbb{E}[e^{sX}]=\int_{-\infty}^{\infty}e^{sx}p(x)\,dx$, and the partition function is given by $Z_n(\beta)=\int e^{-n\beta L_n(w)}\varphi(w)\,dw$.
Show that $M_{nL_n(\tilde{w}_\beta)}(s)=\frac{Z_n(\beta-s)}{Z_n(\beta)}$.
The i-th moment $m_i$ of X is given by $m_i=\frac{d^i}{ds^i}M_X(s)\big|_{s=0}$. Compute $m_i$ for $nL_n(\tilde{w}_\beta)$.
Recall the cumulant generating function of a random variable X with PDF p(x) is given by $K_X(s)=\log M_X(s)=\log\int_{-\infty}^{\infty}e^{sx}p(x)\,dx$, and the free energy is given by $F_n(\beta)=-\log\int e^{-n\beta L_n(w)}\varphi(w)\,dw$.
Show that $K_{nL_n(\tilde{w}_\beta)}(s)=F_n(\beta)-F_n(\beta-s)$.
The i-th cumulant $\kappa_i$ of X is given by $\kappa_i=\frac{d^i}{ds^i}K_X(s)\big|_{s=0}$. Show that $\kappa_i=(-1)^{i+1}\frac{d^i}{ds^i}F_n(s)\big|_{s=\beta}$ for $nL_n(\tilde{w}_\beta)$.
We may use b.ii) to estimate Fn(1) or λ empirically from samples, far more efficiently than naive methods. The estimator of Fn(1) is known as the WBIC.
Using b.ii), conclude that $\mathbb{E}_{\tilde{w}_\beta}[nL_n(\tilde{w}_\beta)]=\frac{d}{ds}F_n(s)\big|_{s=\beta}$.
The free energy formula can be written as $F_n(\beta)\approx n\beta S_n+\lambda\log(n\beta)$. There exists an optimal inverse temperature $\beta^*$ such that $\mathbb{E}_{\tilde{w}_{\beta^*}}[nL_n(\tilde{w}_{\beta^*})]=F_n(1)$. Find an approximate expression for $\beta^*$ using i), assuming that the free energy formula holds after differentiating both sides, i.e. $\frac{d}{d\beta}F_n(\beta)\approx\frac{d}{d\beta}\left(n\beta S_n+\lambda\log(n\beta)\right)$.
Instead of estimating $F_n(\beta)$, we may estimate the learning coefficient λ directly. Use i) and the free energy formula to get an approximate expression for λ in terms of β, $S_n$, and $\mathbb{E}_{\tilde{w}_\beta}[nL_n(\tilde{w}_\beta)]$. Is it still necessary to set β to an optimal inverse temperature for this estimator to work?
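For a one-dimensional model the tempered-posterior expectation can be computed by quadrature rather than MCMC, so the λ estimator from the last part can be tried directly. A sketch for the cubic model at μ0=0 (assuming numpy and scipy; the sample size, the choice β=1/log n, and the grid are my own choices):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)                              # q = N(0, 1), i.e. mu0 = 0

mu = np.linspace(-3, 3, 200_001)
model_mean = mu**3
Ln = 0.5 * np.log(2 * np.pi) + 0.5 * (np.mean(X**2) - 2 * model_mean * np.mean(X) + model_mean**2)
Sn = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean(X**2)

beta = 1 / np.log(n)                                # a common choice of inverse temperature
logw = -n * beta * Ln
w = np.exp(logw - logsumexp(logw))                  # tempered posterior weights on the grid
lam_hat = beta * (np.sum(w * n * Ln) - n * Sn)      # the expression for lambda derived above
print(lam_hat)                                      # compare with lambda = 1/6; expect only rough agreement
```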
A priori, it seems like the choice of prior distribution φ(w) could significantly affect the learning coefficient we get, via the volume scaling formula or otherwise. However, this is not the case—the learning coefficient remains the same so long as the prior is supported at μ0 [Gray Book, Remark 6.7]. Thus, the prior has at most a secondary role in singular learning theory.
To demonstrate this, redo 1c) and 1d) but with a different prior distribution that also has support at μ0.
References
[The “Gray Book”] Algebraic Geometry and Statistical Learning Theory (Watanabe 2009)
[The “Green Book”] Mathematical Theory of Bayesian Statistics (Watanabe 2018)
Stochastic complexities of reduced rank regression in Bayesian estimation (Aoyagi & Watanabe 2005)
Footnotes

[1] Technically, if we follow Watanabe’s terminology, a regular model is non-strictly-singular rather than non-singular. Watanabe defines singular models as a superset of regular models, so every regular model is also singular—non-regular models are referred to as strictly singular models.
[2] Technically, the learning coefficient is not the same thing as the real log-canonical threshold (RLCT); the learning coefficient is an invariant of a statistical system (model, truth, prior triplet), whereas the RLCT is an invariant of an analytic function. However, the RLCT of the model/truth KL divergence coincides with the learning coefficient if the prior is supported along the true parameter set.
[3] Note that in practice, analytically calculating or numerically estimating the learning coefficient directly via this volume scaling formula is completely intractable. Instead, methods based on the WBIC and MCMC are necessary.
[4] In the case where the multiplicity m=1.
[5] This kind of two-layer linear model is sometimes called reduced rank regression.
[6] By 14d), this is equivalent to the null space of the FIM.
[7] Note that the Gray Book uses Ln(w) to refer to the likelihood instead of the negative log-likelihood.
[8] Note that this pointwise expected value only tells us about Ln(w) for a fixed w; it does not give us enough information to talk about properties of Ln(w) which depend on many w, like volume scaling, the free energy, etc. Establishing this is highly nontrivial, and the Gray Book spends a significant amount of time doing so.
[9] Implicitly, with improper prior φ(w)=1, but the prior isn’t important for intuition here.
[10] A neural network under mean squared error loss would satisfy these assumptions.
[11] In some cases, the rank of the Hessian may also be used here, given the correspondence between the FIM and the Hessian (see problem 16).