Zeta Functions in Singular Learning Theory
In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.
The story is this: we have a prior $\varphi(w)$, a model $p(x \mid w)$, and an unknown true distribution $q(x)$. For model selection, we are interested in the evidence of our model for a data set $D_n = \{x_1, \dots, x_n\}$, which is given by
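$$Z_n = \int p(D_n \mid w)\,\varphi(w)\,dw = \int e^{-n L_n(w)}\,\varphi(w)\,dw \propto \int e^{-n K_n(w)}\,\varphi(w)\,dw,$$
where $L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(x_i \mid w)$ is the empirical negative log-likelihood and $K_n(w) = \frac{1}{n}\sum_{i=1}^n \log \frac{q(x_i)}{p(x_i \mid w)}$ is the empirical KL divergence. The two differ only by a term that does not depend on $w$, which is why the last step is a proportionality. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average is then given by
$$\overline{Z}(n) := \mathbb{E}_{D_n \sim q^n}[Z_n] = \int e^{-n K(w)}\,\varphi(w)\,dw,$$
where $K(w) = \int q(x) \log \frac{q(x)}{p(x \mid w)}\,dx$ is the Kullback-Leibler divergence.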
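To make $K_n$ and $K$ concrete, here is a minimal sketch under an assumed toy setup that is not from the talk: the true distribution is $q = \mathcal{N}(0, 1)$ and the model is $p(\cdot \mid w) = \mathcal{N}(w, 1)$, for which $K(w) = w^2/2$. The function names are hypothetical. The script samples a data set from $q$ and compares the empirical KL divergence with the exact one.

```python
import math
import random


def log_normal_pdf(x, mean):
    """Log density of a unit-variance Gaussian with the given mean."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def empirical_kl(w, data):
    """K_n(w): average of log q(x_i) - log p(x_i | w) over the data set."""
    total = sum(log_normal_pdf(x, 0.0) - log_normal_pdf(x, w) for x in data)
    return total / len(data)


def true_kl(w):
    """K(w) for the assumed toy pair q = N(0, 1), p(. | w) = N(w, 1)."""
    return 0.5 * w ** 2


# Draw a data set D_n from q with a fixed seed so the run is reproducible.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

w = 0.5
print("K_n(w) =", empirical_kl(w, data))
print("K(w)   =", true_kl(w))
```

For large $n$ the two numbers should be close, which is the sense in which $K_n$ is an empirical estimate of $K$.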
But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about.
The answer: by computing a different integral. Below, I explain how the integral above connects to two other integrals.
Let
$$v(t) := \int \delta(t - K(w))\,\varphi(w)\,dw,$$
which is called the state density function. Here, $\delta$ is the Dirac delta function. For each $t$, it measures the density of states (= parameter vectors) that have $K(w) = t$. It is thus a measure for the “size” of the different level sets. This state density function is connected to two different things.
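The state density can be estimated numerically by binning. The sketch below assumes a toy setup that is not from the talk: $K(w) = w^2$ with a uniform prior on $[-1, 1]$, for which $v(t) = 1/(2\sqrt{t})$ on $0 < t < 1$. The function names and the bin width are illustrative choices. It samples parameters from the prior and measures the prior mass of a thin shell around the level set $K(w) = t$.

```python
import math
import random


def K(w):
    """Toy divergence (an assumption for illustration): K(w) = w^2."""
    return w * w


def estimate_state_density(t, width=0.02, num_samples=400_000, seed=0):
    """Monte Carlo estimate of v(t).

    Counts prior samples whose K-value falls in a bin of the given width
    centred at t, then divides the bin's prior mass by the width.
    """
    rng = random.Random(seed)
    lo, hi = t - width / 2.0, t + width / 2.0
    hits = 0
    for _ in range(num_samples):
        w = rng.uniform(-1.0, 1.0)  # draw from the uniform prior on [-1, 1]
        if lo <= K(w) < hi:
            hits += 1
    return hits / (num_samples * width)


def exact_state_density(t):
    """Closed form for this toy setup: v(t) = 1 / (2 sqrt(t)), 0 < t < 1."""
    return 0.5 / math.sqrt(t)


print("estimate:", estimate_state_density(0.25))
print("exact:   ", exact_state_density(0.25))
```

The estimate carries both Monte Carlo noise and a small bias from the finite bin width, so it should only be expected to match the closed form to a couple of digits.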
Laplace Transform to the Evidence
First of all, it is connected to the evidence above. Namely, let $\mathcal{L}(v)$ be the Laplace transform of $v$. It is a function $\mathcal{L}(v): \mathbb{N} \to \mathbb{R}$ given by
$$[\mathcal{L}(v)](n) := \int_0^\infty e^{-nt}\,v(t)\,dt = \int \left[\int_0^\infty \delta(t - K(w))\,e^{-nt}\,dt\right] \varphi(w)\,dw = \int e^{-nK(w)}\,\varphi(w)\,dw = \overline{Z}(n).$$
In the first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that $\mathcal{L}(v) = \overline{Z}$! So this means we essentially just need to understand $v$.
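The identity between the Laplace transform of $v$ and the averaged evidence can be checked numerically. This is a minimal sketch under an assumed toy setup that is not from the talk: $K(w) = w^2$ with a uniform prior on $[-1, 1]$, so that $v(t) = 1/(2\sqrt{t})$ for $0 < t < 1$ and $v(t) = 0$ otherwise. Both sides are computed with a plain midpoint rule; the function names are hypothetical.

```python
import math


def state_density(t):
    """v(t) for the toy setup K(w) = w^2, uniform prior on [-1, 1]."""
    return 0.5 / math.sqrt(t)


def laplace_of_state_density(n, num_steps=200_000):
    """Midpoint-rule estimate of the integral of e^{-n t} v(t) over (0, 1).

    v vanishes for t > 1 in this toy setup, so the upper limit is 1.
    """
    h = 1.0 / num_steps
    total = 0.0
    for k in range(num_steps):
        t = (k + 0.5) * h
        total += math.exp(-n * t) * state_density(t)
    return total * h


def averaged_evidence(n, num_steps=20_000):
    """Midpoint-rule estimate of the integral of e^{-n K(w)} phi(w) over [-1, 1]."""
    h = 2.0 / num_steps
    total = 0.0
    for k in range(num_steps):
        w = -1.0 + (k + 0.5) * h
        total += math.exp(-n * w * w) * 0.5  # phi(w) = 1/2
    return total * h


n = 10
print("Laplace transform of v:", laplace_of_state_density(n))
print("averaged evidence:     ", averaged_evidence(n))
```

Because $v$ has an integrable singularity at $t = 0$, the midpoint rule converges slowly on the Laplace side, so the two numbers should only be expected to agree to roughly three decimal places at this resolution.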
Mellin Transform to the Zeta Function
But how do we compute $v$? By using another transform. Let $\mathcal{M}(v)$ be the Mellin transform of $v$. It is a function $\mathcal{M}(v): \mathbb{C} \to \mathbb{C}$ (or maybe only defined on part of $\mathbb{C}$?) given by
$$[\mathcal{M}(v)](z) := \int_0^\infty t^z\,v(t)\,dt = \int \left[\int_0^\infty \delta(t - K(w))\,t^z\,dt\right] \varphi(w)\,dw = \int K(w)^z\,\varphi(w)\,dw.$$
Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a zeta function.
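As a worked instance, assume the toy setup $K(w) = w^2$ with a uniform prior on $[-1, 1]$ (an illustration, not taken from the talk). With the $t^z$ convention used above, the zeta function is $\int_{-1}^{1} (w^2)^z \cdot \tfrac{1}{2}\,dw = \frac{1}{2z + 1}$ wherever the integral converges, and this expression has a single pole at $z = -\tfrac{1}{2}$. The sketch below compares a midpoint-rule estimate with that closed form for a few real values of $z$; the function names are hypothetical.

```python
def zeta_numeric(z, num_steps=20_000):
    """Midpoint-rule estimate of the integral of K(w)^z phi(w) over [-1, 1].

    Toy assumptions: K(w) = w^2 and phi(w) = 1/2. With an even number of
    steps no midpoint lands exactly on w = 0.
    """
    h = 2.0 / num_steps
    total = 0.0
    for k in range(num_steps):
        w = -1.0 + (k + 0.5) * h
        total += (w * w) ** z * 0.5
    return total * h


def zeta_exact(z):
    """Closed form for the toy setup: 1 / (2 z + 1)."""
    return 1.0 / (2.0 * z + 1.0)


for z in (0.5, 1.0, 2.0):
    print(z, zeta_numeric(z), zeta_exact(z))
```

Only real $z \geq 0.5$ is evaluated here so that the integrand stays smooth enough for the midpoint rule; closer to the pole the integrand blows up near $w = 0$ and a naive quadrature becomes unreliable.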
What’s this useful for?
The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as
$$\overline{Z} = \mathcal{L}(v) = \mathcal{L}\big(\mathcal{M}^{-1}(\mathcal{M}(v))\big).$$
Thus, we have essentially reduced our problem to studying the zeta function $\mathcal{M}(v)$. To compute the integral that defines the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of $K(w)$, which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.
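To illustrate why the zeta function is informative, here is a minimal sketch under the assumed toy setup $K(w) = w^2$ with a uniform prior on $[-1, 1]$ (not taken from the talk). In that setup the zeta function $\frac{1}{2z+1}$ has its pole at $z = -\tfrac{1}{2}$, and the averaged evidence has the closed form $\overline{Z}(n) = \tfrac{1}{2}\sqrt{\pi/n}\,\operatorname{erf}(\sqrt{n})$. As far as I understand the standard result in singular learning theory, the largest pole $-\lambda$ controls the leading behaviour $\overline{Z}(n) \approx C\,n^{-\lambda}$ up to logarithmic factors, so the log-log slope of $\overline{Z}(n)$ should approach $-\tfrac{1}{2}$ here. The function names are hypothetical.

```python
import math


def averaged_evidence(n):
    """Closed form of the integral of e^{-n w^2} * (1/2) over [-1, 1]."""
    return 0.5 * math.sqrt(math.pi / n) * math.erf(math.sqrt(n))


def loglog_slope(n1, n2):
    """Slope of log Z(n) against log n between two sample sizes."""
    rise = math.log(averaged_evidence(n2)) - math.log(averaged_evidence(n1))
    run = math.log(n2) - math.log(n1)
    return rise / run


# For large n the slope should be close to -1/2, matching the pole at z = -1/2.
print(loglog_slope(1_000, 4_000))
```

Equivalently, the free energy $-\log \overline{Z}(n)$ grows like $\tfrac{1}{2}\log n$ in this toy case, which is the kind of statement the free energy formula makes in general.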