Disclaimer: I’m by no means an expert on singular learning theory and what I present below is a simplification that experts might not endorse. Still, I think it might be more comprehensible for a general audience than going into digressions about blowing up singularities and birational invariants.
Here is my current understanding of what singular learning theory is about in a simplified (though perhaps more realistic?) discrete setting.
Suppose you represent a neural network architecture as a map A:2N→F where 2={0,1}, 2N is the set of all possible parameters of A (seen as floating point numbers, say) and F is the set of all possible computable functions from the input and output space you’re considering. In thermodynamic terms, we could identify elements of 2N as “microstates” and the corresponding functions that the NN architecture A maps them to as “macrostates”.
Furthermore, suppose that F comes together with a loss function L:F→R evaluating how good or bad a particular function is. Assume you optimize L using something like stochastic gradient descent on the function L with a particular learning rate.
Then, in general, we have the following results:
SGD defines a Markov chain structure on the space 2N whose stationary distribution is proportional to e−βL(A(θ)) on parameters θ for some positive constant β>0 that depends on the learning rate. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system.
In general A is not injective, and we can define the “A-complexity” of any function f∈Im(A)⊂F as c(f)=Nlog2−log(|A−1(f)|). Then, the probability that we arrive at the macrostate f is going to be proportional to e−c(f)−βL(f).
When L is some kind of negative log-likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm—we raise likelihood ratios to a power β≠1 - insofar as the A-complexity c(f) is a good approximation for the Kolmogorov complexity of the function f, which will happen if the function approximator defined by A is sufficiently well-behaved.
The intuition for why we would expect (3) to be true in practice has to do with the nature of the function approximator A. When c(f) is small, it probably means that we only need a small number of bits of information on top of the definition of A itself to define f, because “many” of the possible parameter values for A are implementing the function f. So f is probably a simple function.
On the other hand, if f is a simple function and A is sufficiently flexible as a function approximator, we can probably implement the functionality of f using only a small number of the N bits in the codomain of A, which leaves us the rest of the bits to vary as we wish. This makes |A−1(f)| quite large, and by extension the complexity c(f) quite small.
The vague concept of “flexibility” mentioned in the paragraph above requires A to have singularities of many effective dimensions, as this is just another way of saying that the image of A has to contain functions with a wide range of A-complexities. If A is a one-to-one function, this clean version of the theory no longer works, though if A is still “close” to being singular (for instance, because many of the functions in its image are very similar) then we can still recover results like the one I mentioned above. The basic insights remain the same in this setting.
I’m wondering what singular learning theory experts have to say about this simplification of their theory. Is this explanation missing some important details that are visible in the full theory? Does the full theory make some predictions that this simplified story does not make?
My impression of singular learning theory
Disclaimer: I’m by no means an expert on singular learning theory and what I present below is a simplification that experts might not endorse. Still, I think it might be more comprehensible for a general audience than going into digressions about blowing up singularities and birational invariants.
Here is my current understanding of what singular learning theory is about in a simplified (though perhaps more realistic?) discrete setting.
Suppose you represent a neural network architecture as a map A:2N→F where 2={0,1}, 2N is the set of all possible parameters of A (seen as floating point numbers, say) and F is the set of all possible computable functions from the input and output space you’re considering. In thermodynamic terms, we could identify elements of 2N as “microstates” and the corresponding functions that the NN architecture A maps them to as “macrostates”.
Furthermore, suppose that F comes together with a loss function L:F→R evaluating how good or bad a particular function is. Assume you optimize L using something like stochastic gradient descent on the function L with a particular learning rate.
Then, in general, we have the following results:
SGD defines a Markov chain structure on the space 2N whose stationary distribution is proportional to e−βL(A(θ)) on parameters θ for some positive constant β>0 that depends on the learning rate. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system.
In general A is not injective, and we can define the “A-complexity” of any function f∈Im(A)⊂F as c(f)=Nlog2−log(|A−1(f)|). Then, the probability that we arrive at the macrostate f is going to be proportional to e−c(f)−βL(f).
When L is some kind of negative log-likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm—we raise likelihood ratios to a power β≠1 - insofar as the A-complexity c(f) is a good approximation for the Kolmogorov complexity of the function f, which will happen if the function approximator defined by A is sufficiently well-behaved.
The intuition for why we would expect (3) to be true in practice has to do with the nature of the function approximator A. When c(f) is small, it probably means that we only need a small number of bits of information on top of the definition of A itself to define f, because “many” of the possible parameter values for A are implementing the function f. So f is probably a simple function.
On the other hand, if f is a simple function and A is sufficiently flexible as a function approximator, we can probably implement the functionality of f using only a small number of the N bits in the codomain of A, which leaves us the rest of the bits to vary as we wish. This makes |A−1(f)| quite large, and by extension the complexity c(f) quite small.
The vague concept of “flexibility” mentioned in the paragraph above requires A to have singularities of many effective dimensions, as this is just another way of saying that the image of A has to contain functions with a wide range of A-complexities. If A is a one-to-one function, this clean version of the theory no longer works, though if A is still “close” to being singular (for instance, because many of the functions in its image are very similar) then we can still recover results like the one I mentioned above. The basic insights remain the same in this setting.
I’m wondering what singular learning theory experts have to say about this simplification of their theory. Is this explanation missing some important details that are visible in the full theory? Does the full theory make some predictions that this simplified story does not make?