I’m kind of puzzled by the amount of machinery going into these arguments, because it seems to me that there is a discrete analog of the same arguments which is probably both more realistic (neural networks are not actually continuous, especially given the steady push toward lower-precision floating-point formats in implementations) and simpler to understand.
Suppose you represent a neural network architecture as a map $A: 2^N \to F$, where $2 = \{0, 1\}$ and $F$ is the set of all possible computable functions from the input and output space you’re considering. In thermodynamic terms, we could identify elements of $2^N$ as “microstates” and the corresponding functions that the NN architecture $A$ maps them to as “macrostates”.
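To make this concrete, here is a minimal toy instance of such an $A$ (my own illustration, not anything canonical): $N = 8$ bits encode the quantized weights and bias of a single two-input threshold unit, and the macrostate is the Boolean function on $\{0,1\}^2$ that it computes. Many microstates collapse onto the same macrostate:

```python
from itertools import product

N = 8  # parameter bits per microstate

def bits_to_int(bits, lo, hi):
    """Decode bits[lo:hi] as a small signed integer."""
    raw = int("".join(map(str, bits[lo:hi])), 2)
    return raw - 2 ** (hi - lo - 1)  # shift to center the range on zero

def A(theta):
    """Map a microstate (a tuple of N bits) to the Boolean function it computes."""
    w1 = bits_to_int(theta, 0, 3)  # 3-bit signed weight
    w2 = bits_to_int(theta, 3, 6)  # 3-bit signed weight
    b = bits_to_int(theta, 6, 8)   # 2-bit signed bias
    # The macrostate is the truth table of the thresholded unit on {0,1}^2.
    return tuple(int(w1 * x1 + w2 * x2 + b > 0)
                 for x1, x2 in product([0, 1], repeat=2))

microstates = list(product([0, 1], repeat=N))
macrostates = {A(theta) for theta in microstates}
print(f"{2 ** N} microstates map onto {len(macrostates)} macrostates")
```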
Furthermore, suppose that $F$ comes together with a loss function $L: F \to \mathbb{R}$ evaluating how good or bad a particular function is, and assume you optimize $L$ using something like stochastic gradient descent with a particular learning rate.
Then, in general, we have the following results:
SGD defines a Markov chain structure on the space $2^N$ whose stationary distribution is proportional to $e^{-\beta L(A(\theta))}$ on parameters $\theta$, for some constant $\beta > 0$. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system.
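I’m not claiming SGD literally is such a chain; but as a stand-in, here is a Metropolis sampler on bitstrings (reusing $A$ and $N$ from the sketch above, with a made-up loss on truth tables) whose stationary distribution is $\propto e^{-\beta L(A(\theta))}$ by construction, which is exactly the idealization the Langevin argument relies on:

```python
import math
import random

beta = 2.0
target = (0, 1, 1, 1)  # pretend the "true" function is OR

def L(f):
    """Stand-in loss: Hamming distance between truth tables."""
    return sum(a != b for a, b in zip(f, target))

def step(theta):
    """One Metropolis step: propose a single bit flip, accept by the usual rule."""
    i = random.randrange(N)
    proposal = theta[:i] + (1 - theta[i],) + theta[i + 1:]
    delta = L(A(proposal)) - L(A(theta))
    if delta <= 0 or random.random() < math.exp(-beta * delta):
        return proposal
    return theta

theta = tuple(random.randint(0, 1) for _ in range(N))
for _ in range(10_000):
    theta = step(theta)
print("final macrostate:", A(theta), "loss:", L(A(theta)))
```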
In general $A$ is not injective, and we can define the “$A$-complexity” of any function $f \in \operatorname{Im}(A) \subset F$ as $c(f) = N \log 2 - \log |A^{-1}(f)|$. Then the probability that we arrive at the macrostate $f$ is proportional to $e^{-c(f) - \beta L(f)}$.
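(The step from microstates to macrostates is just summing the stationary distribution over the preimage:

$$P(f) \;\propto \sum_{\theta \in A^{-1}(f)} e^{-\beta L(A(\theta))} \;=\; |A^{-1}(f)| \, e^{-\beta L(f)} \;=\; 2^N e^{-c(f)} \, e^{-\beta L(f)} \;\propto\; e^{-c(f) - \beta L(f)},$$

using $|A^{-1}(f)| = e^{N \log 2 - c(f)}$ and the fact that the factor $2^N$ does not depend on $f$.)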
When $L$ is some kind of negative log-likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm, insofar as the $A$-complexity $c(f)$ is a good approximation to the Kolmogorov complexity of the function $f$, which will happen if the function approximator defined by $A$ is sufficiently well-behaved.
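Spelling this out (with the caveat that $c(f)$ as defined above is in nats, so the match with Kolmogorov complexity is up to a factor of $\log 2$): if $L(f) = -\log P(D \mid f)$ for data $D$, the stationary distribution is a tempered posterior

$$P(f \mid D) \;\propto\; e^{-c(f)} \, e^{-\beta L(f)} \;=\; e^{-c(f)} \, P(D \mid f)^{\beta},$$

which lines up with the Solomonoff-style posterior $P(f \mid D) \propto 2^{-K(f)} \, P(D \mid f)$ at $\beta = 1$ whenever $c(f) \approx K(f) \log 2$.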
Is there some additional content to singular learning theory that goes beyond the above insights?
Edit: I’ve converted this comment to a post, which you can find here.