The Padding Argument or Simplicity = Degeneracy
[I learned this argument from Lucius Bushnaq and Matthias Dellago. It is also latent already in Solomonoff’s original work]
Consider binary strings of a fixed length L, i.e. elements of {0,1}^L.
Imagine feeding these strings into some Turing machine; we think of the strings as codes for a function. Suppose we have a function that can be coded by a short compressed string s of length k << L. That is, the function is computable by a small program.
Imagine uniformly sampling a random code from {0,1}^L. How many of these codes implement the same function as the string s? The answer is close to 2^{L−k}. Indeed, given the string s of length k, we can ‘pad’ it to a string of length L by writing the code
“run s skip t ”
where t is an arbitrary string of length L−k−c, with c a small constant accounting for the overhead. There are approximately 2^{L−k} such binary strings. So if our programming language has a simple skip / commenting-out functionality, we expect approximately 2^{L−k} codes encoding the same function as s. The fraction of all codes encoding s is therefore about 2^{−k}.
I find this truly remarkable: the degeneracy or multiplicity of a function falls off exponentially with its minimum description length!
Just by sampling codes uniformly at random we get the Simplicity prior!!
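To make the counting concrete, here is a minimal sketch (my own toy model, not part of the original argument): assume a hypothetical interpreter that executes the first k bits of a code as the program and skips the rest, i.e. the overhead constant c is taken to be 0. Enumerating all codes of length L then recovers the 2^{L−k} count and the 2^{−k} fraction directly.

```python
# Toy check of the padding argument (illustrative sketch, not the post's code).
# Assumption: a hypothetical interpreter runs the first k bits as the program
# and skips the remaining bits, so every code with prefix s computes the same
# function as s (overhead constant c = 0 for simplicity).
from itertools import product

L = 12        # total code length, kept tiny so all 2^L codes can be enumerated
s = "0110"    # a hypothetical short program of length k
k = len(s)

codes = ["".join(bits) for bits in product("01", repeat=L)]
equivalent = [c for c in codes if c.startswith(s)]   # padded copies of s

print(len(equivalent), 2 ** (L - k))                 # 256 256
print(len(equivalent) / len(codes), 2 ** (-k))       # 0.0625 0.0625
```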
Why do Neural Networks work? Why do polynomials not work?
It is sometimes claimed that neural networks work well because they are ‘Universal Approximators’. There are multiple problems with this explanation, see e.g. here, but a very basic one is that being a universal approximator is very common. Polynomials are universal approximators!
Many different neural network architectures work. In the limit of large data and compute, the differences between architectures start to vanish and very general scaling laws dominate. This is not the case for polynomials.
Degeneracy = Simplicity explains why: polynomials are uniquely tied down by their coefficients, so a learning machine that tries to fit polynomials does not have a ‘good’ simplicity bias that approximates the Solomonoff prior.
The lack of degeneracy applies to any set of functions that form an orthogonal basis: the decomposition of a function into basis elements is unique, so there is no multiplicity and hence no implicit regularization / simplicity bias.
[I learned this elegant argument from Lucius Bushnaq.]
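A toy illustration of the contrast (my own example, not from the argument above): a one-hidden-unit ReLU network has a continuous rescaling symmetry, so many different parameter settings compute the same function, whereas two polynomials with different coefficient vectors are always different functions.

```python
# Illustrative sketch: parameter degeneracy in a tiny ReLU net vs. none for polynomials.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tiny_net(x, w, a):
    # one hidden ReLU unit: x -> a * relu(w * x)
    return a * relu(w * x)

xs = np.linspace(-3.0, 3.0, 13)

# ReLU is positively homogeneous, so scaling w up and a down by the same
# positive factor leaves the computed function unchanged: genuine degeneracy.
print(np.allclose(tiny_net(xs, w=1.0, a=2.0), tiny_net(xs, w=4.0, a=0.5)))  # True

# Polynomials are pinned down by their coefficients: different coefficient
# vectors give different functions.
p1, p2 = np.poly1d([1.0, 2.0]), np.poly1d([2.0, 1.0])   # x + 2  vs  2x + 1
print(np.allclose(p1(xs), p2(xs)))                      # False
```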
The Singular Learning Theory and Algorithmic Information Theory crossover
I described the padding argument as an argument, not a proof. That's because technically it only gives a lower bound on the number of codes equivalent to the minimal description code. The problem is that there are pathological examples where the programming language (e.g. the UTM) hardcodes that all small codes s encode a single function f.
Once this problem is taken into account, the padding argument is essentially already in Solomonoff's original work. There is a theorem stating that the Solomonoff prior is equivalent to taking a suitable Universal Turing Machine, feeding it a sequence of (uniformly) random bits, and taking the resulting distribution. To account for the pathological examples above, everything is asymptotic and up to a constant, like all results in algorithmic information theory. This means that, like all other results in algorithmic information theory, it's unclear whether it is at all relevant in practice.
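For reference, the statement I have in mind (my paraphrase of the standard result, not a quote from Solomonoff) is: feeding a prefix-free universal machine U uniformly random bits induces the universal prior
\[ m(x) \;=\; \sum_{p \,:\, U(p)=x} 2^{-|p|}, \]
and the coding theorem says this mass is dominated by the shortest program,
\[ -\log_2 m(x) \;=\; K(x) + O(1), \]
which is exactly the asymptotic, up-to-a-constant version of ‘the fraction of codes computing x is about 2^{−K(x)}’.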
However, while this gives a correct proof, I think it understates the importance of the padding argument. That's because in practice I think we shouldn't expect the UTM to be pathological in this way. In other words, for a large enough (overparameterized) architecture we should heuristically expect the fraction of codes yielding f to be basically 2^{−K(f)}, i.e. the complexity K(f) is roughly the negative log of that fraction.
The bull case for SLT is now: there is a direct equality between algorithmic complexity and degeneracy. This has always been SLT dogma, of course, but until I learned about this argument it wasn't so clear to me how direct the connection was. The algorithmic complexity can be usefully approximated by the (local) learning coefficient λ!
EDIT: see Clift-Murfet-Wallbridge and Tom Waring's thesis for more. See the comments below; thanks Dan.
The bull case for algorithmic information theory: the theory of algorithmic information, Solomonoff induction, AIXI, etc. is very elegant and in some sense gives answers to fundamental questions we would like to answer. The major problem was that it is both uncomputable and seemingly intractable. Uncomputability is perhaps not such a problem: uncomputability often arises from measure-zero, highly adversarial examples. But tractability is very problematic. We don't know exactly how hard compression is, but it's likely intractable. However, the padding argument suggests that for a large enough (overparameterized) architecture we should heuristically expect the fraction of codes yielding f to be basically 2^{−K(f)}; in other words, this degeneracy can be measured by the local learning coefficient.
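To spell out the last clause (my gloss, using the standard asymptotics of singular learning theory rather than anything proved in this post): the learning coefficient measures degeneracy through the volume of near-optimal parameters,
\[ V(\varepsilon) \;=\; \operatorname{vol}\{\, w : \ell(w) - \ell(w^\ast) \le \varepsilon \,\} \;\sim\; c\, \varepsilon^{\lambda} (-\log \varepsilon)^{m-1} \quad (\varepsilon \to 0), \]
where ℓ is the population loss and m the multiplicity. A smaller λ means a larger volume of (near-)equivalent parameters; the padding fraction 2^{−K(f)} is the discrete analogue of this volume, which is why one might hope λ tracks K(f) for sufficiently overparameterized models.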
Do Neural Networks actually satisfy the Padding argument?
Short answer: No.
Long answer: Unclear. Maybe… sort of… and the difference might itself be very interesting…!
Stay tuned.
Re: the SLT dogma.
For those interested, a continuous version of the padding argument is used in Theorem 4.1 of Clift-Murfet-Wallbridge to show that the learning coefficient is a lower bound on the Kolmogorov complexity (in a sense) in the setting of noisy Turing machines. Just take the synthesis problem to be given by a TM’s input-output map in that theorem. The result is treated in a more detailed way in Waring’s thesis (Proposition 4.19). Noisy TMs are of course not neural networks, but they are a place where the link between the learning coefficient in SLT and algorithmic information theory has already been made precise.
For what it’s worth, as explained in simple versus short, I don’t actually think the local learning coefficient is algorithmic complexity (in the sense of program length) in neural networks, only that it is a lower bound. So I don’t really see the LLC as a useful “approximation” of the algorithmic complexity.
For those wanting to read more about the padding argument in the classical setting, Hutter-Catt-Quarel “An Introduction to Universal Artificial Intelligence” has a nice detailed treatment.
Thank you for the references, Dan.
I agree neural networks probably don't satisfy the padding argument on the nose, and I agree that the exact degeneracy is quite interesting (as I say at the end of the OP).
I do think that for large enough overparameterization the padding argument suggests the LLC might come close to the K-complexity in many cases. But more interesting to me is that the padding argument doesn't really require the programming language to be Turing-complete. In those cases the degeneracy will be proportional to complexity/simplicity measures that are specific to the programming language (/architecture class). Inshallah I will get to writing something about that soon.