Simple versus Short: Higher-order degeneracy and error-correction
TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length.
The folklore relating the simplicity bias in Bayesian statistics to description length is incomplete: while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic. The content of this complexity measure remains quite mysterious, but in this note we point out that in a particular setting it includes a bias towards runtime error-correction. This suggests caution when reasoning about the role of inductive biases in neural network training.
Acknowledgements. Thanks to Jesse Hoogland, Liam Carroll, Rumi Salazar and Simon Pepin Lehalleur for comments.
1. Background
1.1 Relevance to Deep Learning
Consider the problem of solving an ordinary differential equation. A constructive proof involves actually writing down a solution, or an algorithm that in finite time will produce a solution. The Picard-Lindelöf theorem proves that a solution to a broad class of initial value problems exists, but the proof is not constructive: it sets up a contraction mapping on a complete metric space and appeals to the Banach fixed point theorem.
While the Picard-Lindelöf theorem uniquely characterises the solution as the fixed point of a contraction mapping, and gives an iterative process for approximating the solution, it does not construct the solution. However a construction is not necessary for many of the applications of Picard-Lindelöf (in differential geometry, topology and many parts of analysis). This mode of reasoning about mathematical objects, where it suffices to have characterised[1] them by (universal) properties, is pervasive in modern mathematics (in the above example, the characterising property is the differential equation, or its associated contraction mapping). However this may seem quite alien to a computer scientist or programmer, who for historical reasons tend to think that there is only one mode of reasoning about mathematical objects, and that is centred on the study of a construction.
In an era where programs are increasingly the product of gradient descent rather than human construction, this attitude is untenable. We may have to accept a mode of reasoning about learned programs, based on understanding the nature of the problems to which they are a solution and the iterative processes that produce them. To understand the implicit algorithms learned by neural networks, it may be necessary from this perspective to understand
the computational structures latent in the data distribution, and
the inductive biases of neural network training.
We do not currently have a good understanding of these matters. If we understood these inductive biases better, it could conceivably help us in the context of AI alignment to answer questions like “how likely is deceptive alignment”, “how likely is consequentialism”, and “what goals are instrumentally convergent”?
This note is about the inductive biases of the Bayesian learning process (conditioned on more samples, the posterior increasingly localises around true parameters). Since Bayesian statistics is both fundamental and theoretically tractable, this seems potentially useful for understanding the inductive biases of neural network training. However it is worth noting that the relation between these is not understood at present.
1.2 Singular Learning Theory
The asymptotic expansion of the Bayesian free energy, or “free energy formula″, proven by Watanabe in Singular Learning Theory (SLT) introduces the learning coefficient as a measure of complexity that balances model accuracy in the process of model selection.
In models that are regular or minimally singular the learning coefficient is easy to understand: it is the effective number of parameters in the model, in a suitable sense. More precisely if a parameter is irrelevant, in the sense that any variation does not change the prediction, this leads to a decrease of in the learning coefficient and thus agrees with an effective parameter count in minimally singular models.[2] Therefore, in these cases, the tradeoff between model accuracy and complexity embodied in the minimisation of the free energy is familiar.
However, in more degenerate models such as neural networks, this connection between the learning coefficient and parameter counting breaks down. In particular, the learning coefficient depends on the data distribution. This is not an obstacle to theoretically deriving or empirically estimating the learning coefficient, but it does mean that we may lack good intuitions for what this (positive, rational) number is measuring.
1.3 Minimum Description Length
There is a circle of ideas containing Occam’s razor, Kolmogorov complexity and the Minimum Description Length (MDL) which strongly informs the intuitions in the machine learning community about the meaning of “simplicity” and its role in determining the inductive biases of neural network training. However it is important to note that the mathematical hypotheses required for the attractive coherence among these ideas do not apply to neural networks.
A good reference for the classical treatment of the MDL is
Grünwald, P. D. (2007). The Minimum Description Length Principle, MIT press.
For the relation of the MDL to Bayesian model selection and the asymptotic expansion of the free energy (in the minimally singular case) see
Balasubramanian, V. (1997). Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions, Neural Computation, 9(2), 349-368.
There has been some attempt in recent years to address the fact that the classical treatment of MDL is not applicable to singular models like neural networks:
Dwevedi, R., Singh, C., Yu, B., Wainwright, M., (2023), Revisiting minimum description length complexity in overparametrized models.
It seems that using SLT one could give a generally correct treatment of MDL. However, until such results are established, one should not presume any fundamental theoretical connection between the simplicity bias of Bayesian statistics of neural networks and any simple notion of “description length”.
2. Turing Machines With Errors
In this section we give an example where the complexity measure in the free energy asymptotic for singular statistical models (the learning coefficient) is sensitive to algorithmic structure that goes beyond the number of effective parameters.
The example is derived from a literature that has attempted to bridge semantics of logic with statistical learning theory, based on ground-breaking work of Ehrhard-Regnier on differential linear logic. We have tried to make the presentation here self-contained, but the reader can find more information in the following references:
Clift, J., (2017), Turing Machines and Differential Linear Logic, MSc thesis.
Clift, J., Murfet, D., Wallbridge, J. (2021), Geometry of Program Synthesis.
Waring, T., (2021), Geometric Perspectives on Program Synthesis and Semantics, MSc thesis.
We consider the problem of predicting the outputs of a computable generating process by finding codes for a Universal Turing Machine (UTM). We have in mind a standard kind of UTM with a description tape (where we write the code which specifies the machine to be simulated), a state tape (where the state of the simulated machine is written) and a work tape (containing the state of the tape of the simulated machine). Some input sequence is written to the work tape, a code is written to the description tape, an initial state is written to the state tape, and then the UTM proceeds to simulate the Turing Machine with code until it halts with the output on the work tape, if it halts.
We can consider this process as actually instantiated in a machine or, more abstractly, an automata. The role of error in such processes is very interesting and leads ultimately to the problem of run-time error correction in modern computing: every step of communication or processing in a computer involves some probability of error and, although small, the large number of steps and size of the messages involved means that some error correction may be necessary for meaningful computation to take place.
We suppose that some computable process in the environment is outputting strings according to some true distribution , with each symbol having some small error from the output of a TM . The problem of statistical inference is to figure out which TM it is, from a given set of input-output pairs . Of course there will be (infinitely) many TMs consistent with the set of samples , so we expect that the result of statistical inference will be a probability distribution over codes. This is what we call the Bayesian posterior . We might bias this towards shorter machines by putting a prior over the space of models (codes).
This is an old problem, but we consider it from a new angle. Rather than considering a discrete space of codes with a special prior, we extend our set of allowed models of the generating process beyond TMs (as represented by their codes) to include codes with error channels. In just the same way that each “organ″ of the automata[3] in von Neumann’s work may have some probability of error, we allow each symbol of the code of to have some probability of error when it is read by the UTM. The specification of the allowed distribution of errors for each symbol in defines a point in the space of codes-with-error-channels, which we take as the parameter space of our new extended statistical model.
Given one of these codes-with-error-channels and an input sequence the contents of the work tape of after some given number of steps depends on the errors encountered during the execution of the UTM; that is, it is a probability distribution . This is our model. Those parameters which lead to final work tapes close to the true output , in the sense that the KL divergence between the two probability distributions is small, are more highly preferred by the Bayesian posterior , which is now defined on the extended space of models .
2.1 Error as a Probe of Structure
Of course, if the generating process is computable then its distribution of outputs can be realised by a model where has no errors (for example, the code of ). So what is there to be gained by considering this larger space of models?
The distribution over outputs that results from a given distribution of possible errors in reading a symbol of the code reflects the way that the symbol is used in the computation. While this is clear informally (to understand how something works, try perturbing one of its components and see what happens) this can be given a formal interpretation in the context of differential linear logic, where there is a connection between this sensitivity analysis and the Taylor series expansion of a proof. The simplest example is that if the introduction of errors in does not affect the output distribution at all then it is reasonable to conclude that is irrelevant to the computation. This already suggests a natural connection between the local learning coefficient and the program length of whereby the posterior tends to concentrate around shorter programs which predict the dataset to the same degree of accuracy. We will turn to more interesting examples soon.
The upshot is that given a TM with code the way that the probability distribution varies when we vary near (noting that ) encodes a lot of information about the structure of the algorithm when executed on the input , via the effects of perturbing each one of the bits in the code . This variation in turn is reflected in the local geometry of the function
near , of which the local learning coefficient is a scalar invariant. The conclusion is that the algebraic geometry of the function germ should be expected to reflect the internal structure of the TM to some degree. At the moment it remains unclear how strong this connection between geometry of and internal structure of actually is.
2.2 The Example
We fix an input for which the true output is . For simplicity we assume this is a single symbol and that we judge models on their predictions for the contents of this single tape square. The contribution of the data sample to the KL divergence can be shown to be equivalent[4] to where
which is a polynomial in the coefficients specifying the probability distributions that make up . Here is the Kronecker delta which is if and zero otherwise.
Let us now consider the local behaviour of (equivalently ) at a point of parameter space which represents a TM with code . We fix a single symbol of this code and a perturbation of this code in some direction . That is, we consider a parameter which agrees with except that in the th position we replace the symbol by the distribution . For concreteness let us take the original symbol of the code to be and take to be the formal linear combination of symbols so that for small is a probability distribution representing a small chance of reading a instead of a . Then we can expand in powers of
where each is a function of only (since we assume there is no uncertainty in the other positions of ). Taking we obtain which we may assume is zero; that is, we assume that gives the correct answer for the fixed input . Thus and we have
We are interested in the order of this polynomial, that is the least such that .
To go further we have to think in more detail about how the UTM computes the probability that is, how the uncertainty in the code is propagated to uncertainty in the output. Since we run the simulated machine for some fixed number of steps, the UTM itself is run for some number of steps, and thus makes use of exactly samples from the th symbol of the code for some that is independent of . The possible trajectories of are thus dictated by the sequences of possible results and to each we assign a probability
In our example, the distribution in at the th position is
so that we need only consider sequences where samples appear, and
where is the number of times is sampled in and .
One can compute that when you run the UTM as above the resulting probability distribution over outputs is
where the sum is over all executions of making use of a sequence of samples that, on input , leads to the output symbol equal to . Here is defined by running the UTM and intervening in any step where the UTM goes to read from the th symbol of the code, so that on the th read for the UTM sees the symbol .
Combining the above equations
which is zero since when we have and so the first factor in every summand vanishes. Thus . For
For
Recall that is where is the number of ’s appearing in the sequence . Thus is the number of s sampled in the sequence , that is, the number of errors. The derivative is nonzero only when which for is only possible (given that and ) if . The only “paths″ or sequences that contribute to the -fold derivative are ones which contain errors. Substituting in the above yields
where we have used that
We can use this formula to compute the coefficient .
Definition. We say that the Turing Machine with code is robust to errors on input in position if, for any execution path of the UTM initialised with on the description tape and on the work tape involving errors, .
Lemma. If is robust to errors on input in position then for . That is, the order of in is strictly greater than .
Proof. We set . If is robust to errors in the stated sense, then it is clear from the above that when and since it is never true that when is such that . In the case where and
Hence for any and any we have . Thus for
since every summand involves a term with .
Example. Since any TM is robust to zero errors, we always have . If is robust to one error on input then , so the polynomial is at least cubic in .
If is robust to errors on every input then has order in , from which we infer that has order in locally at . If are the only allowed symbols in the code at this position then the local learning coefficient satisfies .
2.3 Summary
The example illustrates one simple way in which the structure of a TM with code (in this case, the capability to recover from read errors in its specification during run-time) influences the geometry of the function germ . We expect there are many further forms of degeneracy, which for example involve sophisticated interactions between bits of the code.
In a minimally singular model every used parameter costs (in the sense that it increases by that amount) and so our intuition might suggest that in the current setting every used bit in the code should cost . It is true that every used bit costs at most but we have just seen that not all bits are created equal: a bit which is “error-corrected” in the sense that when the UTM executes it is robust to errors in that bit, only costs . The principle of minimisation of free energy therefore suggests that, all else being equal, the Bayesian posterior will prefer codes where the bits are error-corrected in this way. That is, two codes of equal length and matching the outputs of the generating process equally well, may nonetheless have neighbourhoods assigned different probability by the posterior, if one of them has error-correction and the other does not. Thus “simplicity″ (in the sense of ) includes robustness to errors.
We note that it is straightforward to provide a TM with run-time error-correction at the cost of execution time, by running the program multiple times and applying a majority vote to determine the answer. More sophisticated schemes are possible, and there is a literature that has worked out various ways of doing this.
3. Conclusion
The relation between description length and simplicity biases in Bayesian statistics is well-known, but is a phenomena that is confined to regular models and this class of models does not include neural networks. We do not yet possess conceptually simple general intuitions about the inductive biases of Bayesian statistics. In this note we have exhibited one case in which the simplicity bias is more exotic.
3.1 Questions and Open Problems
The analysis in this note is preliminary. The set of people working on ideas like this is small, and if you have relevant background in mathematics or ML you could probably figure out something useful. Notes:
It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of “backup heads” observed in transformers seems like a good candidate. Can you think of others?
It seems that in deep RL, policies which take effective control of the environment might be able to achieve a kind of run-time error-correction which allows them to simplify their policies and thus minimise the free energy. This might lead to a connection between simplicity in Bayesian statistics and the emergence of goal-directedness.
Simon Pepin Lehalleur (henceforth SPLH) asks “It would be remarkable if some related fact or downstream consequence of this observation had not been observed somewhere in the literature on error-correction and information theory?” There is an extensive literature on error-correction in naturally occurring computational systems. Interesting observations can be found in “Noisy dynamical systems evolve error correcting codes and modularity” by T. McCourt et al, “Resource savings from fault-tolerant circuit design” by A. K. Tan and I. L. Chuang, “Biological error correction codes generate fault-tolerant neural networks” by A. Zlokapa et al. I have only a superficial familiarity with the literature, but it seems to me one could make a career out of bringing modern mathematical techniques to bear on this field.
SPLH suggests the modest open problem of proving this is a consequence of SLT :)
- ^
Simon Pepin Lehalleur says: “defined implicitly” is another common way to get at the same idea. Statistics is in some sense all about implicit definitions (“Statistics is the inverse problem to probability”).
- ^
In the case where the KL divergence or loss function is locally, after a change of variables, a sum of squares in then since changing each of the for increases the loss, we refer to these parameters as “relevant” whereas changing each of the for does not change the loss so we refer to these parameters as “irrelevant”. More precisely, there is a finite range within which these parameters can be varied without changing the loss.
- ^
By an “organ” von Neumann means a fundamental unit, from which more complicated computations can be built.
- ^
In the sense that there exist such that with we have . In particular the local learning coefficients of and agree.
- AXRP Episode 31 - Singular Learning Theory with Daniel Murfet by 7 May 2024 3:50 UTC; 72 points) (
- 11 Mar 2024 13:38 UTC; 2 points) 's comment on Many arguments for AI x-risk are wrong by (
[minor]
I’m a bit confused or I think you need some additional caveats in the intro. I would have said that Bayesian statistics with typical models is well understood as bias toward short description length but there are important caveats in the neural network case, at least conceptually.
(That said, minimum description length is off by constant factor, but this doesn’t seem to be what you’re getting at.)
You say:
And
But later you say:
Maybe I can clarify a few points here:
A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular.
Regular models are the ones for which there is a link between low description length and low free energy (i.e. the class of models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy).
It’s not really accurate to describe regular models as “typical”, especially not on LW where we are generally speaking about neural networks when we think of machine learning.
It’s true that the example presented in this post is, potentially, not typical (it’s not a neural network nor is it a standard kind of statistical model). So it’s unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold “with some caveats” for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
I don’t claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than “something like description length”.
I’ve heard thirdhand (?) of a transformer whose sublayers L(h)=o will dampen their outputs when o is added to that sublayer’s input. IE there might be a “target” amount of o to have in the residual stream after that sublayer, and the sublayer itself somehow responds to ensure that happens?
If there was some abnormality and there was already a bunch of o present, then the sublayer “error corrects” by shrinking its output.
I’m surprised to see an application of the Banach fixed-point theorem as an example of something that’s too implicit from the perspective of a computer scientist. After all, real quantities can only be represented in a computer as a sequence of approximations — and that’s exactly what the theorem provides.
I would have expected you to use, say, the Brouwer fixed-point theorem instead, because Brouwer fixed points can’t be computed to arbitrary precision in general.
(I come from a mathematical background, fwiw.)
This comment of mine is a bit cheeky, since there are plenty of theoretical computer scientists who think about characterising terms as fixed points, and logic programming is a whole discipline that is about characterising the problem rather than constructing a solution, but broadly speaking I think it is true among less theoretically-minded folks that “program” means “thing constructed step by step from atomic pieces”.
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?