I think I agree that SLT doesn’t offer an explanation of why NNs have a strong simplicity bias, but I don’t think you have provided an explanation for this either?
Here’s a simple story for why neural networks have a bias towards low-complexity functions (I think it’s just spelling out your proposed explanation in more detail):
Since the Kolmogorov complexity of a function f(x) is (up to a constant offset) equal to the minimum description length of the function, it is upper bounded by any particular way of describing the function, including by first specifying a parameter-function map, and then specifying the region of parameter space corresponding to the function. That means:
$$K(f) \le \ell(M) + \ell(f \mid M) + O(1)$$
where $\ell(M)$ is the minimum description length of the parameter-function map, $\ell(f \mid M)$ is the minimum description length required to specify $f$ given $M$, and the $O(1)$ term comes from the fact that K complexity is only defined up to switching between UTMs. Specifying $f$ given $M$ entails specifying the region of parameter space corresponding to $f$, defined by $W_f = \{w \mid M(w) = f\}$. Since we can use each bit in our description of $f$ to divide the parameter space in half, we can upper bound the MDL of $f$ given $M$ by $\ell(f \mid M) \le -\log_2 |W_f| + \log_2 |W| + O(1)$[1], where $|W|$ denotes the size of the overall parameter space. This means that, at least asymptotically in $K(f)$, we arrive at
$$\frac{|W_f|}{|W|} \le 2^{-K(f) + O(1)}.$$
This is (roughly) a hand-wavey version of the Levin Coding Theorem (a good discussion can be found here). If we assume a uniform prior over parameter space, so that the prior probability $\phi(f)$ the map assigns to $f$ equals $|W_f|/|W|$, then $\phi(f) \le 2^{-K(f)+O(1)}$. In words, this means that the prior assigned by the parameter-function map to complex functions must be small. Now, the average probability assigned to each function in the set of possible outputs of the map is $1/N$, where $N$ is the number of such functions. Since there are only around $2^{K_{\max}}$ functions with K complexity at most $K_{\max}$, the highest K complexity of any function in the model class must be at least $\log_2 N$; so, for simple parameter-function maps, the most complex function in the model class must be assigned prior probability at most $2^{-\log_2 N + O(1)}$, i.e. no more than the average prior. Therefore, if the parameter-function map assigns different probabilities to different functions at all, it must be biased against complex functions (modulo the $O(1)$ term)!
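Assembled in one place (this is just restating the steps above, with $\phi$ the uniform prior on parameter space, so $\phi(f) = |W_f|/|W|$):

```latex
\begin{align*}
K(f) &\le \ell(M) + \ell(f \mid M) + O(1)
     && \text{describe $M$, then $f$ given $M$} \\
     &\le \ell(M) - \log_2 \frac{|W_f|}{|W|} + O(1)
     && \text{each bit halves the remaining volume} \\
\implies \quad \frac{|W_f|}{|W|} &\le 2^{\ell(M) + O(1)} \cdot 2^{-K(f)}
     && \text{$\ell(M)$ is a constant of the map}
\end{align*}
```

The last line makes explicit how $\ell(M)$, being a constant of the map, gets absorbed into the $O(1)$.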
But this story doesn’t pick out deep neural network architectures as better parameter-function maps than any other. So what would make a parameter-function map bad? Well, for a start, the $O(1)$ term includes $\ell(M)$: we can always choose a pathologically complicated parameter-function map which, by design, assigns large prior to some particular highly complex functions. But even ignoring that, there are still low-complexity maps with very poor generalisation, for example polyfits. That’s because the expression we derived is only an upper bound: there is no guarantee that it is tight for any particular choice of parameter-function map. Indeed, across a wide range of real parameter-function maps, the tightness of this bound varies dramatically:
This figure (from here) shows scatter plots of (an upper-bound estimate of) the K complexity of a large set of functions against the prior assigned to them by a particular choice of parameter-function map.
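As a rough illustration of how such a scatter could be produced, here is a hedged sketch (not the linked paper’s setup: it uses a tiny tanh network on 7-bit boolean inputs, empirical sampling frequency as the prior estimate, and zlib-compressed length of the truth table as a crude stand-in for the K complexity upper bound):

```python
# Hedged sketch (not the linked paper's exact setup): sample a small
# parameter-function map on boolean inputs, estimate the prior phi(f) of each
# realised function by its sampling frequency, and use compressed length of
# the truth table as a crude upper-bound proxy for Kolmogorov complexity.
import itertools
import zlib
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
n_inputs = 7                                   # 2^7 = 128 input points
X = np.array(list(itertools.product([0, 1], repeat=n_inputs)), dtype=float)

def sample_function(width=16):
    """Draw random weights for a one-hidden-layer tanh net; return the induced
    boolean function as a 128-character truth-table string."""
    W1 = rng.normal(size=(n_inputs, width))
    b1 = rng.normal(size=width)
    w2 = rng.normal(size=width)
    out = np.tanh(X @ W1 + b1) @ w2
    return ''.join('1' if o > 0 else '0' for o in out)

n_samples = 100_000
counts = Counter(sample_function() for _ in range(n_samples))

print(f"{len(counts)} distinct functions hit in {n_samples} samples")
for f, c in counts.most_common(10):
    prior = c / n_samples                      # empirical estimate of phi(f)
    k_proxy = len(zlib.compress(f.encode()))   # crude stand-in for K(f)
    print(f"{f[:16]}...  prior~{prior:.4f}  compressed_len={k_proxy}")
```

If the same bias holds at this toy scale, the most frequently hit functions should tend to have shorter compressed truth tables, which is the qualitative pattern the figure shows.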
It seems, then, that the question of why neural network architectures have a good simplicity bias compared to other architectures is not about why they avoid assigning high volume/prior to extremely complicated functions (that much is satisfied by every simple parameter-function map), but about why, relative to other parameter-function maps, there are so few simple functions to which they assign low prior: why the bottom left of these plots is less densely occupied, or occupied by less ‘useful’ functions, for NN architectures than for other architectures. Of course, we know there are simple functions that the NN inductive bias hates (for example, simple functions containing a for loop cannot easily be expressed by a feed-forward NN), but we’d like to explain why NNs have fewer such ‘blind spots’ than other architectures. I don’t think your proposed solution addresses this part of the question?
Where SLT fits in is to provide a tool for quantifying $|W_f|$ for any particular $f$. That is, SLT provides a sort of ‘cause’ for why different functions occupy regions of parameter space of different sizes: namely, the size of $W_f$ can be measured by counting a sort of effective number of parameters present at a particular choice $w \in W_f$[2]. Put another way, SLT says that if you specify $W_f$ by using each bit in your description to cut $W$ in half, then it will sort-of take $\hat\lambda(w^*_f)$ bits (the local learning coefficient at the most singular point in parameter space that maps to $f$) to specify $W_f$, so $K(f) \le \kappa \, \hat\lambda(w^*_f)$ for some constant $\kappa$ that is independent of $f$.
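To make the ‘counting effective parameters’ picture concrete, here is a hedged toy sketch (nothing from the post or from a real network: just two hand-picked two-parameter losses, with the sublevel-set volume scaling $\mathrm{vol}\{L < \epsilon\} \sim \epsilon^{\lambda}$ read off by Monte Carlo as a stand-in for the learning coefficient):

```python
# Toy sketch: the "size" of a near-optimal region of parameter space scales as
# vol{L < eps} ~ eps^lambda, and the exponent lambda acts like half an
# effective parameter count.  We Monte-Carlo this exponent for two
# two-parameter toy losses: a regular one (lambda = d/2 = 1) and a singular
# one, L = a^2 * b^4, whose zero set is the pair of axes (lambda = 1/4).
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))  # uniform prior on a box
a, b = w[:, 0], w[:, 1]

experiments = {
    # name: (loss values, eps grid chosen so the Monte-Carlo counts stay sane)
    "regular  a^2 + b^2": (a**2 + b**2, np.logspace(-4, -1, 7)),
    "singular a^2 * b^4": (a**2 * b**4, np.logspace(-8, -4, 9)),
}

for name, (L, eps) in experiments.items():
    vol = np.array([(L < e).mean() for e in eps])   # fraction of the box
    slope, _ = np.polyfit(np.log(eps), np.log(vol), 1)
    print(f"{name}:  fitted exponent ~ {slope:.2f}")
    # expect ~ 1.0 for the regular loss and ~ 0.25 for the singular one
```

The singular loss occupies far more near-zero volume than its two nominal parameters would suggest, which is (roughly) the sense in which $\hat\lambda$ counts effective parameters.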
So your explanation says that any parameter-function map is biased towards low-complexity functions, and SLT contributes a way to estimate the size of the region of parameter space assigned to a particular function, but neither addresses the question of why neural networks have a simplicity bias that is stronger than that of other parameter-function maps.
[1] Actually, I am pretty unsure how to do this properly. It seems like the number of bits required to specify that a point lies inside some region of a space really ought to depend only on the fraction of the space occupied by that region, but I don’t know how to ensure this in general; I’d be keen to know how to do this. For example, if I have a 2D parameter space (bounded, so a large square), and $W_1$ is a random $10 \times 10$ square while $W_2$ is a union of 100 randomly placed $1 \times 1$ squares, does it take the same number of bits to find my way into either (remember, I don’t need to fully describe the region, just specify that I am inside it)? Or, even more simply: if $W_3$ is the set of points within distance $\delta$ of the line $y = 5$, I can specify that I am in the region by giving the $y$ coordinate to resolution $\delta$, so $\ell(W_3) = -\log_2 \delta + O(1)$. If $W_4$ is the set of points within distance $\delta$ of the line $y = x$, how do I specify that I am within $W_4$ in a number of bits that is asymptotically equal to $\ell(W_3)$ as $\delta \to 0$?
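For what it’s worth, here is a hedged numerical check of just the volume side of that intuition: Monte Carlo estimates of the four regions’ area fractions. It does not construct the explicit membership codes the question asks for, and the $100 \times 100$ bounding square and $\delta = 0.5$ are arbitrary choices for illustration.

```python
# Hedged check of the volume side of the intuition above: Monte-Carlo the area
# fractions of the four example regions and report -log2(fraction).  This does
# not construct explicit membership codes; the bounding square [0,100]^2 and
# delta = 0.5 are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
S = 100.0
pts = rng.uniform(0, S, size=(2_000_000, 2))
x, y = pts[:, 0], pts[:, 1]
delta = 0.5

# W1: one randomly placed 10x10 square
cx, cy = rng.uniform(0, S - 10, size=2)
in_W1 = (x > cx) & (x < cx + 10) & (y > cy) & (y < cy + 10)

# W2: union of 100 randomly placed 1x1 squares (overlaps are negligible here)
in_W2 = np.zeros(len(x), dtype=bool)
for ux, uy in rng.uniform(0, S - 1, size=(100, 2)):
    in_W2 |= (x > ux) & (x < ux + 1) & (y > uy) & (y < uy + 1)

# W3: within delta of the line y = 5;  W4: within delta of the line y = x
in_W3 = np.abs(y - 5) < delta
in_W4 = np.abs(y - x) / np.sqrt(2) < delta

for name, mask in [("W1", in_W1), ("W2", in_W2), ("W3", in_W3), ("W4", in_W4)]:
    frac = mask.mean()
    print(f"{name}: fraction ~ {frac:.3e},  -log2(fraction) ~ {-np.log2(frac):.2f} bits")

# W1 and W2 occupy (essentially) the same fraction, and W3 and W4 differ only
# by an O(1) factor (the diagonal band is sqrt(2) times longer), so the
# -log2(fraction) values agree up to O(1), consistent with the delta -> 0 claim.
```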
[2] In fact, at some imperfect resolution (equivalently, with a finite number of datapoints) we might want to treat a set of very similar functions as the same; then the best point in parameter space at which to count effective parameters is one that maps to the function which gets the lowest loss in the limit of infinite data.