Does the Solomonoff Prior Double-Count Simplicity?
Question: I’ve noticed what seems like a feature of the Solomonoff prior that I haven’t seen discussed in any of the intros I’ve read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren’t simpler programs already exponentially favored through their multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting, e.g., a video feed of a whirlpool, represented as a bit string $x$. The prior probability for any such string is given by:

$$P(x)=\sum_{p\,:\,U(p)=x}\frac{1}{2^{|p|}}$$

where $p$ ranges over programs for a prefix-free universal Turing machine.
Observation: If we have a simple one-kilobit program $p_1$ that outputs prediction $x_1$, we can construct nearly $2^{1000}$ different two-kilobit programs that also output $x_1$ by appending arbitrary “dead code” that never executes.
For example:
DEADCODE="[arbitrary 1-kilobit string]"
[original 1-kilobit program p1]
EOF
Here, programs aren’t allowed to have anything follow EOF, which ensures the prefix-free requirement is satisfied.
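As a toy illustration of this construction (the `DEADCODE="..."`/`EOF` syntax is hypothetical, not that of any real UTM), here is a sketch showing that every filler string yields a syntactically distinct program with identical behaviour:

```python
import random

def pad_with_dead_code(program: str, filler: str) -> str:
    """Wrap `program` in a dead-code block that is never executed.

    Hypothetical syntax: everything between DEADCODE=" and " is ignored,
    and EOF closes the program so the encoding stays prefix-free.
    """
    return f'DEADCODE="{filler}" {program} EOF'

original = "<the 1-kilobit program p1>"  # placeholder, not a real program

# Spot-check a handful of the 2**1000 possible 1-kilobit fillers: each gives
# a distinct program text whose behaviour is still exactly that of p1.
fillers = {"".join(random.choice("01") for _ in range(1000)) for _ in range(5)}
padded = {pad_with_dead_code(original, f) for f in fillers}
print(len(padded), "distinct padded variants sampled")
```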
If we compare $p_1$ against another two-kilobit program $p_2$ outputting a different prediction $x_2$, the prediction $x_1$ from $p_1$ gets $2^{1000-|G|}$ more contributions in the sum, where $|G|$ is the very small number of bits we need to delimit the DEADCODE garbage string. So we are automatically giving $x_1$ roughly $2^{1000}$ times higher probability, even before applying the length penalty $\frac{1}{2^{|p|}}$. $p_1$ has fewer ‘burdensome details’, so it has more functionally equivalent implementations. Its predictions already seem to be favored exponentially in its length $|p_1|$ through this multiplicity alone.
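Spelling out where the $2^{1000-|G|}$ count comes from (my reading of the example above): a two-kilobit program built this way spends $1000$ bits on $p_1$ itself and $|G|$ bits on the delimiters, leaving $1000-|G|$ filler bits that can be chosen freely:

$$\underbrace{1000}_{p_1}+\underbrace{|G|}_{\text{delimiters}}+\underbrace{1000-|G|}_{\text{free filler bits}}=2000\quad\Longrightarrow\quad 2^{1000-|G|}\ \text{two-kilobit programs outputting }x_1.$$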
So, if we chose a different prior from the Solomonoff prior, one which just assigned uniform probability to all programs below some very large cutoff, say $10^{90}$ bytes:

$$P(x)=\sum_{p\,:\,U(p)=x,\ |p|\le 10^{90}}\frac{1}{2^{10^{90}}}$$

and then followed the exponential decay of the Solomonoff prior for programs longer than $10^{90}$ bytes, wouldn’t that prior act barely differently from the Solomonoff prior in practice? It would still exponentially prefer predictions with a shorter minimum message length.[1]
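For intuition, here is a small numerical sketch of that claim with toy stand-ins of my own (10- and 20-bit “minimal” programs in place of 1 kb and 2 kb, a 2-bit delimiter for $|G|$, dead code of every length up to a 40-bit cutoff in place of $10^{90}$, and a fixed-size delimiter rather than a properly self-delimiting one):

```python
from fractions import Fraction

def solomonoff_mass(L, g, cutoff):
    """Sum of 2^-|p| over the minimal L-bit program and all padded variants up to `cutoff`.
    A padded variant is L program bits + g delimiter bits + d dead-code bits (2**d choices)."""
    mass = Fraction(1, 2**L)
    for d in range(1, cutoff - L - g + 1):
        mass += 2**d * Fraction(1, 2 ** (L + g + d))
    return mass

def program_count(L, g, cutoff):
    """Number of such programs up to `cutoff`; a uniform prior weights each one equally."""
    return 1 + sum(2**d for d in range(1, cutoff - L - g + 1))

L1, L2, g, cutoff = 10, 20, 2, 40  # toy stand-ins for 1 kilobit, 2 kilobits, |G|, 10^90

print("uniform-cutoff ratio P(x1)/P(x2):", program_count(L1, g, cutoff) / program_count(L2, g, cutoff))
print("Solomonoff-style ratio P(x1)/P(x2):", float(solomonoff_mass(L1, g, cutoff) / solomonoff_mass(L2, g, cutoff)))
print("2**(L2 - L1) =", 2 ** (L2 - L1))
```

Both ratios come out within a small constant factor of $2^{L_2-L_1}=1024$: the uniform-cutoff prior gets its exponential preference for $x_1$ purely from the program counts, while the explicit $2^{-|p|}$ weighting only changes the answer by an $O(1)$ factor.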
Am I missing something here?
Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I’m trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.
Yes, you are missing something.
Any DEADCODE that can be added to a 1kb program can also be added to a 2kb program. The net effect is a wash, and you will end up with a $2^{1000}$ ratio between the priors.
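One way to spell this out: pair each padded variant of $p_1$ with the variant of $p_2$ that uses the same filler string $D$. Every such pair of programs has the same weight ratio,

$$\frac{2^{-(1000+|G|+|D|)}}{2^{-(2000+|G|+|D|)}}=2^{1000},$$

so the summed contributions from $p_1$, $p_2$ and their padded variants stand in that same $2^{1000}$ ratio.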
Why aren’t there $2^{1000}$ fewer programs with such dead code and a total length below $10^{90}$ for $p_2$, compared to $p_1$?
There are, but what does having a length below $10^{90}$ have to do with the Solomonoff prior? There’s no upper bound on the length of programs.
https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighting them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.
This resolves a remaining confusion I had about the connection between old-school information theory and SLT. It apparently shows that a uniform prior over the parameters (programs) of some fixed-size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
I think you are broadly right.
But note that under the Solomonoff prior, each of these programs with DEADCODE also carries only a $2^{-2000-|G|}$ weight. So with this consideration, the weight changes from $2^{-1000}$ (for the normal $p_1$) to $2^{-1000}(1+2^{-|G|})$ (the normal $p_1$ plus the $2^{1000}$ DEADCODE versions of $p_1$), which is not a huge change.
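Writing the sum out with the numbers from the example: the $2^{1000}$ padded variants are each $2000+|G|$ bits long, so the total weight behind $x_1$ from $p_1$ and its DEADCODE versions becomes

$$2^{-1000}+2^{1000}\cdot 2^{-(2000+|G|)}=2^{-1000}\left(1+2^{-|G|}\right).$$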
For your case of “uniform probability until $10^{90}$” I think you are right about exponential decay.
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don’t explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.
Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
The number of programs of length at most n increases exponentially with n. Therefore any probability measure over them must decrease at least exponentially with length. That is, exponential decay is the least possible penalisation of length.
This is also true of the number of minimal programs of length at most n, hence the corresponding conclusion. (Proof: for each string S, consider the minimal program that writes S and halts. These programs are all different. Their sizes are no more than length(S)+c, where c is the fixed overhead of writing a program with S baked into it. Therefore exponentiality.)
I’ve written “at most n” instead of simply “n”, to guard against quirks like a programming language in which all programs are syntactically required to e.g. have even length, or deep theorems about the possible lengths of minimal programs.
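One way to make the counting step in the first paragraph explicit: for any probability measure $P$ over programs and any $k$, at most $2^k$ programs can each receive probability $2^{-k}$ or more, since otherwise the total mass would exceed 1:

$$\#\{\,p : P(p)\ge 2^{-k}\,\}\;\le\;2^{k}.$$

So of the roughly $2^n$ programs of length $n$ (up to the syntactic quirks just mentioned), at most a $2^{-c}$ fraction can receive weight $2^{-(n-c)}$ or more, for any $c$: whatever prior we pick, typical programs of length $n$ are penalised at least exponentially in $n$.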