Does the Solomonoff Prior Double-Count Simplicity?
Question: I’ve noticed what seems like a feature of the Solomonoff prior that I haven’t seen discussed in any of the intros I’ve read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren’t simpler programs already exponentially favored through their multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting, e.g., a video feed of a whirlpool, represented as a bit string $x$. The prior probability for any such string is given by:

$$P(x)=\sum_{p\,:\,U(p)=x}\frac{1}{2^{|p|}}$$

where $p$ ranges over programs for a prefix-free universal Turing machine.
Observation: If we have a simple one-kilobit program $p_1$ that outputs prediction $x_1$, we can construct nearly $2^{1000}$ different two-kilobit programs that also output $x_1$ by appending arbitrary “dead code” that never executes.
For example:
DEADCODE="[arbitrary 1-kilobit string]"
[original 1-kilobit program p1]
EOF
Here, programs aren’t allowed to have anything follow EOF, which ensures the prefix-free requirement is satisfied.
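As a toy illustration of this construction (the `DEADCODE="..."`/`EOF` syntax is hypothetical, not that of any real UTM), here is a sketch showing that every filler string yields a syntactically distinct program with identical behaviour:

```python
import random

def pad_with_dead_code(program: str, filler: str) -> str:
    """Wrap `program` in a dead-code block that is never executed.

    Hypothetical syntax: everything between DEADCODE=" and " is ignored,
    and EOF closes the program so the encoding stays prefix-free.
    """
    return f'DEADCODE="{filler}" {program} EOF'

original = "<the 1-kilobit program p1>"  # placeholder, not a real program

# Spot-check a handful of the 2**1000 possible 1-kilobit fillers: each gives
# a distinct program text whose behaviour is still exactly that of p1.
fillers = {"".join(random.choice("01") for _ in range(1000)) for _ in range(5)}
padded = {pad_with_dead_code(original, f) for f in fillers}
print(len(padded), "distinct padded variants sampled")
```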
If we compare $p_1$ against another two-kilobit program $p_2$ outputting a different prediction $x_2$, the prediction $x_1$ from $p_1$ gets $2^{1000-|G|}$ more contributions in the sum, where $|G|$ is the very small number of bits we need to delimit the DEADCODE garbage string. So we are automatically giving $x_1$ roughly $2^{1000}$ times higher probability, even before applying the length penalty $\frac{1}{2^{|p|}}$. $p_1$ has fewer ‘burdensome details’, so it has more functionally equivalent implementations. Its predictions already seem to be favored exponentially in its length $|p_1|$ through this multiplicity alone.
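Spelling out where the $2^{1000-|G|}$ count comes from (my reading of the example above): a two-kilobit program built this way spends $1000$ bits on $p_1$ itself and $|G|$ bits on the delimiters, leaving $1000-|G|$ filler bits that can be chosen freely:

$$\underbrace{1000}_{p_1}+\underbrace{|G|}_{\text{delimiters}}+\underbrace{1000-|G|}_{\text{free filler bits}}=2000\quad\Longrightarrow\quad 2^{1000-|G|}\ \text{two-kilobit programs outputting }x_1.$$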
So, if we chose a different prior from the Solomonoff prior, one which just assigned uniform probability to all programs below some very large cutoff, say $10^{90}$ bytes:

$$P(x)=\sum_{p\,:\,U(p)=x,\ |p|\le 10^{90}}\frac{1}{2^{10^{90}}}$$

and then followed the exponential decay of the Solomonoff prior for programs longer than $10^{90}$ bytes, wouldn’t that prior act barely differently from the Solomonoff prior in practice? It would still exponentially prefer predictions with a shorter minimum message length.[1]
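For intuition, here is a small numerical sketch of that claim with toy stand-ins of my own (10- and 20-bit “minimal” programs in place of 1 kb and 2 kb, a 2-bit delimiter for $|G|$, dead code of every length up to a 40-bit cutoff in place of $10^{90}$, and a fixed-size delimiter rather than a properly self-delimiting one):

```python
from fractions import Fraction

def solomonoff_mass(L, g, cutoff):
    """Sum of 2^-|p| over the minimal L-bit program and all padded variants up to `cutoff`.
    A padded variant is L program bits + g delimiter bits + d dead-code bits (2**d choices)."""
    mass = Fraction(1, 2**L)
    for d in range(1, cutoff - L - g + 1):
        mass += 2**d * Fraction(1, 2 ** (L + g + d))
    return mass

def program_count(L, g, cutoff):
    """Number of such programs up to `cutoff`; a uniform prior weights each one equally."""
    return 1 + sum(2**d for d in range(1, cutoff - L - g + 1))

L1, L2, g, cutoff = 10, 20, 2, 40  # toy stand-ins for 1 kilobit, 2 kilobits, |G|, 10^90

print("uniform-cutoff ratio P(x1)/P(x2):", program_count(L1, g, cutoff) / program_count(L2, g, cutoff))
print("Solomonoff-style ratio P(x1)/P(x2):", float(solomonoff_mass(L1, g, cutoff) / solomonoff_mass(L2, g, cutoff)))
print("2**(L2 - L1) =", 2 ** (L2 - L1))
```

Both ratios come out within a small constant factor of $2^{L_2-L_1}=1024$: the uniform-cutoff prior gets its exponential preference for $x_1$ purely from the program counts, while the explicit $2^{-|p|}$ weighting only changes the answer by an $O(1)$ factor.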
Am I missing something here?
Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I’m trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.
Yes, you are missing something.
Any DEADCODE that can be added to a 1kb program can also be added to a 2kb program. The net effect is a wash, and you will end up with a $2^{1000}$ ratio between the priors.
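One way to spell this out: pair each padded variant of $p_1$ with the variant of $p_2$ that uses the same filler string $D$. Every such pair of programs has the same weight ratio,

$$\frac{2^{-(1000+|G|+|D|)}}{2^{-(2000+|G|+|D|)}}=2^{1000},$$

so the summed contributions from $p_1$, $p_2$ and their padded variants stand in that same $2^{1000}$ ratio.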
Why aren’t there $2^{1000}$ fewer programs with such dead code and a total length below $10^{90}$ for $p_2$, compared to $p_1$?
There are, but what does having a length below $10^{90}$ have to do with the Solomonoff prior? There’s no upper bound on the length of programs.
https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighting them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.
This resolves a remaining confusion I had about the connection between old-school information theory and SLT. It apparently shows that a uniform prior over the parameters (programs) of some fixed-size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
I think you are broadly right.
But note that under the Solomonoff prior, each of these programs with DEADCODE also carries only a $2^{-2000-|G|}$ weight. So with this consideration, the weight changes from $2^{-1000}$ (for the normal $p_1$) to $2^{-1000}(1+2^{-|G|})$ (the normal $p_1$ plus the $2^{1000}$ DEADCODE versions of $p_1$), which is not a huge change.
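Writing the sum out with the numbers from the example: the $2^{1000}$ padded variants are each $2000+|G|$ bits long, so the total weight behind $x_1$ from $p_1$ and its DEADCODE versions becomes

$$2^{-1000}+2^{1000}\cdot 2^{-(2000+|G|)}=2^{-1000}\left(1+2^{-|G|}\right).$$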
For your case of “uniform probability until $10^{90}$” I think you are right about exponential decay.
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don’t explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.
Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
The number of programs of length at most n increases exponentially with n. Therefore any probability measure over them must decrease at least exponentially with length. That is, exponential decay is the least possible penalisation of length.
This is also true of the number of minimal programs of length at most n, hence the corresponding conclusion. (Proof: for each string S, consider the minimal program that writes S and halts. These programs are all different. Their sizes are no more than length(S)+c, where c is the fixed overhead of writing a program with S baked into it. Therefore exponentiality.)
I’ve written “at most n” instead of simply “n”, to guard against quirks like a programming language in which all programs are syntactically required to e.g. have even length, or deep theorems about the possible lengths of minimal programs.
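One way to make the counting step in the first paragraph explicit: for any probability measure $P$ over programs and any $k$, at most $2^k$ programs can each receive probability $2^{-k}$ or more, since otherwise the total mass would exceed 1:

$$\#\{\,p : P(p)\ge 2^{-k}\,\}\;\le\;2^{k}.$$

So of the roughly $2^n$ programs of length $n$ (up to the syntactic quirks just mentioned), at most a $2^{-c}$ fraction can receive weight $2^{-(n-c)}$ or more, for any $c$: whatever prior we pick, typical programs of length $n$ are penalised at least exponentially in $n$.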